

The entire body of research literature is currently estimated at 100-150 million publications, growing by around 1.5 million each year. Research literature constitutes the most complete representation of knowledge we have assembled as a human species. It enables us to develop cures for diseases, solve difficult engineering problems and address many of the challenges the world faces today. Systematically reading and analysing the full body of knowledge is now beyond the capacity of any human being. Consequently, it is important to better understand how we can leverage Natural Language Processing/Text Mining techniques to aid knowledge creation and improve the process by which research is done.

This workshop aims to bring together people from different backgrounds who:

  1. have experience with analysing and mining databases of scientific publications,
  2. develop systems that enable such analysis and mining of scientific databases (especially those who run publication databases) or
  3. develop novel technologies that improve the way research is being done.


The topics of the workshop will be organised around the following themes:

  1. The whole ecosystem of infrastructures including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text and data mining.
  3. Analysis of large databases of scientific publications to identify research trends and high-impact work, and to improve access to research content.

Topics of interest relevant to theme 1 include but are not limited to:

  • Infrastructures including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs for accessing scientific publications and/or research data. The existence of datasets, services, systems and APIs (in particular those that are open) providing access to large volumes of scientific publications and research data is an essential prerequisite for researching and developing new technologies that can transform the way people do research. We invite papers presenting innovative approaches to the development of systems that enable people to access these databases and carry out their analysis. Papers addressing Open Access are of special interest. We also welcome submissions discussing the technical aspects of supporting Open Science, in particular reproducibility of research, sharing of scientific workflows and linking research data with publications. Finally, we invite papers discussing issues and current challenges in the design of these systems.

Topics of interest relevant to theme 2 include but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure, etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic is related to the aspect of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state of the art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users to contribute, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) which tasks in the domain of organising scientific publications crowdsourcing is suitable for and where it might fail, and (d) other crowdsourcing topics relevant to the domain of scientific publications.

Topics of interest relevant to theme 3 include but are not limited to:

  • New methods, models and innovative approaches for measuring impact of publications. The most widely used metrics for measuring impact are based on citations. However, citation counts do not take into account the publication content or the qualitative nature of the citation. In addition, there is a delay between publication and measurable impact in citations. We particularly encourage papers addressing new ways of using textual resources to evaluate a publication's importance, for example by detecting the (textual) novelty or contribution of a work, or by automatically classifying citation types, sentiment or influence.
  • New methods for measuring performance of researchers or research groups. Methods for assessing the impact of a publication can often be extended to assess the impact of individual researchers. However, there are also other criteria for measuring impact in addition to publications, such as the development and publication of research data and economic and market impact, that should also be taken into account. We welcome papers addressing these aspects.
  • Methods for identifying research trends and cross-fertilization between research disciplines. Identifying research trends should make it possible to discover newly emerging disciplines and help explain why certain fields are attracting the attention of a wider research community. Such monitoring is important for research funders and governments so that they can respond quickly to new developments. We invite papers discussing new methods for identifying trends and cross-fertilization between research disciplines using methods ranging from social network analysis and text- and data-mining to innovative visualization approaches.
  • Applications and case studies of mining from scientific databases and publications. New methods and models developed for mining from scientific publications can be applied in many different scenarios, such as improving access to scientific publications, providing exploratory search in digital collections and identifying experts. We encourage papers describing innovative approaches that use scientific publications and data to solve real-world (discipline-specific) problems.
  • Exploratory search and recommender systems for research. This topic addresses research that improves access to very large collections of research publications and thereby improves the way the research process is conducted.

Special Open Publications Dataset Track

This year we would like to invite the workshop participants to make use of the CORE publications dataset, which contains over 8 million full texts of research papers from a wide variety of research areas. The dataset contains not only full texts but also an enriched version of the publications' metadata. It provides a framework for developing and testing methods and tools addressing the workshop topics. The use of this dataset is not mandatory, but it is encouraged. The dataset is available through the CORE portal.

In addition to offering the dataset, we are also considering running a shared task involving the use of the OpenMinTeD infrastructure for mining scientific papers.

Previous Organisation

WOSP was the first workshop to specifically address the topic of mining scientific papers at a major conference. The six previous instances of WOSP were held in conjunction with the JCDL conferences.

Additionally, we have also organised the Workshop on Scholarly Web Mining (SWM 2017), which was associated with WSDM 2017 in Cambridge, UK. The proceedings of the SWM 2017 workshop are available here.

All runs of the workshop have been extremely successful in terms of attracting submissions and participants from leading institutions in the area including Cambridge University, Microsoft, British Library, Elsevier, National Library of Medicine, Library of Congress, University of Pennsylvania (CiteSeerX), Know-Center Graz, University of Athens (OpenAIRE project) and Mendeley.

Submission Format

We invite submissions related to the workshop’s topics. Long papers should not exceed 8 pages and short papers should not exceed 4 pages in the LREC style. Furthermore, we welcome demo presentations of systems or methods. A demonstration submission should consist of a description of at most two pages of the system, method or tool to be demonstrated. All submissions will be uploaded to the START system for peer review.

The LREC proceedings template can be found on the LREC website. Papers should be submitted using the START system.

Important Dates

Wednesday, March 7, 23:59 (Hawaii time) — Submission deadline

Wednesday, March 14, 23:59 (Hawaii time) — Extended submission deadline

Saturday, April 7 — Notification of acceptance

Saturday, April 21 — Camera-ready

Monday, May 7 — Workshop

Keynote Speaker

Horacio Saggion

Large Scale Text Understanding Systems Lab, Natural Language Processing Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra

Horacio Saggion is an Associate Professor at the Department of Information and Communication Technologies, Universitat Pompeu Fabra (UPF), Barcelona. He is the head of the Large Scale Text Understanding Systems Lab, associated with the Natural Language Processing group (TALN), where he works on automatic text summarization, text simplification, information extraction, sentiment analysis and related topics. Horacio obtained his PhD in Computer Science from Université de Montréal, Canada in 2000. He obtained his BSc in Computer Science from Universidad de Buenos Aires in Argentina, and his MSc in Computer Science from UNICAMP in Brazil. He was the Principal Investigator for UPF in the EU projects Dr Inventor and Able-to-Include and is currently Principal Investigator of the national project TUNER and the Maria de Maeztu project Mining the Knowledge of Scientific Publications. Horacio has published over 150 works in leading scientific journals, conferences, and books in the field of human language technology. He organized four international workshops in the areas of text summarization and information extraction and was scientific Co-chair of STIL 2009 and scientific Chair of SEPLN 2014. He is a regular programme committee member for international conferences such as ACL, EACL, COLING, EMNLP, IJCNLP and IJCAI, and is an active reviewer for international journals in computer science, information processing, and human language technology. Horacio has given courses, tutorials, and invited talks at a number of international events including LREC, ESSLLI, IJCNLP, NLDB, and RuSSIR.

Mining and Enriching Multilingual Scientific Text Collections: Current Challenges and Opportunities

Scientists worldwide are confronted with an exponential growth in the number of scientific documents being made available. For example, Elsevier publishes over 250K scientific articles per year (or one every two minutes) and has over 7 million publications; MedLine, the most important source in biomedical research, contains 21 million scientific references; and the World Intellectual Property Organization (WIPO) contains some 70 million records. This unprecedented volume of information complicates the task of researchers, who are faced with the pressure of keeping up to date with discoveries in their own disciplines and with the challenge of searching for innovation, finding new and interesting problems to solve, checking already solved problems or hypotheses, and getting information on past and current methods, solutions or techniques. At the same time, with the rise of open science initiatives and social media, research is more connected and open, creating new opportunities but also challenges for the scientific community.

In this scenario of scientific information overload, natural language processing has a key role to play. Over the past few years we have seen the development of a number of tools for analysing the structure of scientific documents (e.g. transforming PDF to XML), extracting keywords, or classifying sentences into argumentative categories. However, deep analysis of scientific documents, such as finding key claims, assessing the argumentative quality and strength of the research, or summarizing the key contributions of a piece of work, is less common. Besides, most research in scientific text processing is carried out for the English language, neglecting both the share of scientific information available in other languages and the fact that scientific publications are often bilingual.

In this talk, I will present work carried out in our laboratory towards the development of a system for “deep” analysis and annotation of scientific text collections. Originally built for the English language, it has now been adapted to Spanish. After a brief overview of the system and its main components, I will present our current work on the development of a bilingual (Spanish and English) fully annotated text resource in the field of natural language processing that we have created with our system, together with a faceted-search and visualization system to explore the created resource.

With this scenario in mind I will speculate on the challenges and opportunities that the scientific field brings to our community not only in terms of language but also from the point of view of social media and science education.





Keynote Talk

Mining and Enriching Multilingual Scientific Text Collections: Current Challenges and Opportunities

Horacio Saggion




Long Paper

Scithon™ - An evaluation framework for assessing research productivity tools

Ronin Wu, Valentin Stauber, Viktor Botev, Jacobo Elosua, Anita Brede, Maria Ritola and Kaloyan Marinov


Long Paper

OpenMinTeD: A Platform Facilitating Text Mining of Scholarly Content

Penny Labropoulou, Dimitris Galanis, Antonis Lempesis, Mark Greenwood, Petr Knoth, Richard Eckart de Castilho, Stavros Sachtouris, Byron Georgantopoulos, Stefania Martziou, Lucas Anastasiou, Katerina Gkirtzou, Natalia Manola and Stelios Piperidis


Long Paper

Studying Uncertainty in Science: a distributional analysis through the IMRaD structure

Iana Atanassova, François-C. Rey and Marc Bertin


Poster Presentation

Exploring Textual and Social Hierarchies in Czech Sociological Articles

Radim Hladik




Long Paper

Data-driven Summarization of Scientific Articles

Nikola Nikolov, Michael Pfeiffer and Richard Hahnloser


Long Paper

Experiments in Detection of Implicit Citations

Ahmed AbuRa'ed, Luis Chiruzzo and Horacio Saggion


Short Paper

Goal-Oriented Representation of Scientific Papers

Jumana Nassour, Michael Elhadad and Arnon Sturm


Short Paper

DeepPDF: A Deep Learning Approach to Extracting Text from PDFs

Christopher Stahl, Steven Young, Drahomira Herrmannova, Robert Patton and Jack Wells




Long Paper

Investigating Domain Features For Scope Detection and Classification of Scientific Articles

Tirthankar Ghosal, Ravi Sonam, Sriparna Saha, Asif Ekbal and Pushpak Bhattacharyya


Demo Paper

An End-to-End PDF Toolchain for Marking Up Scientific Documents

Sanna Hulkkonen and Oliver Ray



Organising Committee

Petr Knoth, Knowledge Media Institute, The Open University, UK

Drahomira Herrmannova, Oak Ridge National Laboratory, USA

Richard Eckart de Castilho, Technische Universität Darmstadt, Germany

Programme Committee

Iana Atanassova, Université de Bourgogne Franche-Comté, France

Joeran Beel, Trinity College, University of Dublin, Ireland

Marc Bertin, Université Claude Bernard Lyon 1, France

Debsindhu Bhowmik, Oak Ridge National Laboratory, USA

Johan Bollen, Indiana University, USA

José Borbinha, Universidade de Lisboa, Portugal

Tanmoy Chakraborty, University of Maryland, USA

Daniel Duma, Alan Turing Institute, UK

Shang Gao, Oak Ridge National Laboratory, USA

Stephen Gilbert, Iowa State University, USA

C. Lee Giles, Pennsylvania State University, USA

Christopher G. Harris, SUNY Oswego, USA

Saeed Ul Hassan, Information Technology University, Pakistan

Monica Ihli, University of Tennessee, USA

Antoine Isaac, Europeana, The Netherlands

Roman Kern, Graz University of Technology, Austria

Martin Klein, Los Alamos National Laboratory, USA

Birger Larsen, Aalborg University Copenhagen, Denmark

Paolo Manghi, Italian National Research Council, Italy

Bruno Martins, Universidade de Lisboa, Portugal

Philipp Mayr, GESIS Leibniz Institute for the Social Sciences, Germany

Peter Mutschke, GESIS Leibniz Institute for the Social Sciences, Germany

Francesco Osborne, The Open University, UK

Robert M. Patton, Oak Ridge National Laboratory, USA

Eloy Rodrigues, Universidade do Minho, Portugal

Angelo Antonio Salatino, The Open University, UK

Pavel Smrz, Brno University of Technology, Czech Republic

Christopher G. Stahl, Oak Ridge National Laboratory, USA

Wojtek Sylwestrzak, University of Warsaw, Poland

Dominika Tkaczyk, Trinity College Dublin, Ireland

Ziqi Zhang, Nottingham Trent University, UK


Phoenix Seagaia Resort

Miyazaki Prefecture

Miyazaki, Japan

© 7th International Workshop on Mining Scientific Publications.