Organised by: The Open University


Digital libraries that store scientific publications are becoming increasingly central to the research process. They are not only used for traditional tasks, such as finding and storing research outputs, but also as a source for discovering new research trends or evaluating research excellence. With the current growth of scientific publications deposited in digital libraries, it is no longer sufficient to provide only access to content. To aid research, it is especially important to leverage the potential of text and data mining technologies to improve the process of how research is being done.

This workshop aims to bring together people from different backgrounds who: (a) are interested in analysing and mining databases of scientific publications, (b) develop systems that enable such analysis and mining of scientific databases (especially those who run databases of publications) or (c) develop novel technologies that improve the way research is being done.


The topics of the workshop will be organised around the following themes:

  1. The whole ecosystem of infrastructures including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text and data mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

Topics of interest relevant to theme 1 include, but are not limited to:

  • Infrastructures including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs for accessing scientific publications and/or research data. The existence of datasets, services, systems and APIs (in particular those that are open) providing access to large volumes of scientific publications and research data is an essential prerequisite for researching and developing new technologies that can transform the way people do research. We invite papers presenting innovative approaches to the development of these systems that enable people to access databases and carry out their analysis. Papers addressing Open Access are of special interest. We also welcome submissions discussing the technical aspects of supporting Open Science, in particular reproducibility of research, sharing of scientific workflows and linking research data with publications. Finally, we also invite papers discussing issues and current challenges in the design of these systems.

Topics of interest relevant to theme 2 include, but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic is related to the aspect of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state of the art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users to contribute, (b) how to apply crowdsourcing in the specialized domains of scientific publications, and (c) which tasks in organising scientific publications crowdsourcing is suitable for and where it might fail. We also welcome papers on other crowdsourcing topics relevant to the domain of scientific publications.

Topics of interest relevant to theme 3 include, but are not limited to:

  • New methods, models and innovative approaches for measuring impact of publications. The most widely used metrics for measuring impact are based on citations. However, counting citations does not take into account the publication content or the qualitative nature of the citation. In addition, there is a delay between the publication and the measurable impact in citations. We particularly encourage papers addressing new ways of evaluating publications’ impact beyond standard citation measures.
  • New methods for measuring performance of researchers. Methods for assessing the impact of a publication can often be extended to methods that assess the impact of individual researchers. However, there are also other criteria for measuring impact in addition to publications, such as the development and publication of research data and economic and market impact, that should also be taken into account. We welcome papers addressing these aspects.
  • Evaluating impact of research groups. What holds for the impact of individuals also holds for research groups and communities.
  • Methods for identifying research trends and cross-fertilization between research disciplines. Identifying research trends should allow the discovery of newly emerging disciplines, or help to explain why certain fields are attracting the attention of a wider research community. Such monitoring is important for research funders and governments in order to be able to quickly respond to new developments. We invite papers discussing new methods for identifying trends and cross-fertilization between research disciplines using methods ranging from social network analysis and text and data mining to innovative visualization approaches.
  • Applications and case studies of mining from scientific databases and publications. New methods and models developed for mining from scientific publications can be applied in many different scenarios, such as improving access to scientific publications, providing exploratory search in digital collections and identifying experts. We encourage papers describing innovative approaches that use scientific publications and data to solve real-world problems.
  • Improving the infrastructure of repositories to support the development and integration of new impact and performance metrics. New ways of improving the repository infrastructure can include, for example, tracking accesses and downloads, researcher profiling and the interlinking of repository data with external services. These can in turn be used for developing new impact metrics. We welcome papers addressing these issues.


We would like to invite the workshop participants to make use of the CORE publications dataset, which contains a large volume of research publications from a wide variety of research areas. The dataset contains not only full texts, but also an enriched version of the publications’ metadata. The aim is to provide a framework for developing and testing methods and tools addressing the workshop topics. The use of this dataset is not mandatory, but it is encouraged. The dataset is now available through the CORE portal.


The workshop on Mining Scientific Publications aims to bring together researchers, digital library developers and practitioners from government and industry to address the current challenges in the domain of mining scientific publications.


The 1st International Workshop on Mining Scientific Publications was held in conjunction with JCDL 2012. The 2nd run of this workshop was held in conjunction with JCDL 2013. The 3rd run was especially popular and was associated with DL 2014 in London. The 4th run was held together with JCDL 2015. All runs of the workshop have been extremely successful in terms of attracting submissions and participants from leading institutions in the area including Cambridge University, British Library, Elsevier Labs, National Library of Medicine, Library of Congress, University of Pennsylvania (CiteSeerX), Know-Center Graz, University of Athens (OpenAIRE project) and Mendeley.


We invite submissions related to the workshop's topics. Long papers should not exceed 8 pages and short papers should not exceed 4 pages in the ACM style. Furthermore, we welcome demo presentations of systems or methods. A demonstration submission should consist of a description, of at most two pages, of the system, method or tool to be demonstrated.

The ACM proceedings template can be found on the ACM website. Papers should be submitted using the EasyChair system.

Successful submissions will be published in D-Lib Magazine.

The 1st International Workshop on Mining Scientific Publications proceedings are available here.

The 2nd International Workshop on Mining Scientific Publications proceedings are available here.

The 3rd International Workshop on Mining Scientific Publications proceedings are available here.

The 4th International Workshop on Mining Scientific Publications proceedings are available here.


Yuxiao Dong, University of Notre Dame
Yuxiao Dong is a final-year Ph.D. student in Computer Science and Engineering and the Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, U.S. He has also been working with the Tsinghua AMiner team for six years. His research focuses on social networks, data mining, and computational social science, with an emphasis on applying computational models to address problems in large social systems, such as academic collaboration, mobile communication, and online social media. His research has been published in top data science conferences and interdisciplinary journals, and has also won two best paper awards/nominations.
AMiner: toward understanding big scholar data
AMiner is the second generation of the ArnetMiner system. We focus on developing author-centric analytic and mining tools for gaining a deep understanding of the large and heterogeneous networks formed by authors, papers, venues, and knowledge concepts. One fundamental goal is to extract and integrate semantics from different sources. We have developed algorithms to automatically extract researchers’ profiles from the Web, resolve the name ambiguity problem, and connect different professional networks. We have also developed methodologies to incorporate knowledge from Wikipedia and other sources into the system to bridge the gap between network science and web mining research. In this talk, I will focus on answering two fundamental questions for author-centric network analysis: who is who? and who are similar to each other? The system has been in operation since 2006 and has collected more than 100,000,000 author profiles, 100,000,000 publications, and 7,800,000 knowledge concepts. It has been widely used for collaboration recommendation, similarity analysis, and community evolution.

Michael J. Kurtz, Harvard-Smithsonian Center for Astrophysics

Michael Kurtz is an astronomer and computer scientist at the Harvard-Smithsonian Center for Astrophysics in Cambridge, Massachusetts, which he joined after receiving a PhD in Physics from Dartmouth College in 1982. Kurtz is the author or co-author of over 300 technical articles and abstracts on subjects ranging from cosmology and extragalactic astronomy to data reduction and archiving techniques to information systems and text retrieval algorithms.

Kurtz is the founder and project scientist of the Smithsonian/NASA Astrophysics Data System (ADS) for which he won the van Biesbroeck prize of the American Astronomical Society. He has received the Citation research award from the American Society for Information Science; he is a fellow in the astrophysics section of the American Physical Society, and a fellow in the Information, Computing and Communication section of the American Association for the Advancement of Science.

He is on the board of directors of the Classification Society and the board of advisors of Force11. He is the moderator of the astrophysics Instrumentation and Methods section of arXiv, and is an editor of the Journal of the Association for Information Science and Technology.

List of publications via ADS (h-index 35)

List of publications via Google Scholar (h-index 40)




ADS: The Joy of Text

The Smithsonian/NASA Astrophysics Data System (ADS) is one of the oldest web-based scholarly information systems. Next year we will have been online for a quarter of a century. Today it contains metadata on more than 11 million articles, and the full text for 5 million, including nearly every refereed article in physics, astrophysics, or geophysics. The ADS is used daily by several tens of thousands of scientists, including essentially every research astronomer on earth, as well as weekly to monthly by a few hundred thousand more students and researchers, and occasionally by several million members of the general public.

With substantial help from its collaborators the ADS uses a plethora of techniques to build, maintain, and enhance its services. These include text mining of articles and meta-data; data mining of usage logs; the development and implementation of new bibliometric measures for papers, people, and organizations; semantic tagging, and the creation of links to external data sources; machine learning and text classification; recommender systems; real-time network analysis; and various user interface issues.

The ADS is available online, and a full-featured API exists for developer and researcher use.


Stelios Piperidis, Athena RC/ILSP
Stelios Piperidis is a senior researcher and Head of the Natural Language and Knowledge Extraction Department at the Institute for Language and Speech Processing/Athena RC. He is also the Director of the CLARIN:EL network, a member of the META-NET Executive Board, supervisor of the META-SHARE infrastructure, and a member of the LREC Conference Programme Committee. His research interests include statistical and deductive methods in natural language processing, language resources and automatic linguistic knowledge elicitation, machine translation and philosophy of language. He has led ILSP’s participation in over 30 R&D projects in mono/multilingual and multimedia information processing. He served as President of the European Language Resources Association (2008-2012). He is associated with the National Technical University and the National and Kapodistrian University of Athens, where he teaches postgraduate courses on Logic and Language, Logic Programming and natural language processing systems.
Making sense of scientific textual content
Recent years have witnessed an upsurge in the quantities of digital research data, offering new insights and opportunities for improved understanding. Text and data mining is emerging as a powerful tool for harnessing the power of structured and unstructured content and data, by analysing them in several dimensions to discover hidden and new knowledge. Text mining solutions are, however, not easy to discover and use, nor are they easily combinable by end users. In this talk we present OpenMinTeD, an infrastructural approach that fosters and facilitates the use of text mining technologies in the scientific publications world. OpenMinTeD builds on existing text mining tools and platforms, and renders them discoverable and interoperable through appropriate registries and a standards-based interoperability layer, respectively. We will discuss the merits of the approach as well as several use cases identified by scholars and experts from different areas, ranging from generic scholarly communication to scientific literature related to life sciences, food and agriculture, and social sciences and humanities.

Peter Mutschke, GESIS
Peter Mutschke is a senior researcher at GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany. Since March 2010 he has been acting head of the GESIS department “Knowledge Technologies for the Social Sciences” (WTS). Peter's research interests include Information Retrieval, Network Analysis and Science 2.0. He has worked on a number of national and international research projects such as the DFG-funded projects “Distributed Agents for User-Friendly Access of Digital Libraries” (DAFFODIL) and “Value-Added Services for Information Retrieval” (IRM), the DELOS/NSF Working Group on “Reference Models for Digital Libraries”, and the EU-funded project “Where eGovernment meets the eSociety” (WeGov). Currently, Peter is involved in the EU-funded projects “Data Insights for Policy Makers & Citizens” (SENSE4US) and “Open Mining INfrastructure for TExt and Data” (OpenMinTeD). Furthermore, Peter is involved in major national and European research networks such as the COST action “Analyzing the dynamics of information and knowledge landscapes” (KNOWeSCAPE) and the research alliance “Science 2.0” of the German Leibniz Association. For both research networks Peter serves as a member of the management committee.
Making sense of unstructured textual data to enhance information discovery and linking. Challenges and potential of text mining in scholarly IR
Scientists spend considerable time searching for relevant publications and research data, and in particular for relationships between data and literature. Thus, there is considerable interest in automatic methods for reliable recognition, disambiguation and linking as well as context-sensitive retrieval of relevant entities. Text mining is the process of extracting meaningful information from unstructured text and discovering hidden relationships between recognized items. OpenMinTeD aims at establishing an open and sustainable text mining infrastructure that makes text mining possible for everyone. The driving force of the OpenMinTeD project is its open science-oriented, researcher-centered approach, which is intended to ensure that researcher communities’ needs are addressed. To achieve these aims, OpenMinTeD will develop and implement a number of use cases from different scientific areas. From the perspective of the social science use case, the talk focuses on major challenges as well as research highlights in information retrieval, information extraction and linking as well as knowledge mapping, and their relevance to OpenMinTeD.


April 17th — Submission deadline

April 24th — New submission deadline

April 27th — New submission deadline

May 27th — Notification of acceptance

June 15th — Camera-ready

June 22nd (afternoon)-June 23rd (morning) — Workshop


Day 1


Registration and posters

Subject Area Visual Analytics of Scientific User Facility Publications

Robert Patton, Christopher Stahl, Chelsey Stahl, Thomas Potok and Jack Wells

Extracting biological knowledge from literature using SQL

Yannis Foufoulas, Anna Gogolou, Lefteris Stamatogiannakis, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis

Towards deeper level of scientific publications world

Marcin Skulimowski

Language infrastructures in support of text mining

Stelios Piperidis, Maria Gavrilidou and Penny Labropoulou





AMiner: toward understanding big scholar data

Yuxiao Dong


Long paper

Quantifying conceptual novelty in the biomedical literature

Shubhanshu Mishra and Vetle Torvik


Short paper

Capturing Interdisciplinarity from Academic Abstracts

Federico Nanni, Laura Dietz, Stefano Faralli, Goran Glavas and Simone Paolo Ponzetto


Break and posters (continued)


Invited talk

Making sense of scientific textual content

Stelios Piperidis


Short paper

Crawling Scientific Repositories: Challenges and Solutions for Automated Retrieval from Google Scholar and Co.

Philipp Meschenmoser, Manuel Hotz, Bela Gipp and Norman Meuschke



Extraction of Text from PDF Research Articles Using Font Analysis

Stephen Gilbert, Nirav Kamdar, Vijay Kalivarapu and Annette O'Connor



COBRA: Publication Discovery and Management System

Christopher Stahl, Robert Patton and Jack Wells


Social dinner

Day 2



ADS: The Joy of Text

Michael J. Kurtz


Long paper

Rhetorical Classification of Anchor Text for Citation Recommendation

Daniel Duma, Maria Liakata, Amanda Clare, James Ravenscroft and Ewan Klein


Short paper

Temporal Properties of Recurring In-text References

Marc Bertin and Iana Atanassova




Invited talk

Making sense of unstructured textual data to enhance information discovery and linking. Challenges and potential of text mining in scholarly IR

Peter Mutschke


Long paper

Measuring Scientific Impact Beyond Citation Counts

Robert Patton, Christopher Stahl and Jack Wells


Short paper

Preliminary Studies on the Impact of Literature Curation by Model Organism Databases on Article Citation Rates

Michael Lauruhn, Tanya Berardini, Leonore Reiser and Ronald Daniel




Long paper

The Impact of Academic Mobility on the Quality of Graduate Programs

Thiago Silva, Alberto Laender, Clodoveu Davis Jr, Ana Paula Silva and Mirella Moro


Long paper

An Analysis of the Microsoft Academic Graph

Drahomira Herrmannova and Petr Knoth








Petr Knoth, Knowledge Media Institute, The Open University, UK

Drahomira Herrmannova, Knowledge Media Institute, The Open University, UK

Lucas Anastasiou, Knowledge Media Institute, The Open University, UK

Nancy Pontika, Knowledge Media Institute, The Open University, UK


Pável Calado, Instituto Superior Técnico, Universidade de Lisboa, Portugal

Bradford Demarest, Indiana University Bloomington, USA

Iryna Gurevych, Darmstadt University of Technology, Germany

Antoine Isaac, Europeana & VU University Amsterdam, Netherlands

Roman Kern, Graz University of Technology, Austria

Martin Klein, Los Alamos National Laboratory, USA

Paolo Manghi, ISTI-CNR, Italy

Bruno Martins, Instituto Superior Técnico, Universidade de Lisboa, Portugal

Franco Maria Nardini, ISTI-CNR, Italy

Francesco Osborne, KMi, The Open University, UK

Eloy Rodrigues, Universidade do Minho, Portugal

Angelo Antonio Salatino, KMi, The Open University, UK

Pavel Smrz, Brno University of Technology, Czech Republic

Wojtek Sylwestrzak, ICM University of Warsaw, Poland

Vetle Torvik, University of Illinois at Urbana-Champaign, USA

Saeed Ul Hassan, Information Technology University, Pakistan

Ziqi Zhang, University of Sheffield, UK


Rutgers University

Newark, NJ


© 5th International Workshop on Mining Scientific Publications.