Organised by:

The European Library
The Open University
Athena Research Centre

The 4th International Workshop on Mining Scientific Publications website is now available here.


Digital libraries that store scientific publications are becoming increasingly central to the research process. They are not only used for traditional tasks, such as finding and storing research outputs, but also as a source for discovering new research trends or evaluating research excellence. With the current growth of scientific publications deposited in digital libraries, it is no longer sufficient to provide only access to content. To aid research it is especially important to improve the process of how research is being done.

Recent developments in natural language processing, information retrieval and the semantic web make it possible to transform the way we work with scientific publications. However, to improve these technologies and carry out experiments, researchers need to be able to easily access and use large databases of scientific publications.

This workshop aims to bring together people from different backgrounds who: (a) are interested in analysing and mining databases of scientific publications, (b) develop systems that enable such analysis and mining of scientific databases, or (c) develop novel technologies that improve the way research is being done.


The topics of the workshop will be organised around the following themes:

  1. The whole ecosystem of infrastructures including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

Topics of interest relevant to theme 1 include, but are not limited to:

  • Infrastructures including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs for accessing scientific publications and/or research data. The existence of datasets, services, systems and APIs (in particular those that are open) providing access to large volumes of scientific publications and research data is an essential prerequisite for researching and developing new technologies that can transform the way people do research. We invite papers presenting innovative approaches to the development of systems that enable people to access these databases and carry out their analysis. Papers addressing Open Access are of special interest. We also welcome submissions discussing the technical aspects of supporting Open Science, in particular reproducibility of research, sharing of scientific workflows and linking research data with publications. Finally, we invite papers discussing issues and current challenges in the design of these systems.

Topics of interest relevant to theme 2 include, but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic concerns the semantic modelling of publications and scholarly discourse. Models that are practical with respect to the state of the art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users to contribute, (b) how to apply crowdsourcing in the specialized domains of scientific publications, and (c) which tasks in the domain of organising scientific publications crowdsourcing is suitable for and where it might fail. We also welcome papers on other crowdsourcing topics relevant to the domain of scientific publications.
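As a rough illustration of the interlinking topic above (connecting publications by semantic similarity rather than explicit citations), the following sketch links abstracts whose bag-of-words cosine similarity reaches a threshold. Bag-of-words cosine is only a simple stand-in for a real semantic similarity measure, and the function names, the threshold value and the sample abstracts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count its word tokens (a simple bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_similar(abstracts, threshold=0.3):
    """Return (id_a, id_b, score) for every pair at or above the threshold.

    `abstracts` maps a publication id to its abstract text; `threshold`
    is a hypothetical cut-off chosen only for this example.
    """
    ids = sorted(abstracts)
    vectors = {i: term_vector(abstracts[i]) for i in ids}
    links = []
    for pos, first in enumerate(ids):
        for second in ids[pos + 1:]:
            score = cosine_similarity(vectors[first], vectors[second])
            if score >= threshold:
                links.append((first, second, round(score, 3)))
    return links


# Made-up abstracts, used only to exercise the functions.
sample_abstracts = {
    "p1": "citation matching for digital libraries",
    "p2": "blocking methods for citation matching",
    "p3": "table recognition in scientific articles",
}
print(link_similar(sample_abstracts))
```

In practice the term vectors would typically be replaced by weighted or learned representations, but the pairwise linking step keeps the same shape.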

Topics of interest relevant to theme 3 include, but are not limited to:

  • New methods, models and innovative approaches for measuring impact of publications. The most widely used metrics for measuring impact are based on citations. However, counting citations does not take into account the publication content or the qualitative nature of the citation. In addition, there is a delay between publication and measurable impact in citations. We particularly encourage papers addressing new ways of evaluating publications’ impact beyond standard citation measures.
  • New methods for measuring performance of researchers. Methods for assessing the impact of a publication can often be extended to methods that assess the impact of individual researchers. However, there are also criteria for measuring impact other than publications, such as the development and publication of research data and economic and market impact, that should also be taken into account. We welcome papers addressing these aspects.
  • Evaluating impact of research groups. What holds for the impact of individuals also holds for research groups and communities.
  • Methods for identifying research trends and cross-fertilization between research disciplines. Identifying research trends should make it possible to discover newly emerging disciplines, or help to explain why certain fields are attracting the attention of a wider research community. Such monitoring is important for research funders and governments so that they can respond quickly to new developments. We invite papers discussing new methods for identifying trends and cross-fertilization between research disciplines using methods ranging from social network analysis and text- and data-mining to innovative visualization approaches.
  • Application and case studies of mining from scientific databases and publications. New methods and models developed for mining from scientific publications can be applied in many different scenarios, such as improving access to scientific publications, providing exploratory search in digital collections and identifying experts. We encourage papers describing innovative approaches that use scientific publications and data to solve real-world problems.
  • Improving the infrastructure of repositories to support the development and integration of new impact and performance metrics. New ways of improving the repository infrastructure can include, for example, tracking accesses and downloads, researcher profiling and the interlinking of repository data with external services. These can in turn be used for developing new impact metrics. We welcome papers addressing these issues.


This year we would like to invite the workshop participants to make use of the CORE publications dataset, which contains a large volume of research publications from a wide variety of research areas. The dataset contains not only full texts, but also an enriched version of the publications’ metadata. The aim is to provide a framework for developing and testing methods and tools addressing the workshop topics. The use of this dataset is not mandatory, but it is encouraged. The dataset is now available through the CORE portal here.


The workshop on Mining Scientific Publications aims to bring together researchers, digital library developers and practitioners from government and industry to address the current challenges in the domain of mining scientific publications.


The 1st International Workshop on Mining Scientific Publications was held in conjunction with JCDL 2012. The 2nd run of this workshop was held in conjunction with JCDL 2013. Both runs of the workshop were extremely successful in attracting submissions and participants from leading institutions in the area, such as the British Library, Elsevier Labs, the National Library of Medicine, the Library of Congress, Pennsylvania State University (CiteSeerX) and Mendeley. The submissions from both of these workshops have been published as a special issue in D-Lib.


We invite submissions related to the workshop’s topics. Long papers should not exceed 8 pages and short papers should not exceed 4 pages in the ACM style. Furthermore, we welcome demo presentations of systems or methods. A demonstration submission should consist of a description, of at most two pages, of the system, method or tool to be demonstrated.

Papers should be submitted using the EasyChair system provided here.

Successful submissions will be published in the D-Lib Magazine.

The proceedings of the 1st International Workshop on Mining Scientific Publications are available here.

The proceedings of the 2nd International Workshop on Mining Scientific Publications are available here.


The workshop will include keynote presentations from Dr. C. Lee Giles and Prof. Birger Larsen.

Dr. C. Lee Giles is the David Reese Professor of Information Sciences and Technology at the Pennsylvania State University with appointments in the departments of Computer Science and Engineering, and Supply Chain and Information Systems. His research interests are in intelligent cyberinfrastructure and big data, web tools, specialty search engines, information retrieval, digital libraries, web services, knowledge and information extraction, data mining, entity disambiguation, and social networks. He has published nearly 400 papers in these areas with over 24,000 citations and an h-index of 73 according to Google Scholar. He was a co-creator of the popular search engine CiteSeer (now CiteSeerX) and related scholarly search engines. He is a fellow of the ACM, IEEE, and INNS.

Information Extraction and Data Mining for Scholarly Big Data

Collections of scholarly documents are usually not thought of as big data. However, large collections of scholarly documents often have many millions of publications, authors, citations, equations, figures, etc., and large-scale related data and structures such as social networks, slides, data sets, etc. We discuss the size of scholarly big data and present challenges, insights, methodologies and applications. We illustrate scholarly big data issues with examples of specialized search engines and recommendation systems based on the SeerSuite software. Using information extraction and data mining, we illustrate applications in such diverse areas as computer science, chemistry, archaeology, acknowledgements, citation recommendation, collaboration recommendation, and others.

Madian Khabsa, C. Lee Giles. "The Number of Scholarly Documents on the Public Web." PLoS ONE, 2014.
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Douglas Jordan, C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." To appear in: Twenty-Sixth Annual Conference on Innovative Applications of Artificial Intelligence, 2014.

Birger Larsen is Professor in Information Analysis and Information Retrieval at the Department of Communication at Aalborg University Copenhagen. His main research interests include Information Retrieval (IR), structured documents in IR, XML IR and user interaction, domain specific search, understanding user intents and exploiting context in IR, as well as Informetrics/Bibliometrics, citation analysis and quantitative research evaluation.

Developing benchmark datasets of scholarly documents and investigating the use of anchor text in physics retrieval

Anchor text, the text clicked on or immediately surrounding an outgoing hyperlink, has been used successfully to represent the document linked to in web search (e.g. Brin & Page, 1998). Ritchie and colleagues have shown the potential for exploiting anchor text in scientific and scholarly retrieval, but on fairly small and not publicly available datasets in the field of Computational Linguistics (Ritchie, Teufel and Robertson, 2009). As citation behaviour differs considerably between fields (Moed, 2005), we revisit the question of the usefulness of anchor text in a different field: physics. Using the iSearch test collection (Lykke et al., 2010) and full-text versions of its documents now freely available from the electronic preprint archive, we investigate 1) whether we can identify citations and anchor text reliably using iSearch and the LaTeX arXiv source documents, 2) how to integrate this into ranking in a language modelling framework and 3) the optimal anchor text window size for this task. The talk also presents iSearch and discusses the possibilities for building even larger scientific test collections for IR.
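To make the notion of an anchor text window around a citation concrete, here is a minimal sketch that collects a fixed number of words on either side of each LaTeX `\cite{...}` command. It is not the method used in the talk: the function name, the regular expression and the `window` parameter are assumptions chosen for illustration, and the sample sentence is made up.

```python
import re

# Matches \cite{key1,key2} and variants such as \citep{...} or \citet{...}.
# Optional arguments like \cite[p. 3]{key} are not handled in this sketch.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\{([^}]*)\}")


def anchor_windows(latex_text, window=3):
    """Return (cited_key, context_words) pairs for each citation command.

    `window` is the number of whitespace-separated words kept on each
    side of the citation (a hypothetical parameter for illustration).
    """
    results = []
    for match in CITE_RE.finditer(latex_text):
        # Words before and after this citation, ignoring other citation commands.
        before = CITE_RE.sub(" ", latex_text[:match.start()]).split()
        after = CITE_RE.sub(" ", latex_text[match.end():]).split()
        context = before[-window:] + after[:window] if window > 0 else []
        # A single command may cite several keys; each gets the same context.
        for key in match.group(1).split(","):
            results.append((key.strip(), context))
    return results


# Made-up sentence, used only to exercise the function.
sample = (
    r"Anchor text helps web search \cite{brin1998} and may help "
    r"scholarly retrieval \cite{ritchie2009,moed2005} too."
)
for key, words in anchor_windows(sample, window=3):
    print(key, words)
```

Varying `window` in a sketch like this is one simple way to explore how much surrounding text to attribute to the cited document.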


July 13, 2014 - Submission deadline

July 20, 2014 - Submission deadline (extended)

August 11, 2014 - Notification of acceptance

August 25, 2014 - Camera-ready

September 12, 2014 - Workshop


Introduction and keynote 1

Chair: Petr Knoth




Keynote talk

Information Extraction and Data Mining for Scholarly Big Data

Dr. C. Lee Giles

Paper session 1

Chair: Drahomira Herrmannova


Long paper

A Comparison of two Unsupervised Table Recognition Methods from Digital Scientific Articles

Stefan Klampfl, Kris Jack and Roman Kern

Nominated for best paper award


Short paper

A Keyquery-Based Classification System for CORE

Michael Völske, Tim Gollub, Matthias Hagen and Benno Stein

Nominated for best paper award


Short paper

Discovering and visualizing interdisciplinary content classes in scientific publications

Theodoros Giannakopoulos, Ioannis Foufoulas, Eleftherios Stamatogiannakis, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis



Paper session 2

Chair: Nuno Freire


Long paper

Efficient blocking method for a large scale citation matching

Mateusz Fedoryszak and Łukasz Bolikowski


Long paper

Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers

Giovanni Yoko Kristianto, Goran Topic and Akiko Aizawa


Short paper

Towards a Marketplace for the Scientific Community: Accessing Knowledge from the Computer Science Domain

Mark Kröll, Stefan Klampfl and Roman Kern


Short paper

Experiments on Rating Conferences with CORE and DBLP

Irvan Jahja, Suhendry Effendy and Roland Yap


Short paper

A new semantic similarity based measure for assessing research contribution

Petr Knoth and Drahomira Herrmannova

Nominated for best paper award



Elsevier's Text and Data Mining Policy

Gemma Hersh



Keynote 2

Chair: Zdenek Zdrahal


Keynote talk

Developing benchmark datasets of scholarly documents and investigating the use of anchor text in physics retrieval

Birger Larsen

Demo session

Chair: Loukas Anastasiou


Demo paper

AMI-diagram: Mining Facts from Images

Peter Murray-Rust, Richard Smith-Unna and Ross Mounce


Demo paper

Annota: Towards Enriching Scientific Publications with Semantics and User Annotations

Michal Holub, Róbert Móro, Jakub Ševcech, Martin Lipták and Maria Bielikova


Demo paper

The ContentMine scraping stack: literature-scale content mining with community maintained collections of declarative scrapers

Richard Smith-Unna and Peter Murray-Rust



Paper session 3

Chair: Zdenek Zdrahal


Long paper

GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

Dominika Tkaczyk, Pawel Szostek and Łukasz Bolikowski

Nominated for best paper award


Long paper

The Architecture and Datasets of Docear’s Research Paper Recommender System

Joeran Beel, Stefan Langer, Bela Gipp, and Andreas Nürnberger


Long paper

Social, Political and Legal Aspects of Text and Data Mining

Michelle Brook, Peter Murray-Rust and Charles Oppenheim


Chair: Kris Jack



After the workshop, there will be a social dinner at 5:30pm at the Wilmington pub, open to all attendees.

The Wilmington pub is located at 69 Rosebery Avenue, Clerkenwell, EC1R 4RL, very close to the workshop venue: only a 7-minute walk.


Petr Knoth, Knowledge Media Institute, The Open University, UK

Zdenek Zdrahal, Knowledge Media Institute, The Open University, UK

Stelio Piperidis, Institute for Language and Speech Processing (META-SHARE), Athena Research Center, Greece

Nuno Freire, The European Library/Europeana, The Netherlands

Kris Jack, Mendeley Ltd., United Kingdom

Drahomira Herrmannova, Knowledge Media Institute, The Open University, UK

Lucas Anastasiou, Knowledge Media Institute, The Open University, UK


Bruno Martins, Technical University of Lisbon (IST), Portugal

Eloy Rodrigues, University of Minho, Portugal

Francesco Osborne, Knowledge Media Institute, The Open University, UK

Iryna Gurevych, Darmstadt University of Technology, Germany

Martin Klein, Los Alamos National Laboratory, USA

Natalia Manola, University of Athens, Greece

Paolo Manghi, ISTI-CNR (DRIVER, OpenAIRE), Italy

Pável Calado, Technical University of Lisbon (IST), Portugal

Robert M. Patton, Oak Ridge National Laboratory, USA

Robert Sanderson, Digital Library Systems and Services, Stanford, USA

Roman Kern, Know Center Graz, Austria

Tanja Urbancic, Jožef Stefan Institute, Slovenia

Wojtek Sylwestrzak, University of Warsaw, Poland

Ziqi Zhang, Department of Computer Science, University of Sheffield, UK


City University London

College and Social Sciences Buildings

St. John Street

London EC1V 0HB

Please note that this is a different venue from the main conference.

© 3rd International Workshop on Mining Scientific Publications.