Organised by: The Open University


Digital libraries that store scientific publications are becoming increasingly central to the research process. They are used not only for traditional tasks, such as finding and storing research outputs, but also as a source for discovering new research trends or evaluating research excellence. With the current growth of scientific publications deposited in digital libraries, it is no longer sufficient to provide only access to content. To aid research, it is especially important to leverage the potential of text and data mining technologies to improve how research is done.

This workshop aims to bring together people from different backgrounds who:

  • are interested in analysing and mining databases of scientific publications
  • develop systems that enable such analysis and mining of scientific databases (especially those who run databases of publications)
  • develop novel technologies that improve the way research is being done.

    2. TOPICS

    The topics of the workshop will be organised around the following themes:

    1. The whole ecosystem of infrastructures, including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications, and surrounding issues such as interoperability and data sharing.
    2. Semantic enrichment of scientific publications by means of text and data mining, crowdsourcing or other methods.
    3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

    Topics of interest relevant to theme 1 include, but are not limited to:

    • Infrastructures, including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs for accessing scientific publications and/or research data
    • Interoperability issues in research TDM workflows
    • Issues around integration of cutting-edge tools in production systems

    Topics of interest relevant to theme 2 include, but are not limited to:

    • Information extraction and text-mining applied to scholarly data
    • Automatic categorization and clustering of scholarly data
    • Approaches to information retrieval of academic publications
    • Academic recommender systems
    • Models for semantically representing and annotating publications (ontologies, interoperability issues, etc.)
    • Literature-based discovery
    • (Reproducible) text and data mining workflows for scientific publications
    • Scholarly knowledge graphs

    Topics of interest relevant to theme 3 include, but are not limited to:

    • Measuring impact of publications (bibliometrics, webometrics, altmetrics, semantometrics)
    • Higher-level impact metrics to assess performance of researchers, departments, universities, etc.
    • Analysing research collaboration networks
    • Methods for identifying research trends and cross-fertilisation between research disciplines
    • Applications and case studies of mining from scientific databases and publications


    We would like to invite the workshop participants to make use of the CORE publications dataset, which contains a large volume of research publications from a wide variety of research areas. The dataset contains not only full texts, but also an enriched version of the publications' metadata. It provides a framework for developing and testing methods and tools that address the workshop topics. Use of this dataset is not mandatory, but it is encouraged. The dataset is available through the CORE portal.


    The workshop on Mining Scientific Publications aims to bring together researchers, digital library developers and practitioners from government and industry to address the current challenges in the domain of mining scientific publications.


    The 1st International Workshop on Mining Scientific Publications was held in conjunction with JCDL 2012. The 2nd run of this workshop was held in conjunction with JCDL 2013. The 3rd run was associated with DL 2014 in London. The 4th run took place together with JCDL 2015. Finally, the 5th run of this workshop was associated with JCDL 2016. All runs of the workshop have been extremely successful in attracting submissions and participants from leading institutions in the area, including Cambridge University, Microsoft, British Library, Elsevier, National Library of Medicine, Library of Congress, University of Pennsylvania (CiteSeerX), Know-Center Graz, University of Athens (OpenAIRE project) and Mendeley.

    6. FORMAT

    We plan this workshop as a full-day event. The workshop is organized this year for the sixth time (the five previous workshops were held in association with JCDL or DL) and is planned to take place yearly. The workshop will consist of two invited talks, a series of presentations each followed by a short discussion, a short group-work session dedicated to addressing specific issues in the field, and a final round table discussion at the end of the day. The workshop participants will also be encouraged to visit and experience demonstrations that will be presented during coffee breaks. In the evening, the workshop participants will have the opportunity to attend an informal dinner.


    We invite submissions related to the workshop's topics. Long papers should not exceed 8 pages and short papers should not exceed 4 pages in the ACM style. Furthermore, we welcome demo presentations of systems or methods. A demonstration submission should consist of a description, of at most two pages, of the system, method or tool to be demonstrated. All submissions will be uploaded to EasyChair for peer review.

    Papers should be submitted using the EasyChair system.

    Successful submissions will be published as a special issue of the D-Lib journal. Proceedings from previous years are listed below.


    All submissions will be peer-reviewed and meta-reviewed by members of the Programme Committee. Each submission will be assigned a score and the best submissions will be selected. The process will be the same as in previous years.


    This year, we have applied to publish accepted short and full papers in the ACM International Conference Proceeding Series (ICPS). We are currently awaiting ACM's decision on the matter.

    The proceedings of the special issues from previous years are available at:

    D-Lib July/August 2012 contents

    D-Lib September/October 2013 contents

    D-Lib November/December 2014 contents

    D-Lib November/December 2015 contents

    D-Lib September/October 2016 contents


    Waleed Ammar, Allen Institute for Artificial Intelligence
    Waleed Ammar is the research team lead for He develops models for converting natural language text into structured representations, with a special focus on scientific publications. Before doing his Ph.D. at Carnegie Mellon University, Waleed was an SDE2 at Microsoft Research, web developer at eSpace Technologies, and teaching assistant at Alexandria University. He was awarded the Google PhD fellowship award and two Microsoft Research Tech Transfer awards.
    Towards a more efficient, less painful discovery of scientific research findings
    How do we help scientists find their needle in a haystack of scientific publications? In this talk, I will first give an overview of several projects we're working on at the Allen Institute for Artificial Intelligence to address this question, including advances in ranking, figure extraction, metadata extraction, document similarity and question answering. Then, I will describe the literature graph, our approach to capturing semantics via a symbolic representation of the scientific literature, and discuss preliminary results and future work.

    Jevin D. West, University of Washington
    Jevin West is an Assistant Professor at the Information School at the University of Washington and co-director of the DataLab. He develops tools and methods for reading the literature at the scale of millions of publications. These tools include auto-categorization approaches, network visualization designs, and recommender systems. Using these tools, he investigates biases in science, the origin of ideas and disciplines, and reward structures in science. He co-founded several research projects around these ideas including and
    Viziometrics: building a figure-centric search engine for the scholarly literature
    Figures are a primary mode for communicating scientific results, yet little has been done to extract and analyze this information at scale. Most of the work in mining the literature has been on full text, citations, or metadata associated with an article. These visual objects are information dense and complex, but as the saying goes, worth a thousand words. In this talk, I will present some methods for extracting this information and provide some ways that this information can be used for better searching the scholarly literature and for asking basic questions around visual communication and impact.


    Sunday, 23rd April 2017 11:59 (Hawaii time) - Submission deadline

    Friday, 5th May 2017 11:59 (Hawaii time) - Extended Submission deadline

    Thursday, 18th May 2017 - Notification of acceptance

    Monday, 12th June 2017 - Camera-ready

    Monday, 19th June 2017 - Workshop

    12. PROGRAM

    9:00-9:10 Introduction
    9:10-9:45 Keynote talk
    Towards a more efficient, less painful discovery of scientific research findings
    Waleed Ammar
    9:45-10:05 Long paper
    Analyzing Semantic Concept Patterns to Detect Academic Plagiarism
    Norman Meuschke, Nicolas Siebeck, Moritz Schubotz and Bela Gipp
    10:05-10:20 Short paper
    Investigating Convolutional Networks and Domain-Specific Embeddings for Semantic Classification of Citations
    Anne Lauscher, Goran Glavas, Simone Paolo Ponzetto and Kai Eckert
    10:20-10:40 Long paper
    AppTechMiner: Mining Applications and Techniques from Scientific Articles
    Mayank Singh, Soham Dan, Sanyam Agarwal, Pawan Goyal and Animesh Mukherjee
    10:40-11:10 Break
    11:10-11:30 Long paper
    Word importance-based similarity of documents metric (WISDM)
    Viktor Botev, Kaloyan Marinov and Florian Schäfer
    11:30-11:45 Short paper
    Audience Based View of Publication Impact
    Robert Patton, Drahomira Herrmannova, Christopher Stahl, Jack Wells and Thomas Potok
    11:45-12:05 Long paper
    Multi-level mining and visualization of scientific text collections. Exploring a bilingual scientific repository
    Pablo Accuosto, Francesco Ronzano, Daniel Ferrés and Horacio Saggion
    12:05-12:20 Demo paper
    Content Analytics Toolbench (CAT): a flexible single point of access for content enhancement and data analytics across massive corpora
    Ron Daniel and Michael Lauruhn
    12:20-12:40 Long paper
    Rapid Tagging and Reporting for Functional Language Extraction in Scientific Articles
    Mahmood Ramezani, Vijay Kalivarapu, Stephen Gilbert, Sarah Huffman, Elena Cotos and Annette O'Connor
    12:40-13:00 Invited talk
    Towards effective research recommender systems
    Petr Knoth
    13:00-14:00 Lunch
    14:00-14:35 Keynote talk
    Viziometrics: building a figure-centric search engine for the scholarly literature
    Jevin West
    14:35-14:55 Long paper
    HyPRec: a Weighted Hybrid Approach for Scientific Paper Recommendation
    Anas Alzoghbi, Mostafa M. Mohamed, Omar Nada, Ibrahim Alshibani, Victor Anthony Arrascue Ayala and Georg Lausen
    14:55-15:10 Short paper
    Comparing citation numbers between articles at two stages of a Model Organism Database curation workflow
    Michael Lauruhn and Gillian Millburn
    15:10-15:30 Long paper
    Methods for Synthesis of Funding Agency & Publisher Data
    Monica Ihli
    15:30-16:00 Break
    16:00-16:20 Long paper
    Geographical Distribution of Biomedical Research in the USA
    Yingjun Guan, Jing Du and Vetle Torvik
    16:20-16:35 Demo paper
    Iris.AI - Science Assistant
    Viktor Botev
    16:35-16:50 Short paper
    A Discipline-Enriched Dataset for Tracking the Computational Turn of European Universities
    Federico Nanni and Giulia Paci
    16:50-17:00 Closing


    Petr Knoth, Knowledge Media Institute, The Open University, UK

    Robert Patton, Oak Ridge National Laboratory, USA

    Drahomira Herrmannova, Oak Ridge National Laboratory, USA

    David Pride, Knowledge Media Institute, The Open University, UK

    Anita Khadka, Knowledge Media Institute, The Open University, UK


    Iana Atanassova, CRIT, Université de Bourgogne Franche-Comté, France

    Joeran Beel, Trinity College, University of Dublin, Ireland

    Marc Bertin, Paris-Sorbonne University, France

    Pável Calado, Instituto Superior Técnico, Universidade de Lisboa, Portugal

    Tanmoy Chakraborty, University of Maryland, USA

    Aristotelis Charalampous, KMi, The Open University, UK

    Daniel Duma, University of Edinburgh, UK

    Shang Gao, Oak Ridge National Laboratory, USA

    Christopher G. Harris, SUNY Oswego, USA

    Saeed Ul Hassan, Information Technology University, Pakistan

    Antoine Isaac, Europeana & VU University Amsterdam, The Netherlands

    Roman Kern, Graz University of Technology, Austria

    Martin Klein, Los Alamos National Laboratory, USA

    Birger Larsen, Aalborg University Copenhagen, Denmark

    Paolo Manghi, ISTI-CNR, Italy

    Bruno Martins, Instituto Superior Técnico, Universidade de Lisboa, Portugal

    Philipp Mayr, GESIS - Leibniz Institute for the Social Sciences, Germany

    Peter Mutschke, GESIS - Leibniz Institute for the Social Sciences, Germany

    Franco Maria Nardini, ISTI-CNR, Italy

    Francesco Osborne, KMi, The Open University, UK

    John X. Qiu, Oak Ridge National Laboratory/University of Tennessee, USA

    Eloy Rodrigues, Universidade do Minho, Portugal

    Angelo Antonio Salatino, KMi, The Open University, UK

    Pavel Smrz, Brno University of Technology, Czech Republic

    Mike Thelwall, University of Wolverhampton, UK

    Vetle Torvik, University of Illinois, USA

    Michael T. Young, Oak Ridge National Laboratory, USA

    12. LOCATION

    University of Toronto

    College View Ave, Toronto


    © 6th International Workshop on Mining Scientific Publications.