The 4th International Workshop on Mining Scientific Publications website is now available here.
1. INTRODUCTION
Digital libraries that store scientific publications are becoming
increasingly central to the research process. They are not only
used for traditional tasks, such as finding and storing research
outputs, but also as a source for discovering new research trends
or evaluating research excellence. With the current growth of
scientific publications deposited in digital libraries, it is no longer
sufficient to provide only access to content. To aid research it is
especially important to improve the process of how research is
being done.
Recent developments in natural language processing,
information retrieval and the semantic web make it possible to
transform the way we work with scientific publications. However,
in order to be able to improve these technologies and carry out
experiments, researchers need to be able to easily access and use
large databases of scientific publications.
This workshop aims to bring together people from different
backgrounds who: (a) are interested in analysing and mining
databases of scientific publications, (b) develop systems that
enable such analysis and mining of scientific databases or (c)
develop novel technologies that improve the way research is being
done.
2. TOPICS
The topics of the workshop will be organised around the
following themes:
- The whole ecosystem of infrastructures, including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs, that enable analysis of large volumes of scientific publications.
- Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
- Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence, etc.
Topics of interest relevant to theme 1 include, but are not limited to:
-
Infrastructures including repositories, aggregators, text- and
data-mining facilities, impact monitoring tools, datasets,
services and APIs for accessing scientific publications
and/or research data.
The existence of datasets, services,
systems and APIs (in particular those that are open)
providing access to large volumes of scientific publications
and research data is an essential prerequisite for being able
to research and develop new technologies that can transform
the way people do research. We invite papers presenting
innovative approaches to the development of these systems
that enable people to access databases and carry out their
analysis. Papers addressing Open Access are of special
interest. We also welcome submissions discussing the
technical aspects of supporting Open Science, in particular
reproducibility of research, sharing of scientific workflows
and linking research data with publications. Finally, we also
invite papers discussing issues and current challenges in the
design of these systems.
Topics of interest relevant to theme 2 include, but are not limited to:
-
Novel information extraction and text-mining approaches to
semantic enrichment of publications.
This might range from
mining publication structure, such as title, abstract, authors,
citation information etc. to more challenging tasks, such as
extracting names of applied methods, research questions (or
scientific gaps), identifying parts of the scholarly discourse
structure etc.
-
Automatic categorization and clustering of scientific
publications.
Methods that can automatically categorize
publications according to an established subject-based
classification/taxonomy (such as Library of Congress
classification, UNESCO thesaurus, DOAJ subject
classification, Library of Congress Subject Headings) are of
particular interest. Other approaches might involve
automatic clustering or classification of research
publications according to various criteria.
-
New methods and models for connecting and interlinking
scientific publications.
Scientific publications in digital
libraries are not isolated islands. Connecting publications
using explicitly defined citations is very restrictive and has
many disadvantages. We are interested in innovative
technologies that can automatically connect and interlink
publications or parts of publications according to various
criteria, such as semantic similarity, contradiction, argument
support or other relationship types.
-
Models for semantically representing and annotating publications.
This topic is related to the aspect of
semantically modeling publications and scholarly discourse.
Models that are practical with respect to the state-of-the-art
in Natural Language Processing (NLP) technologies are of
special interest.
-
Semantically enriching/annotating publications by
crowdsourcing.
Crowdsourcing can be used in innovative
ways to annotate publications with richer metadata or to
approve/disapprove annotations created using text-mining or
other approaches. We welcome papers that address the
following questions: (a) what incentives should be provided
to motivate users to contribute, (b) how to apply
crowdsourcing in the specialized domains of scientific
publications, and (c) for which tasks in the domain of organising
scientific publications crowdsourcing is suitable and
where it might fail. Other crowdsourcing topics
relevant to the domain of scientific publications are also welcome.
Topics of interest relevant to theme 3 include, but are not limited to:
-
New methods, models and innovative approaches for
measuring impact of publications.
The most widely used
metrics for measuring impact are based on citations. However,
counting citations does not take into account the publication
content or the qualitative nature of the citation. In addition,
there is a delay between the publication and the measurable
impact in citations. We particularly encourage papers
addressing new ways of evaluating publications’ impact
beyond standard citation measures.
-
New methods for measuring performance of researchers.
Methods for assessing impact of a publication can often be
extended to methods that can assess the impact of individual
researchers. However, there are also other criteria for
measuring impact in addition to publications, such as the
development and publication of research data or economic and
market impact, that should also be taken into account. We
welcome papers addressing these aspects.
-
Evaluating impact of research groups.
What holds for the impact of individuals
also holds for research groups and communities.
-
Methods for identifying research trends and cross-fertilization
between research disciplines.
Identifying research trends
should allow discovering newly emerging disciplines or it
should help to explain why certain fields are attracting the
attention of a wider research community. Such monitoring is
important for research funders and governments in order to be
able to quickly respond to new developments. We invite
papers discussing new methods for identifying trends and
cross-fertilization between research disciplines using methods
ranging from social network analysis and text- and data-mining
to innovative visualization approaches.
-
Applications and case studies of mining from scientific
databases and publications.
New methods and models
developed for mining from scientific publications can be
applied in many different scenarios, such as improving access
to scientific publications, providing exploratory search in
digital collections or identifying experts. We encourage papers
describing innovative approaches that use scientific
publications and data to solve real-world problems.
-
Improving the infrastructure of repositories to support the
development and integration of new impact and performance
metrics.
New ways of improving the repository infrastructure
can include, for example, tracking accesses and downloads,
researcher profiling and the interlinking of repository data
with external services. These can in turn be used for
developing new impact metrics. We welcome papers
addressing these issues.
3. SPECIAL OPEN PUBLICATIONS DATASET TRACK
This year we would like to invite the workshop participants to
make use of the CORE publications dataset containing a large
volume of research publications from a wide variety of research
areas. The dataset contains not only full-texts, but also an enriched
version of publications’ metadata. The aim is to provide a
framework for developing and testing methods and tools
addressing the workshop topics. The use of this dataset is not
mandatory; however, it is encouraged.
The dataset is now available through the CORE portal here.
4. EXPECTED AUDIENCE
The workshop on Mining Scientific Publications aims to bring
together researchers, digital library developers and practitioners
from government and industry to address the current challenges in
the domain of mining scientific publications.
5. PREVIOUS ORGANISATION
The 1st International Workshop on Mining Scientific Publications was held in conjunction with JCDL 2012. The 2nd run of this workshop was held in conjunction with JCDL 2013. Both runs of the workshop
have been extremely successful in terms of attracting submissions
and participants from leading institutions in the area, such as
British Library, Elsevier Labs, National Library of Medicine,
Library of Congress, University of Pennsylvania (CiteSeerX) or
Mendeley. The submissions from both of these workshops have
been published as a special issue in D-Lib.
6. SUBMISSION FORMAT
We invite submissions related to the workshop’s topics. Long
papers should not exceed 8 pages and short papers should not
exceed 4 pages of the ACM style. Furthermore, we welcome
demo presentations of systems or methods. A demonstration
submission should consist of a maximum two-page description of
the system, method or tool to be demonstrated.
Papers should be submitted using the EasyChair system provided here.
Successful submissions will be published in the D-Lib Magazine.
The proceedings of the 1st International Workshop on Mining Scientific Publications are available here.
The proceedings of the 2nd International Workshop on Mining Scientific Publications are available here.
6. KEYNOTE SPEAKERS
The workshop will include keynote presentations from Dr. C. Lee Giles and Prof. Birger Larsen.
Dr. C. Lee Giles is the David Reese Professor of Information Sciences and Technology at the Pennsylvania State University with appointments in the departments of Computer Science and Engineering, and Supply Chain and Information Systems. His research interests are in intelligent cyberinfrastructure and big data, web tools, specialty search engines, information retrieval, digital libraries, web services, knowledge and information extraction, data mining, entity disambiguation, and social networks. He has published nearly 400 papers in these areas with over 24,000 citations and an h-index of 73 according to Google Scholar. He was a co-creator of the popular search engine CiteSeer (now CiteSeerX) and related scholarly search engines. He is a fellow of the ACM, IEEE, and INNS.
Information Extraction and Data Mining for Scholarly Big Data
Collections of scholarly documents are usually not thought of as big data. However, large collections of scholarly documents often have many millions of publications, authors, citations, equations, figures, etc., and large scale related data and structures such as social networks, slides, data sets, etc. We discuss the size of scholarly big data and present challenges, insights, methodologies and applications. We illustrate scholarly big data issues with examples of specialized search engines and recommendation systems based on the SeerSuite software. Using information extraction and data mining, we illustrate applications in such diverse areas as computer science, chemistry, archaeology, acknowledgements, citation recommendation, collaboration recommendation, and others.
References:
Madian Khabsa, C. Lee Giles. "The Number of Scholarly Documents on the Public Web." PLoS ONE, 2014.
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Douglas Jordan, C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." To appear in: Twenty-Sixth Annual Conference on Innovative Applications of Artificial Intelligence, 2014.
Birger Larsen is Professor in Information Analysis and Information Retrieval at the Department of Communication at Aalborg University Copenhagen. His main research interests include Information Retrieval (IR), structured documents in IR, XML IR and user interaction, domain specific search, understanding user intents and exploiting context in IR, as well as Informetrics/Bibliometrics, citation analysis and quantitative research evaluation.
Developing benchmark datasets of scholarly documents and investigating the use of anchor text in physics retrieval
Anchor text, the text clicked on or immediately surrounding an outgoing hyperlink, has been used successfully to represent the document linked to in web search (e.g. Brin & Page, 1998). Ritchie and colleagues have shown the potential for exploiting anchor text in scientific and scholarly retrieval, but on fairly small and not publicly available datasets in the field of Computational Linguistics (Ritchie, Teufel and Robertson, 2009). As citation behaviour differs considerably between fields (Moed, 2005) we revisit the question of the usefulness of anchor text in a different field: physics. Using the iSearch test collection (Lykke et al., 2010) and full text versions of its documents now freely available from the electronic preprint archive arXiv.org, we investigate 1) whether we can identify citations and anchor text reliably using iSearch and the LaTeX arXiv source documents, 2) how to integrate this into ranking in a language modelling framework and 3) the optimal anchor text window size for this task. The presentation also introduces iSearch and discusses the possibilities for building even larger scientific test collections for IR.
7. IMPORTANT DATES
July 13, 2014 - Submission deadline
July 20, 2014 - Submission deadline (extended)
August 11, 2014 - Notification of acceptance
August 25, 2014 - Camera-ready
September 12, 2014 - Workshop
8. PROGRAM
Introduction and keynote 1
Chair: Petr Knoth
09:00-09:10 | Introduction
09:10-09:45 | Keynote talk | Information Extraction and Data Mining for Scholarly Big Data | Dr. C. Lee Giles
Paper session 1
Chair: Drahomira Herrmannova
09:45-10:10 | Long paper | A Comparison of two Unsupervised Table Recognition Methods from Digital Scientific Articles | Stefan Klampfl, Kris Jack and Roman Kern | Nominated for best paper award
10:10-10:30 | Short paper | A Keyquery-Based Classification System for CORE | Michael Völske, Tim Gollub, Matthias Hagen and Benno Stein | Nominated for best paper award
10:30-10:50 | Short paper | Discovering and visualizing interdisciplinary content classes in scientific publications | Theodoros Giannakopoulos, Ioannis Foufoulas, Eleftherios Stamatogiannakis, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis
10:50-11:10 | Break
Paper session 2
Chair: Nuno Freire
11:10-11:35 | Long paper | Efficient blocking method for a large scale citation matching | Mateusz Fedoryszak and Łukasz Bolikowski
11:35-12:00 | Long paper | Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers | Giovanni Yoko Kristianto, Goran Topic and Akiko Aizawa
12:00-12:20 | Short paper | Towards a Marketplace for the Scientific Community: Accessing Knowledge from the Computer Science Domain | Mark Kröll, Stefan Klampfl and Roman Kern
12:20-12:40 | Short paper | Experiments on Rating Conferences with CORE and DBLP | Irvan Jahja, Suhendry Effendy and Roland Yap
12:40-13:00 | Short paper | A new semantic similarity based measure for assessing research contribution | Petr Knoth and Drahomira Herrmannova | Nominated for best paper award
13:00-13:10 | Presentation | Elsevier's Text and Data Mining Policy | Gemma Hersh
13:10-14:00 | Lunch
Keynote 2
Chair: Zdenek Zdrahal
14:00-14:35 | Keynote talk | Developing benchmark datasets of scholarly documents and investigating the use of anchor text in physics retrieval | Birger Larsen
Demo session
Chair: Loukas Anastasiou
14:35-14:50 | Demo paper | AMI-diagram: Mining Facts from Images | Peter Murray-Rust, Richard Smith-Unna and Ross Mounce
14:50-15:05 | Demo paper | Annota: Towards Enriching Scientific Publications with Semantics and User Annotations | Michal Holub, Róbert Móro, Jakub Ševcech, Martin Lipták and Maria Bielikova
15:05-15:20 | Demo paper | The ContentMine scraping stack: literature-scale content mining with community maintained collections of declarative scrapers | Richard Smith-Unna and Peter Murray-Rust
15:20-15:35 | Break
Paper session 3
Chair: Zdenek Zdrahal
15:35-16:00 | Long paper | GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles | Dominika Tkaczyk, Pawel Szostek and Lukasz Bolikowski | Nominated for best paper award
16:00-16:25 | Long paper | The Architecture and Datasets of Docear’s Research Paper Recommender System | Joeran Beel, Stefan Langer, Bela Gipp, and Andreas Nürnberger
16:25-16:50 | Long paper | Social, Political and Legal Aspects of Text and Data Mining | Michelle Brook, Peter Murray-Rust and Charles Oppenheim
Closing
Chair: Kris Jack
16:50-17:00 | Closing
After the workshop, there will be a social dinner at 5:30pm, open to all attendees, in the Wilmington pub.
The Wilmington pub is located at 69 Rosebery Avenue, Clerkenwell, EC1R 4RL, very close to the workshop venue: only a 7-minute walk.
9. ORGANIZING COMMITTEE
Petr Knoth, Knowledge Media Institute, The Open University, UK
Zdenek Zdrahal, Knowledge Media Institute, The Open University, UK
Stelio Piperidis, Institute for Language and Speech Processing (META-SHARE), Athena Research Center, Greece
Nuno Freire, The European Library/Europeana, The Netherlands
Kris Jack, Mendeley Ltd., United Kingdom
Drahomira Herrmannova, Knowledge Media Institute, The Open University, UK
Lucas Anastasiou, Knowledge Media Institute, The Open University, UK
10. PROGRAMME COMMITTEE
Bruno Martins, Technical University of Lisbon (IST), Portugal
Eloy Rodrigues, University of Minho, Portugal
Francesco Osborne, Knowledge Media Institute, The Open University, UK
Iryna Gurevych, Darmstadt University of Technology, Germany
Martin Klein, Los Alamos National Laboratory, USA
Natalia Manola, University of Athens, Greece
Paolo Manghi, ISTI-CNR (DRIVER, OpenAIRE), Italy
Pável Calado, Technical University of Lisbon (IST), Portugal
Robert M. Patton, Oak Ridge National Laboratory, USA
Robert Sanderson, Digital Library Systems and Services, Stanford, USA
Roman Kern, Know Center Graz, Austria
Tanja Urbancic, Jožef Stefan Institute, Slovenia
Wojtek Sylwestrzak, University of Warsaw, Poland
Ziqi Zhang, Department of Computer Science, University of Sheffield, UK
11. LOCATION
City University London
College and Social Sciences Buildings
St. John Street
London EC1V 0HB
Please note that this is a different venue from that of the main conference.