Our manuscript "ParsRec: A Novel Meta-Learning Approach to Recommending Bibliographic Reference Parsers" was accepted for publication at the 26th Irish Conference on Artificial Intelligence and Cognitive Science (AICS). It is an extended version of our recently presented poster "ParsRec: Meta-Learning Recommendations for Bibliographic Reference Parsing" at the ACM RecSys conference.

Bibliographic reference parsing is a well-known task in scientific information extraction and document engineering. In reference parsing, the input is a single reference string, formatted in a specific bibliography style (Fig. 1). The output is a machine-readable representation of the input string, typically called a parsed reference (Fig. 2). A parsed reference is a collection of metadata fields, each of which is composed of a metadata type and a value (e.g. "2018" or "AICS"). Bibliographic reference parsing is useful for identifying cited documents, a task also known as citation matching. Citation matching is required for assessing the impact of researchers, journals and research institutions, and for calculating document similarity in the context of academic search engines and recommender systems.

Fig. 1. An example bibliographic reference string that could be the input of reference parsing.

Fig. 2. A machine-readable representation of the reference string from Fig. 1. The marked metadata fields are of types: author name (2 fields), title, journal, volume, issue, year, pages.

There exist many ready-to-use open-source reference parsers. Recently we compared the performance of ten of them: Anystyle-Parser, Biblio, CERMINE, Citation, Citation-Parser, GROBID, ParsCit, PDFSSA4MET, Reference Tagger and Science Parse. The overall parsing results varied greatly, with F1 ranging from 0.27 for Citation-Parser to 0.89 for GROBID. Our results also showed that different tools have different strengths and weaknesses. For example, ParsCit is ranked 3rd overall but is best at extracting author names, while Science Parse, ranked 4th overall, is best at extracting the year.

These results suggest that there is no single best parser. Instead, different parsers might give the best results for different metadata types and different reference strings. Consequently, we hypothesize that if we were able to accurately choose the best parser for a given scenario, the overall quality of the results should increase. This can be seen as a typical recommendation problem: a user (e.g. a software developer or a researcher) needs the item (reference parser) that best satisfies the user's needs (high quality of metadata fields extracted from reference strings).

In this paper we propose ParsRec, a novel meta-learning recommender system for bibliographic reference parsers. ParsRec takes as input a reference string, identifies the potentially best reference parser(s), applies the chosen parser(s), and outputs the metadata fields. ParsRec is built upon the ten open-source parsers mentioned before and uses supervised machine learning to recommend the best parser(s) for the input reference string. The novel aspects of ParsRec are: 1) considering reference parsing as a recommendation problem, and 2) using a meta-learning-based hybrid approach for reference parsing. This paper is an extended version of a poster published at the 12th ACM Conference on Recommender Systems 2018 (RecSys).

Reference parsers often use regular expressions, hand-crafted rules, and template matching (Biblio, Citation, Citation-Parser, PDFSSA4MET, and BibPro). Typically, the most effective approach to reference parsing is supervised machine learning, such as Conditional Random Fields (ParsCit, GROBID, CERMINE, Anystyle-Parser, Reference Tagger and Science Parse), or Recurrent Neural Networks combined with Conditional Random Fields (Neural ParsCit). Some reference parsers are parts of larger systems for information extraction from scientific papers; examples include PDFX, ParsCit, GROBID, CERMINE, Icecite and Team-Beam. These systems automatically extract machine-readable information, such as metadata, bibliography, logical structure, or full text, from unstructured documents. To the best of our knowledge, all open-source reference parsers are based on a single technique; none of them uses any ensemble, hybrid or meta-learning techniques. Meta-learning is a technique often applied to the problem of algorithm selection.
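The notion of a parsed reference described above can be sketched as a small data model: a collection of metadata fields, each a (type, value) pair, where a type may repeat (e.g. one field per author name, as in Fig. 2). This is a minimal illustrative sketch, not the schema of any particular parser; the example values are placeholders.

```python
# Sketch of the data model: a parsed reference is a collection of
# metadata fields, each composed of a metadata type and a value.
from typing import List, NamedTuple


class MetadataField(NamedTuple):
    type: str   # e.g. "year"
    value: str  # e.g. "2018"


ParsedReference = List[MetadataField]


def field_values(parsed: ParsedReference, field_type: str) -> List[str]:
    """Collect all values of a given metadata type (a type may repeat,
    e.g. one field per author name)."""
    return [f.value for f in parsed if f.type == field_type]


# Illustrative parsed reference using the field types from Fig. 2.
example: ParsedReference = [
    MetadataField("author", "A. Author"),
    MetadataField("author", "B. Author"),
    MetadataField("title", "An Example Title"),
    MetadataField("journal", "Journal of Examples"),
    MetadataField("volume", "12"),
    MetadataField("issue", "3"),
    MetadataField("year", "2018"),
    MetadataField("pages", "1-10"),
]

print(field_values(example, "author"))  # both author fields
```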
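The meta-learning idea above — choosing among candidate parsers instead of committing to one — can be sketched as follows. This is not ParsRec's actual model: the per-parser quality table and the `hybrid_parse` helper are illustrative assumptions (the parser names are real, and the table mirrors the observations above that ParsCit is strongest on author names and Science Parse on the year); a learned meta-model would supply these scores per input string.

```python
# Sketch of meta-learning as parser selection: estimate, per metadata
# type, which candidate parser extracts that field best, then build a
# hybrid output by taking each field from its recommended parser.
from typing import Callable, Dict, List

# Illustrative quality estimates: parser -> metadata type -> expected F1.
# In a real system these would come from a trained meta-learner.
QUALITY: Dict[str, Dict[str, float]] = {
    "GROBID":        {"author": 0.85, "year": 0.90},
    "ParsCit":       {"author": 0.92, "year": 0.80},
    "Science Parse": {"author": 0.70, "year": 0.95},
}


def recommend_parser(field_type: str) -> str:
    """Pick the parser with the highest expected quality for a field type."""
    return max(QUALITY, key=lambda p: QUALITY[p].get(field_type, 0.0))


def hybrid_parse(reference: str,
                 parsers: Dict[str, Callable[[str], Dict[str, str]]],
                 field_types: List[str]) -> Dict[str, str]:
    """Assemble the output by taking each field from its recommended parser."""
    out: Dict[str, str] = {}
    for ftype in field_types:
        chosen = recommend_parser(ftype)
        out[ftype] = parsers[chosen](reference).get(ftype, "")
    return out
```

With the table above, author names would be taken from ParsCit's output and the year from Science Parse's, which is the hybrid behavior the hypothesis predicts should beat any single parser.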