Refine crossref call; documentation on deep learning models and consolidation

Former-commit-id: a01aa93
kermitt2 committed May 27, 2019
1 parent 38a7056 commit eca76e4
Showing 8 changed files with 186 additions and 73 deletions.
47 changes: 47 additions & 0 deletions doc/Consolidation.md
@@ -0,0 +1,47 @@
# Consolidation

In GROBID, we call __consolidation__ the use of an external bibliographical service to correct and complement the results extracted by the tool. GROBID usually extracts a core of bibliographical information in a relatively reliable manner, which can then be matched against the complete bibliographical records made available by these services.

Consolidation has two main benefits:

* it improves very significantly the retrieval of header information (+12 to 13 f-score points, e.g. from an average f-score of 74.59 for all fields with Ratcliff/Obershelp similarity at 0.95 to 86.62, using biblio-glutton and GROBID version 0.5.5 on the PMC 1942 dataset, see the [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)),

* it matches an extracted bibliographical reference against known publications and complements the parsed reference with various metadata, in particular the DOI, making it possible to build a citation graph and to link the extracted references to external services.

GROBID supports two consolidation services:

* [CrossRef REST API](https://github.com/CrossRef/rest-api-doc) (default)

* [biblio-glutton](https://github.com/kermitt2/biblio-glutton)

## CrossRef REST API

The advantage of __CrossRef__ is that it is available without any further installation. However, it has a limited query rate (in practice around 25 queries per second), which makes scaling impossible when the bibliographical references of several documents are processed in parallel. In addition, the metadata it returns is limited to what is available at CrossRef.

To use the [CrossRef REST API reliably and politely](https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service), it is highly recommended to add a contact email to the queries. This is done in GROBID by editing the properties file under `grobid-home/config/grobid.properties`:

```
[email protected]
```

Without this email, the service might be unreliable, with numerous query failures over time. GROBID's usage of the CrossRef REST API respects the query rate indicated dynamically by the service in each response, so CrossRef should normally never have to report any issue via this email.
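For illustration, such a polite query looks like the standalone sketch below. It mirrors what GROBID sends but is not GROBID's actual client code; the CrossRef `works` endpoint and the `query.bibliographic`/`mailto` parameters are those of the public REST API, while the citation string and contact email are placeholders.

```java
import java.net.URI;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class PoliteCrossrefQuery {
    public static void main(String[] args) throws Exception {
        // bibliographic query against the CrossRef works endpoint, with the contact
        // email passed both as "mailto" parameter and in the User-Agent header
        URI uri = new URIBuilder("https://api.crossref.org/works")
                .setParameter("query.bibliographic",
                        "Smith J. et al. An example article title. Journal of Examples, 2018.") // placeholder raw citation
                .setParameter("rows", "1")
                .setParameter("mailto", "name@example.com") // placeholder contact email
                .build();

        HttpGet httpget = new HttpGet(uri);
        httpget.setHeader("User-Agent",
                "GROBID/0.5.5 (https://github.com/kermitt2/grobid; mailto:name@example.com)");

        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(httpget)) {
            // the JSON response lists the best matching records with their DOI and full metadata
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```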

## biblio-glutton

This service presents several advantages compared with the CrossRef service. biblio-glutton can scale as required by adding more Elasticsearch nodes, allowing the processing of several PDFs per second. The metadata provided by the service is richer: in addition to the CrossRef metadata, biblio-glutton also returns the PubMed and PubMed Central identifiers, ISTEX identifiers, PII, and the URL of the Open Access version of the full text following the Unpaywall dataset. Finally, the bibliographical reference matching is [slightly more reliable](https://github.com/kermitt2/biblio-glutton#matching-accuracy).

Unfortunately, you need to install the service yourself, including loading and indexing the bibliographical resources, as documented [here](https://github.com/kermitt2/biblio-glutton#building-the-bibliographical-data-look-up-and-matching-databases). Note that a [Docker container](https://github.com/kermitt2/biblio-glutton#running-with-docker) is available.

After installing biblio-glutton, you need to select the glutton matching service in the `grobid-home/config/grobid.properties` file, with its host and port:

```
#-------------------- consolidation --------------------
# Define the bibliographical data consolidation service to be used, either "crossref" for the CrossRef REST API or "glutton" for https://github.com/kermitt2/biblio-glutton
#grobid.consolidation.service=crossref
grobid.consolidation.service=glutton
org.grobid.glutton.host=localhost
org.grobid.glutton.port=8080
```

Note that the GROBID online demo hosted [here](http://grobid.science-miner.com) uses biblio-glutton as its consolidation service.
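Whichever service is configured, consolidation is triggered through the regular GROBID calls. The snippet below is only an illustrative sketch: it posts a PDF to a local GROBID service and requests consolidation of the header and of the extracted citations; the port and the `consolidateHeader`/`consolidateCitations` parameter names and `1` values reflect the service API as we understand it and should be checked against the service documentation.

```java
import java.io.File;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ConsolidatedFulltext {
    public static void main(String[] args) throws Exception {
        // multipart request to a local GROBID service, asking for consolidation of the
        // header and of the extracted citations (parameter names/values are assumptions)
        HttpEntity entity = MultipartEntityBuilder.create()
                .addBinaryBody("input", new File("article.pdf")) // placeholder PDF
                .addTextBody("consolidateHeader", "1")
                .addTextBody("consolidateCitations", "1")
                .build();

        HttpPost post = new HttpPost("http://localhost:8070/api/processFulltextDocument");
        post.setEntity(entity);

        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(post)) {
            // TEI XML result, with consolidated metadata (e.g. DOIs) in the header and bibliography
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```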
31 changes: 31 additions & 0 deletions doc/Deep-Learning-models.md
@@ -0,0 +1,31 @@
# Using Deep Learning models instead of default CRF

## Integration with DeLFT

Since version 0.5.4, GROBID can use recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft). The neural models currently available are BidLSTM-CRF models with GloVe embeddings, which can be used as an alternative to the default Wapiti CRF.

Note that this currently only works on 64-bit Linux (only 64-bit architectures will be supported).

To use them:

- install [DeLFT](https://github.com/kermitt2/delft)

- indicate the path of the DeLFT install in `grobid.properties` (`grobid-home/config/grobid.properties`)

- change the engine from `wapiti` to `delft` in the `grobid.properties` file

Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI binding to CPython. This integration is about twice as fast as the TensorFlow Java API and significantly faster than RPC serving (see [this presentation](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink)), and it does not require modifying DeLFT, as would be the case with a Py4J gateway (socket-based).
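As an illustration of the mechanism only (this is not GROBID's actual bridge code), calling a Python function from Java through JEP looks roughly like the sketch below; the Python module and function names are hypothetical placeholders standing for a DeLFT-style tagger.

```java
import jep.SharedInterpreter;

public class JepSketch {
    public static void main(String[] args) throws Exception {
        // one embedded CPython sub-interpreter per thread, sharing a single native runtime
        try (SharedInterpreter interp = new SharedInterpreter()) {
            // hypothetical DeLFT-style tagging module and function, for illustration only
            interp.eval("from delft_like_module import tag_text");
            interp.set("text", "Introduction to GROBID");
            interp.eval("labels = tag_text(text)");
            // the result comes back as a regular Java object (typically a List of Strings here)
            Object labels = interp.getValue("labels");
            System.out.println(labels);
        }
    }
}
```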

There are no neural models for the segmentation and fulltext models, because their input sequences are far too large. The problem would need to be formulated differently for these tasks.

Low-level models that do not use layout features (author names, dates, affiliations...) perform similarly to CRF, but CRF is much better when layout features are involved (in particular for the header model), as the neural models do not use these additional features for the moment.

See some evaluations under `grobid-trainer/doc`.

Current neural models are 3-4 times slower than CRF, as we do not use batch processing for the moment. It is not clear how to apply batch processing with a cascading approach.

## Future improvements

ELMo embeddings have not been tried with the GROBID models yet, but they could make some models better than their CRF counterparts, although probably too slow for practical usage (we estimate they would make these models around 100 times slower than the current CRF). ELMo embeddings are already integrated in DeLFT.

However, we have also recently experimented with BERT fine-tuning for sequence labelling, in particular with [SciBERT](https://github.com/allenai/scibert) (a BERT model trained on Semantic Scholar full texts). We obtained excellent results with a runtime close to an RNN with GloVe embeddings (20 times faster than with ELMo embeddings). This is the target architecture for future GROBID Deep Learning models.
4 changes: 4 additions & 0 deletions doc/Principle.md
@@ -0,0 +1,4 @@
<h1>Principles</h1>



4 changes: 4 additions & 0 deletions doc/index.md
@@ -27,6 +27,8 @@

* [Coordinates of structures in the original PDF](Coordinates-in-PDF.md)

* [Adding a consolidation service](Consolidation.md)

* [Training and evaluating the GROBID models](Training-the-models-of-Grobid.md)

* [End to end evaluation](End-to-end-evaluation.md)
@@ -46,6 +48,8 @@

* [Developer guide](Developer-guide.md)

* [Using Deep Learning models instead of default CRF](Deep-Learning-models.md)

* [Recompiling and integrating CRF libraries into GROBID](Recompiling-and-integrating-CRF-libraries.md)


@@ -180,24 +180,27 @@ public BiblioItem consolidate(BiblioItem bib, String rawCitation) throws Excepti
// call with full raw string
if (arguments == null)
arguments = new HashMap<String,String>();
arguments.put("query.bibliographic", rawCitation);
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) ||
StringUtils.isBlank(doi) )
arguments.put("query.bibliographic", rawCitation);
//arguments.put("query", rawCitation);
}
if (StringUtils.isNotBlank(aut)) {
// call based on partial metadata
if (arguments == null)
arguments = new HashMap<String,String>();
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) || (arguments.size() == 0) )
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) ||
(StringUtils.isBlank(rawCitation) && StringUtils.isBlank(doi)) )
arguments.put("query.author", aut);
}
if (StringUtils.isNotBlank(title)) {
// call based on partial metadata
if (arguments == null)
arguments = new HashMap<String,String>();
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) || (arguments.size() == 0) )
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) ||
(StringUtils.isBlank(rawCitation) && StringUtils.isBlank(doi)) )
arguments.put("query.title", title);
}

if (StringUtils.isNotBlank(journalTitle)) {
// call based on partial metadata
if (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) {
@@ -251,7 +254,7 @@ public BiblioItem consolidate(BiblioItem bib, String rawCitation) throws Excepti
cntManager.i(ConsolidationCounters.CONSOLIDATION);
}

if ( (doi != null) && (cntManager != null) ) {
if ( StringUtils.isNotBlank(doi) && (cntManager != null) ) {
cntManager.i(ConsolidationCounters.CONSOLIDATION_PER_DOI);
doiQuery = true;
} else {
@@ -380,20 +383,24 @@ public Map<Integer,BiblioItem> consolidate(List<BibDataSet> biblios) {
// call with full raw string
if (arguments == null)
arguments = new HashMap<String,String>();
arguments.put("query.bibliographic", rawCitation);
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) ||
StringUtils.isBlank(doi) )
arguments.put("query.bibliographic", rawCitation);
}
if (StringUtils.isNotBlank(title)) {
// call based on partial metadata
if (arguments == null)
arguments = new HashMap<String,String>();
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) || (arguments.size() == 0) )
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) ||
(StringUtils.isBlank(rawCitation) && StringUtils.isBlank(doi)) )
arguments.put("query.title", title);
}
if (StringUtils.isNotBlank(aut)) {
// call based on partial metadata
if (arguments == null)
arguments = new HashMap<String,String>();
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) || (arguments.size() == 0) )
if ( (GrobidProperties.getInstance().getConsolidationService() != GrobidConsolidationService.CROSSREF) ||
(StringUtils.isBlank(rawCitation) && StringUtils.isBlank(doi)) )
arguments.put("query.author", aut);
}
if (StringUtils.isNotBlank(journalTitle)) {
@@ -451,7 +458,7 @@ else if (GrobidProperties.getInstance().getConsolidationService() == GrobidConso
cntManager.i(ConsolidationCounters.CONSOLIDATION);
}

if ( (doi != null) && (cntManager != null) ) {
if ( StringUtils.isNotBlank(doi) && (cntManager != null) ) {
cntManager.i(ConsolidationCounters.CONSOLIDATION_PER_DOI);
doiQuery = true;
} else {
@@ -21,6 +21,7 @@
import org.apache.http.HttpHost;
import org.apache.http.conn.params.*;
import org.apache.http.impl.conn.*;
import org.apache.http.params.HttpProtocolParams;

import org.apache.commons.io.IOUtils;
import java.net.URL;
@@ -131,9 +132,23 @@ public void execute() {
uriBuilder.setParameter(cursor.getKey(), cursor.getValue());
}

// "mailto" parameter to be used in the crossref query and in User-Agent
// header, as recommended by CrossRef REST API documentation, e.g. &[email protected]
if (GrobidProperties.getCrossrefMailto() != null) {
uriBuilder.setParameter("mailto", GrobidProperties.getCrossrefMailto());
}

//System.out.println(uriBuilder.toString());

// set recommended User-Agent header
HttpGet httpget = new HttpGet(uriBuilder.build());
if (GrobidProperties.getCrossrefMailto() != null) {
httpget.setHeader("User-Agent",
"GROBID/0.5.5 (https://github.com/kermitt2/grobid; mailto:" + GrobidProperties.getCrossrefMailto() + ")");
} else {
httpget.setHeader("User-Agent",
"GROBID/0.5.5 (https://github.com/kermitt2/grobid)");
}

ResponseHandler<Void> responseHandler = new ResponseHandler<Void>() {
