Commit: Refine crossref call; documentation on deep learning models and consolidation (Former-commit-id: a01aa93)
Showing 8 changed files with 186 additions and 73 deletions.
# Consolidation
In GROBID, we call __consolidation__ the use of an external bibliographical service to correct and complement the results extracted by the tool. GROBID usually extracts a core of bibliographical information in a relatively reliable manner, and this core can be used to match complete bibliographical records made available by these services.
Consolidation has two main benefits:
* it improves very significantly the retrieval of header information (+0.12 to +0.13 in f-score, e.g. from an average f-score of 74.59 for all fields with Ratcliff/Obershelp similarity at 0.95 to 86.62, using biblio-glutton and GROBID version 0.5.5 on the PMC 1942 dataset; see the [benchmarking documentation](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) and [reports](https://github.com/kermitt2/grobid/tree/master/grobid-trainer/doc)),
* it matches an extracted bibliographical reference against known publications and complements the parsed reference with various metadata, in particular the DOI, which makes it possible to build a citation graph and to link the extracted references to external services.
GROBID supports two consolidation services:
* [CrossRef REST API](https://github.com/CrossRef/rest-api-doc) (default)
* [biblio-glutton](https://github.com/kermitt2/biblio-glutton)
## CrossRef REST API
The advantage of the __CrossRef__ service is that it works without any further installation. However, it has a limited query rate (in practice around 25 queries per second), which makes scaling impossible when the bibliographical references of several documents are processed in parallel. In addition, the metadata it provides are limited to what is available at CrossRef.
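For illustration, a minimal client-side throttle that stays under such a fixed query rate could look like the sketch below. This is a simplified example, not GROBID's actual implementation, which follows the rate indicated dynamically by the CrossRef responses:

```java
// Minimal client-side rate limiter sketch: spaces requests so that at most
// maxPerSecond are issued. Illustrative only, not GROBID's actual client.
class SimpleRateLimiter {
    private final long minIntervalNanos;
    private long nextAllowedNanos = 0L;

    SimpleRateLimiter(int maxPerSecond) {
        this.minIntervalNanos = 1_000_000_000L / maxPerSecond;
    }

    // Blocks until the next request slot is available.
    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        if (now < nextAllowedNanos) {
            Thread.sleep((nextAllowedNanos - now) / 1_000_000L + 1);
            now = System.nanoTime();
        }
        nextAllowedNanos = now + minIntervalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleRateLimiter limiter = new SimpleRateLimiter(25); // ~25 queries per second
        long start = System.nanoTime();
        for (int i = 0; i < 3; i++) {
            limiter.acquire();
            // a real client would issue one CrossRef query here
        }
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000L);
    }
}
```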
To use the [CrossRef REST API reliably and politely](https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service), it is highly recommended to add a contact email to the queries. This is done in GROBID by editing the properties file under `grobid-home/config/grobid.properties`:
```
org.grobid.crossref.mailto=name@example.com
```
Without this email, the service might be unreliable, with numerous query failures over time. GROBID's usage of the CrossRef REST API respects the query rate indicated dynamically by each response of the service, so CrossRef should not have to report any issue via this email.
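A "polite" CrossRef query as described above can be sketched with the JDK's own HTTP client (Java 11+): the contact email goes both into the `mailto` query parameter and into the `User-Agent` header. The tool name, version, and email below are placeholders, not GROBID's actual values:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class PoliteCrossrefRequest {
    // Build a CrossRef works query carrying the contact email in the
    // "mailto" parameter and in the User-Agent header, per CrossRef etiquette.
    static HttpRequest build(String bibliographic, String mailto) {
        String query = "query.bibliographic="
                + URLEncoder.encode(bibliographic, StandardCharsets.UTF_8)
                + "&mailto=" + URLEncoder.encode(mailto, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create("https://api.crossref.org/works?" + query))
                .header("User-Agent", "my-tool/0.1 (mailto:" + mailto + ")")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("A sample article title", "name@example.com");
        System.out.println(req.uri());
    }
}
```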
## biblio-glutton
This service presents several advantages compared to the CrossRef service. biblio-glutton can scale as required by adding more Elasticsearch nodes, allowing several PDFs to be processed per second. The metadata provided by the service are richer: in addition to the CrossRef metadata, biblio-glutton also returns the PubMed and PubMed Central identifiers, ISTEX identifiers, PII, and the URL of the Open Access version of the full text following the Unpaywall dataset. Finally, the bibliographical reference matching is [slightly more reliable](https://github.com/kermitt2/biblio-glutton#matching-accuracy).
The drawback is that you need to install the service yourself, including loading and indexing the bibliographical resources, as documented [here](https://github.com/kermitt2/biblio-glutton#building-the-bibliographical-data-look-up-and-matching-databases). Note that a [docker container](https://github.com/kermitt2/biblio-glutton#running-with-docker) is available.
After installing biblio-glutton, select the glutton matching service in the `grobid-home/config/grobid.properties` file, with its host and port:
```
#-------------------- consolidation --------------------
# Define the bibliographical data consolidation service to be used, either "crossref" for CrossRef REST API or "glutton" for https://github.com/kermitt2/biblio-glutton
#grobid.consolidation.service=crossref
grobid.consolidation.service=glutton
org.grobid.glutton.host=localhost
org.grobid.glutton.port=8080
```
Note that the GROBID online demo hosted [here](http://grobid.science-miner.com) uses biblio-glutton as its consolidation service.
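To check a running biblio-glutton instance against the host and port configured above, one can build a lookup request directly; the `/service/lookup` route and `doi` parameter used below follow the biblio-glutton README:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build a biblio-glutton lookup URL for a DOI; route and parameter
// name per the biblio-glutton README, host/port matching grobid.properties.
class GluttonLookup {
    static URI lookupByDoi(String host, int port, String doi) {
        return URI.create("http://" + host + ":" + port
                + "/service/lookup?doi=" + URLEncoder.encode(doi, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // example DOI, matching the default configuration above
        System.out.println(lookupByDoi("localhost", 8080, "10.1038/nature12373"));
    }
}
```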
# Using Deep Learning models instead of default CRF
## Integration with DeLFT
Since version 0.5.4, GROBID can use recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft). The neural model currently available is a BidLSTM-CRF with GloVe embeddings, which can be used as an alternative to the default Wapiti CRF.
Note that this only works on 64-bit Linux for the moment (only 64-bit architectures will be supported).
To use them:
- install [DeLFT](https://github.com/kermitt2/delft)
- indicate the path of the DeLFT installation in `grobid.properties` (`grobid-home/config/grobid.properties`)
- change the engine from `wapiti` to `delft` in the `grobid.properties` file
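Put together, the last two steps might look like the following fragment of `grobid-home/config/grobid.properties`. The property names here are illustrative assumptions, not confirmed keys; check the keys actually present in your version of the file:

```
# sequence labelling engine: "wapiti" (default CRF) or "delft"
grobid.crf.engine=delft
# path to the local DeLFT installation (hypothetical key, check your file)
grobid.delft.install=/path/to/delft
```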
Integration is realized via [JEP (Java Embedded Python)](https://github.com/ninia/jep), which uses a JNI binding of CPython. This integration is twice as fast as the TensorFlow Java API and significantly faster than RPC serving (see [this presentation](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink)), and it does not require modifying DeLFT, as would be the case with a Py4J gateway (socket-based).
There is no neural alternative for the segmentation and fulltext models, because their input sequences are far too long. The problem would need to be formulated differently for these tasks.
Low-level models that do not use layout features (author names, dates, affiliations...) perform similarly to CRF, but CRF is much better when layout features are involved (in particular for the header model). However, the neural models do not use these additional features for the moment.
See some evaluation under `grobid-trainer/docs`.
Current neural models are 3-4 times slower than CRF: we do not use batch processing for the moment, and it is not clear how to use batch processing with a cascading approach.
## Future improvements
ELMo embeddings have not been tried with the GROBID models yet. They could make some models better than their CRF counterparts, although probably too slow for practical usage (by our estimate, they would make these models 100 times slower than the current CRF). ELMo embeddings are already integrated in DeLFT.
However, we have also recently experimented with BERT fine-tuning for sequence labelling, in particular with [SciBERT](https://github.com/allenai/scibert) (a BERT base model trained on Wikipedia and some semantic-scholar full texts). We obtained excellent results with a runtime close to an RNN with GloVe embeddings (20 times faster than with ELMo embeddings). This is the target architecture for future GROBID Deep Learning models.
<h1>Principles</h1>
```
@@ -21,6 +21,7 @@
import org.apache.http.HttpHost;
import org.apache.http.conn.params.*;
import org.apache.http.impl.conn.*;
import org.apache.http.params.HttpProtocolParams;

import org.apache.commons.io.IOUtils;
import java.net.URL;

@@ -131,9 +132,23 @@ public void execute() {
    uriBuilder.setParameter(cursor.getKey(), cursor.getValue());
}

// "mailto" parameter to be used in the crossref query and in the User-Agent
// header, as recommended by the CrossRef REST API documentation, e.g. &[email protected]
if (GrobidProperties.getCrossrefMailto() != null) {
    uriBuilder.setParameter("mailto", GrobidProperties.getCrossrefMailto());
}

//System.out.println(uriBuilder.toString());

// set the recommended User-Agent header
HttpGet httpget = new HttpGet(uriBuilder.build());
if (GrobidProperties.getCrossrefMailto() != null) {
    httpget.setHeader("User-Agent",
        "GROBID/0.5.5 (https://github.com/kermitt2/grobid; mailto:" + GrobidProperties.getCrossrefMailto() + ")");
} else {
    httpget.setHeader("User-Agent",
        "GROBID/0.5.5 (https://github.com/kermitt2/grobid)");
}

ResponseHandler<Void> responseHandler = new ResponseHandler<Void>() {
```