
Commit 128d809

Adding documentation and requirements files for conda (GPU version is still a draft... not yet tested)

Former-commit-id: cd35302
lfoppiano committed Jul 26, 2019
1 parent aba13b8 commit 128d809
Showing 4 changed files with 211 additions and 10 deletions.
73 changes: 64 additions & 9 deletions doc/Deep-Learning-models.md
@@ -4,28 +4,83 @@

Since version 0.5.4, it is possible to use recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft) in GROBID. The neural architecture currently available is BidLSTM-CRF with GloVe embeddings, which can be used as an alternative to the default Wapiti CRF.

This architecture has been tested on Linux 64-bit and macOS (only 64-bit architectures are supported).

Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI binding of CPython. This integration is about twice as fast as the TensorFlow Java API and significantly faster than RPC serving (see https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink), and it does not require modifying DeLFT, as would be the case with a Py4J gateway (socket-based).
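
As a purely illustrative sketch (this is not GROBID's actual integration code; it assumes a recent JEP release where `SharedInterpreter` is available, and the DeLFT path is hypothetical), embedding CPython in the JVM with JEP looks roughly like this:

```java
import jep.SharedInterpreter;

// Minimal embedding of CPython inside the JVM via JEP; requires the
// jep native library to be resolvable on java.library.path.
public class JepSketch {
    public static void main(String[] args) throws Exception {
        try (SharedInterpreter interp = new SharedInterpreter()) {
            interp.exec("import sys");
            interp.exec("sys.path.append('../delft')"); // hypothetical DeLFT install path
            interp.exec("print(sys.version)");          // Python runs in-process, no RPC round-trips
        }
    }
}
```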

There are no neural models for the segmentation and fulltext tasks, because the input sequences are far too large. The problem would need to be formulated differently for these tasks.

Low-level models not using layout features (author names, dates, affiliations...) perform similarly to CRF, but CRF is much better when layout features are involved (in particular for the header model). The neural models do not use these additional features for the moment.

See some evaluation under `grobid-trainer/docs`.

Current neural models are 3-4 times slower than CRF: we do not use batch processing for the moment, and it is not clear how to use batch processing with a cascading approach.

### Getting started with DL

#### Classic Python

- install [DeLFT](https://github.com/kermitt2/delft)

- indicate the path of the DeLFT install in `grobid.properties` (`grobid-home/config/grobid.properties`)

- change the engine from `wapiti` to `delft` in the same `grobid.properties` file (see the example after this list)

- run grobid

> ./gradlew run
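
For reference, the relevant lines of `grobid-home/config/grobid.properties` then look like the following (`../delft` is the default relative path and will differ if DeLFT is installed elsewhere):

```
grobid.crf.engine=delft
grobid.delft.install=../delft
```
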
#### Anaconda

- create the conda environment

CPU only:

> conda create --name grobidDelft --file requirements.conda.delft.cpu.txt

GPU (has not been tested):

> conda create --name grobidDelft --file requirements.conda.delft.gpu.txt
- activate the environment:

> conda activate grobidDelft
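
Optionally, as a quick sanity check that the environment resolves the pinned packages (the requirements files pin TensorFlow 1.12):

> python -c "import tensorflow; print(tensorflow.__version__)"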
- install [DeLFT](https://github.com/kermitt2/delft), ignoring the `pip install` command given in the DeLFT documentation (the dependencies are already pinned in the conda environment)

- indicate the path of the DeLFT install in `grobid.properties` (`grobid-home/config/grobid.properties`)

- change the engine from `wapiti` to `delft` in the same `grobid.properties` file

- run grobid

> ./gradlew run
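
When using conda, GROBID can also be pointed directly at the environment via the `grobid.delft.python.virtualEnv` property in `grobid.properties` (empty by default; the path below is only an example, use the path of your own environment):

```
grobid.delft.python.virtualEnv=/anaconda3/envs/grobidDelft
```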

## Future improvements

ELMo embeddings have not been tried with the GROBID models yet; they could make some models better than their CRF counterparts, although probably too slow for practical usage (we estimate they would make these models around 100 times slower than the current CRF). ELMo embeddings are already integrated in DeLFT.

However, we have also recently experimented with BERT fine-tuning for sequence labelling, and more particularly with [SciBERT](https://github.com/allenai/scibert) (a BERT base model trained on Wikipedia and Semantic Scholar full texts).
We got excellent results with a runtime close to RNN with GloVe embeddings (20 times faster than with ELMo embeddings). This is the target architecture for future GROBID Deep Learning models.


## Troubleshooting

1. If there is a dependency problem when JEP starts, the virtual machine usually crashes.
We are still investigating this part; please feel free to submit an issue if you run into such problems.
See the discussion [here](https://github.com/kermitt2/grobid/pull/454)

2. If GLIBC causes an error such as:
```
! ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/Luca/.conda/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
```

here is a quick solution (libgcc may already be installed; if so, just skip that step):


> conda install libgcc
> export LD_PRELOAD=$anaconda_path/lib/libstdc++.so.6.0.25
> ./gradlew run
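
Here `$anaconda_path` stands for the root of your Anaconda install. If you are unsure which `libstdc++` your environment actually ships, you can check with a plain shell command, e.g.:

> find $anaconda_path/lib -name 'libstdc++.so.6*'
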
2 changes: 1 addition & 1 deletion grobid-home/config/grobid.properties
@@ -37,7 +37,7 @@ grobid.crf.engine=delft
grobid.delft.install=../delft
grobid.delft.useELMo=false
grobid.delft.python.virtualEnv=
#/anaconda3/envs/tensorflow
grobid.delft.redirect.output=true
grobid.pdf.blocks.max=100000
grobid.pdf.tokens.max=1000000

73 changes: 73 additions & 0 deletions requirements.conda.delft.cpu.txt
@@ -0,0 +1,73 @@
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
_tflow_select=2.3.0=mkl
absl-py=0.7.1=py36_0
astor=0.7.1=py36_0
blas=1.0=mkl
c-ares=1.15.0=h7b6447c_1
ca-certificates=2019.5.15=0
certifi=2019.6.16=py36_0
gast=0.2.2=py36_0
google-pasta=0.1.7=py_0
grpcio=1.16.1=py36hf8bcb03_1
h5py=2.9.0=py36h7918eee_0
hdf5=1.10.4=hb1b8bf9_0
intel-openmp=2019.4=243
jep=3.8.2=pypi_0
joblib=0.13.2=pypi_0
keras=2.2.4=0
keras-applications=1.0.8=py_0
keras-base=2.2.4=py36_0
keras-bert=0.70.1=pypi_0
keras-embed-sim=0.7.0=pypi_0
keras-layer-normalization=0.12.0=pypi_0
keras-multi-head=0.20.0=pypi_0
keras-pos-embd=0.11.0=pypi_0
keras-position-wise-feed-forward=0.6.0=pypi_0
keras-preprocessing=1.1.0=py_1
keras-self-attention=0.41.0=pypi_0
keras-transformer=0.29.0=pypi_0
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc=7.2.0=h69d50b8_2
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libprotobuf=3.8.0=hd408876_0
libstdcxx-ng=9.1.0=hdf63c60_0
markdown=3.1.1=py36_0
mkl=2019.4=243
mkl_fft=1.0.12=py36ha843d7b_0
mkl_random=1.0.2=py36hd81dba3_0
mock=3.0.5=py36_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h7e9f1db_0
numpy-base=1.16.4=py36hde5b4d6_0
openssl=1.1.1c=h7b6447c_1
pip=19.1.1=py36_0
protobuf=3.8.0=py36he6710b0_0
python=3.6.8=h0371630_0
python-lmdb=0.94=py36h14c3975_0
pyyaml=5.1.1=py36h7b6447c_0
readline=7.0=h7b6447c_5
regex=2019.06.05=py36h7b6447c_0
scikit-learn=0.21.2=pypi_0
scipy=1.2.1=py36h7c811a0_0
setuptools=41.0.1=py36_0
six=1.12.0=py36_0
sklearn=0.0=pypi_0
sqlite=3.29.0=h7b6447c_0
tensorboard=1.12.2=py36he6710b0_0
tensorflow=1.12.0=mkl_py36h69b6ba0_0
tensorflow-base=1.12.0=mkl_py36h3c3e929_0
tensorflow-estimator=1.13.0=py_0
termcolor=1.1.0=py36_1
tk=8.6.8=hbc83047_0
tqdm=4.32.1=py_0
werkzeug=0.15.4=py_0
wheel=0.33.4=py36_0
wrapt=1.11.2=py36h7b6447c_0
xz=5.2.4=h14c3975_4
yaml=0.1.7=had09818_2
zlib=1.2.11=h7b6447c_3
73 changes: 73 additions & 0 deletions requirements.conda.delft.gpu.txt
@@ -0,0 +1,73 @@
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
_tflow_select=2.3.0=mkl
absl-py=0.7.1=py36_0
astor=0.7.1=py36_0
blas=1.0=mkl
c-ares=1.15.0=h7b6447c_1
ca-certificates=2019.5.15=0
certifi=2019.6.16=py36_0
gast=0.2.2=py36_0
google-pasta=0.1.7=py_0
grpcio=1.16.1=py36hf8bcb03_1
h5py=2.9.0=py36h7918eee_0
hdf5=1.10.4=hb1b8bf9_0
intel-openmp=2019.4=243
jep=3.8.2=pypi_0
joblib=0.13.2=pypi_0
keras-gpu=2.2.4=0
keras-applications=1.0.8=py_0
keras-base=2.2.4=py36_0
keras-bert=0.70.1=pypi_0
keras-embed-sim=0.7.0=pypi_0
keras-layer-normalization=0.12.0=pypi_0
keras-multi-head=0.20.0=pypi_0
keras-pos-embd=0.11.0=pypi_0
keras-position-wise-feed-forward=0.6.0=pypi_0
keras-preprocessing=1.1.0=py_1
keras-self-attention=0.41.0=pypi_0
keras-transformer=0.29.0=pypi_0
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc=7.2.0=h69d50b8_2
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libprotobuf=3.8.0=hd408876_0
libstdcxx-ng=9.1.0=hdf63c60_0
markdown=3.1.1=py36_0
mkl=2019.4=243
mkl_fft=1.0.12=py36ha843d7b_0
mkl_random=1.0.2=py36hd81dba3_0
mock=3.0.5=py36_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h7e9f1db_0
numpy-base=1.16.4=py36hde5b4d6_0
openssl=1.1.1c=h7b6447c_1
pip=19.1.1=py36_0
protobuf=3.8.0=py36he6710b0_0
python=3.6.8=h0371630_0
python-lmdb=0.94=py36h14c3975_0
pyyaml=5.1.1=py36h7b6447c_0
readline=7.0=h7b6447c_5
regex=2019.06.05=py36h7b6447c_0
scikit-learn=0.21.2=pypi_0
scipy=1.2.1=py36h7c811a0_0
setuptools=41.0.1=py36_0
six=1.12.0=py36_0
sklearn=0.0=pypi_0
sqlite=3.29.0=h7b6447c_0
tensorboard=1.12.2=py36he6710b0_0
tensorflow-gpu=1.12.0=mkl_py36h69b6ba0_0
tensorflow-base=1.12.0=mkl_py36h3c3e929_0
tensorflow-estimator=1.13.0=py_0
termcolor=1.1.0=py36_1
tk=8.6.8=hbc83047_0
tqdm=4.32.1=py_0
werkzeug=0.15.4=py_0
wheel=0.33.4=py36_0
wrapt=1.11.2=py36h7b6447c_0
xz=5.2.4=h14c3975_4
yaml=0.1.7=had09818_2
zlib=1.2.11=h7b6447c_3
