
Commit 128d809

Adding documentation and requirements files for conda (GPU version is still a draft... not yet tested)

Former-commit-id: cd35302
lfoppiano committed Jul 26, 2019
1 parent aba13b8 commit 128d809
Showing 4 changed files with 211 additions and 10 deletions.
73 changes: 64 additions & 9 deletions doc/Deep-Learning-models.md
@@ -4,28 +4,83 @@

Since version 0.5.4, it is possible to use recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft) in GROBID. The neural architecture currently available is BidLSTM-CRF with GloVe embeddings, which can be used as an alternative to the default Wapiti CRF.

This architecture has been tested on Linux 64-bit and macOS (only 64-bit architectures are supported).

Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI binding of CPython. This integration is about twice as fast as the TensorFlow Java API and significantly faster than RPC serving (see https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink), and it does not require modifying DeLFT, as would be the case with a Py4J gateway (socket-based).
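
As a purely illustrative sketch (this is not GROBID's actual integration code; it assumes a recent JEP release where `SharedInterpreter` is available, and the DeLFT path is hypothetical), embedding CPython in the JVM with JEP looks roughly like this:

```java
import jep.SharedInterpreter;

// Minimal embedding of CPython inside the JVM via JEP; requires the
// jep native library to be resolvable on java.library.path.
public class JepSketch {
    public static void main(String[] args) throws Exception {
        try (SharedInterpreter interp = new SharedInterpreter()) {
            interp.exec("import sys");
            interp.exec("sys.path.append('../delft')"); // hypothetical DeLFT install path
            interp.exec("print(sys.version)");          // Python runs in-process, no RPC round-trips
        }
    }
}
```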

There are no neural models for the segmentation and fulltext tasks, because the input sequences are far too large. The problem would need to be formulated differently for these tasks.

Low-level models not using layout features (author names, dates, affiliations...) perform similarly to CRF, but CRF is much better when layout features are involved (in particular for the header model). The neural models do not use these additional features for the moment.

See some evaluation under `grobid-trainer/docs`.

Current neural models are 3-4 times slower than CRF: we do not use batch processing for the moment, and it is not clear how to use batch processing with a cascading approach.

### Getting started with DL

#### Classic Python

- install [DeLFT](https://github.com/kermitt2/delft)

- indicate the path of the DeLFT install in `grobid.properties` (`grobid-home/config/grobid.properties`)

- change the engine from `wapiti` to `delft` in the same `grobid.properties` file (see the example after this list)

- run grobid

> ./gradlew run
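
For reference, the relevant lines of `grobid-home/config/grobid.properties` then look like the following (`../delft` is the default relative path and will differ if DeLFT is installed elsewhere):

```
grobid.crf.engine=delft
grobid.delft.install=../delft
```
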
#### Anaconda

- create the conda environment

CPU only:

> conda create --name grobidDelft --file requirements.conda.delft.cpu.txt

GPU (has not been tested):

> conda create --name grobidDelft --file requirements.conda.delft.gpu.txt
- activate the environment:

> conda activate grobidDelft
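
Optionally, as a quick sanity check that the environment resolves the pinned packages (the requirements files pin TensorFlow 1.12):

> python -c "import tensorflow; print(tensorflow.__version__)"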
- install [DeLFT](https://github.com/kermitt2/delft), ignoring the `pip install` command given in the DeLFT documentation (the dependencies are already pinned in the conda environment)

- indicate the path of the DeLFT install in `grobid.properties` (`grobid-home/config/grobid.properties`)

- change the engine from `wapiti` to `delft` in the same `grobid.properties` file

- run grobid

> ./gradlew run
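
When using conda, GROBID can also be pointed directly at the environment via the `grobid.delft.python.virtualEnv` property in `grobid.properties` (empty by default; the path below is only an example, use the path of your own environment):

```
grobid.delft.python.virtualEnv=/anaconda3/envs/grobidDelft
```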

## Future improvements

ELMo embeddings have not been tried with the GROBID models yet; they could make some models better than their CRF counterparts, although probably too slow for practical usage (we estimate they would make these models around 100 times slower than the current CRF). ELMo embeddings are already integrated in DeLFT.

However, we have also recently experimented with BERT fine-tuning for sequence labelling, and more particularly with [SciBERT](https://github.com/allenai/scibert) (a BERT base model trained on Wikipedia and Semantic Scholar full texts).
We got excellent results with a runtime close to RNN with GloVe embeddings (20 times faster than with ELMo embeddings). This is the target architecture for future GROBID Deep Learning models.


## Troubleshooting

1. If there is a dependency problem when JEP starts, the virtual machine usually crashes.
We are still investigating this part; please feel free to submit an issue if you run into such problems.
See the discussion [here](https://github.com/kermitt2/grobid/pull/454)

2. If GLIBC causes an error such as:
```
! ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/Luca/.conda/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
```

here is a quick solution (libgcc may already be installed; if so, just skip that step):


> conda install libgcc
> export LD_PRELOAD=$anaconda_path/lib/libstdc++.so.6.0.25
> ./gradlew run
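
Here `$anaconda_path` stands for the root of your Anaconda install. If you are unsure which `libstdc++` your environment actually ships, you can check with a plain shell command, e.g.:

> find $anaconda_path/lib -name 'libstdc++.so.6*'
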
2 changes: 1 addition & 1 deletion grobid-home/config/grobid.properties
@@ -37,7 +37,7 @@ grobid.crf.engine=delft
grobid.delft.install=../delft
grobid.delft.useELMo=false
grobid.delft.python.virtualEnv=
#/anaconda3/envs/tensorflow
grobid.delft.redirect.output=true
grobid.pdf.blocks.max=100000
grobid.pdf.tokens.max=1000000

73 changes: 73 additions & 0 deletions requirements.conda.delft.cpu.txt
@@ -0,0 +1,73 @@
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
_tflow_select=2.3.0=mkl
absl-py=0.7.1=py36_0
astor=0.7.1=py36_0
blas=1.0=mkl
c-ares=1.15.0=h7b6447c_1
ca-certificates=2019.5.15=0
certifi=2019.6.16=py36_0
gast=0.2.2=py36_0
google-pasta=0.1.7=py_0
grpcio=1.16.1=py36hf8bcb03_1
h5py=2.9.0=py36h7918eee_0
hdf5=1.10.4=hb1b8bf9_0
intel-openmp=2019.4=243
jep=3.8.2=pypi_0
joblib=0.13.2=pypi_0
keras=2.2.4=0
keras-applications=1.0.8=py_0
keras-base=2.2.4=py36_0
keras-bert=0.70.1=pypi_0
keras-embed-sim=0.7.0=pypi_0
keras-layer-normalization=0.12.0=pypi_0
keras-multi-head=0.20.0=pypi_0
keras-pos-embd=0.11.0=pypi_0
keras-position-wise-feed-forward=0.6.0=pypi_0
keras-preprocessing=1.1.0=py_1
keras-self-attention=0.41.0=pypi_0
keras-transformer=0.29.0=pypi_0
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc=7.2.0=h69d50b8_2
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libprotobuf=3.8.0=hd408876_0
libstdcxx-ng=9.1.0=hdf63c60_0
markdown=3.1.1=py36_0
mkl=2019.4=243
mkl_fft=1.0.12=py36ha843d7b_0
mkl_random=1.0.2=py36hd81dba3_0
mock=3.0.5=py36_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h7e9f1db_0
numpy-base=1.16.4=py36hde5b4d6_0
openssl=1.1.1c=h7b6447c_1
pip=19.1.1=py36_0
protobuf=3.8.0=py36he6710b0_0
python=3.6.8=h0371630_0
python-lmdb=0.94=py36h14c3975_0
pyyaml=5.1.1=py36h7b6447c_0
readline=7.0=h7b6447c_5
regex=2019.06.05=py36h7b6447c_0
scikit-learn=0.21.2=pypi_0
scipy=1.2.1=py36h7c811a0_0
setuptools=41.0.1=py36_0
six=1.12.0=py36_0
sklearn=0.0=pypi_0
sqlite=3.29.0=h7b6447c_0
tensorboard=1.12.2=py36he6710b0_0
tensorflow=1.12.0=mkl_py36h69b6ba0_0
tensorflow-base=1.12.0=mkl_py36h3c3e929_0
tensorflow-estimator=1.13.0=py_0
termcolor=1.1.0=py36_1
tk=8.6.8=hbc83047_0
tqdm=4.32.1=py_0
werkzeug=0.15.4=py_0
wheel=0.33.4=py36_0
wrapt=1.11.2=py36h7b6447c_0
xz=5.2.4=h14c3975_4
yaml=0.1.7=had09818_2
zlib=1.2.11=h7b6447c_3
73 changes: 73 additions & 0 deletions requirements.conda.delft.gpu.txt
@@ -0,0 +1,73 @@
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
_tflow_select=2.3.0=mkl
absl-py=0.7.1=py36_0
astor=0.7.1=py36_0
blas=1.0=mkl
c-ares=1.15.0=h7b6447c_1
ca-certificates=2019.5.15=0
certifi=2019.6.16=py36_0
gast=0.2.2=py36_0
google-pasta=0.1.7=py_0
grpcio=1.16.1=py36hf8bcb03_1
h5py=2.9.0=py36h7918eee_0
hdf5=1.10.4=hb1b8bf9_0
intel-openmp=2019.4=243
jep=3.8.2=pypi_0
joblib=0.13.2=pypi_0
keras-gpu=2.2.4=0
keras-applications=1.0.8=py_0
keras-base=2.2.4=py36_0
keras-bert=0.70.1=pypi_0
keras-embed-sim=0.7.0=pypi_0
keras-layer-normalization=0.12.0=pypi_0
keras-multi-head=0.20.0=pypi_0
keras-pos-embd=0.11.0=pypi_0
keras-position-wise-feed-forward=0.6.0=pypi_0
keras-preprocessing=1.1.0=py_1
keras-self-attention=0.41.0=pypi_0
keras-transformer=0.29.0=pypi_0
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc=7.2.0=h69d50b8_2
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libprotobuf=3.8.0=hd408876_0
libstdcxx-ng=9.1.0=hdf63c60_0
markdown=3.1.1=py36_0
mkl=2019.4=243
mkl_fft=1.0.12=py36ha843d7b_0
mkl_random=1.0.2=py36hd81dba3_0
mock=3.0.5=py36_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h7e9f1db_0
numpy-base=1.16.4=py36hde5b4d6_0
openssl=1.1.1c=h7b6447c_1
pip=19.1.1=py36_0
protobuf=3.8.0=py36he6710b0_0
python=3.6.8=h0371630_0
python-lmdb=0.94=py36h14c3975_0
pyyaml=5.1.1=py36h7b6447c_0
readline=7.0=h7b6447c_5
regex=2019.06.05=py36h7b6447c_0
scikit-learn=0.21.2=pypi_0
scipy=1.2.1=py36h7c811a0_0
setuptools=41.0.1=py36_0
six=1.12.0=py36_0
sklearn=0.0=pypi_0
sqlite=3.29.0=h7b6447c_0
tensorboard=1.12.2=py36he6710b0_0
tensorflow-gpu=1.12.0=mkl_py36h69b6ba0_0
tensorflow-base=1.12.0=mkl_py36h3c3e929_0
tensorflow-estimator=1.13.0=py_0
termcolor=1.1.0=py36_1
tk=8.6.8=hbc83047_0
tqdm=4.32.1=py_0
werkzeug=0.15.4=py_0
wheel=0.33.4=py36_0
wrapt=1.11.2=py36h7b6447c_0
xz=5.2.4=h14c3975_4
yaml=0.1.7=had09818_2
zlib=1.2.11=h7b6447c_3
