Commit

Merge branch 'master' into python-envs
lfoppiano committed Nov 29, 2024
2 parents bd426c4 + 37dcec1 commit 37cb3d1
Showing 34 changed files with 771 additions and 216 deletions.
15 changes: 0 additions & 15 deletions .github/ISSUE_TEMPLATE/general-report.md

This file was deleted.

36 changes: 36 additions & 0 deletions .github/ISSUE_TEMPLATE/general-report.yml
@@ -0,0 +1,36 @@
name: General report
description: Create a report to help us improve
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this bug report! Before creating a new issue, make sure you have looked at the [official documentation](https://grobid.readthedocs.com) or tried the **experimental** [Mendable Q/A chat](https://www.mendable.ai/demo/723cfc12-fdd6-4631-9a9e-21b80241131b). **NOTE**: the suggested method of running grobid is through Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/).
- type: input
id: os
attributes:
label: Operating System and architecture (arm64, amd64, x86, etc.)
description: Please remember that Windows is not supported and Mac OS arm64 is still experimental.
validations:
required: false
- type: input
id: java
attributes:
label: What is your Java version
description: "java --version"
validations:
required: false
- type: textarea
id: logs
attributes:
label: Log and information
description: In case of build or run errors, please submit the error while running gradlew with ``--stacktrace`` and ``--info`` for better log traces (e.g. `./gradlew run --stacktrace --info`) or attach the log file `logs/grobid-service.log` or the console log.
validations:
required: false
- type: textarea
id: what-happened
attributes:
label: Further information
description: Please give us any information that could be of help
validations:
required: false

9 changes: 5 additions & 4 deletions .github/workflows/ci-build-manual-crf.yml
@@ -3,10 +3,11 @@ name: Build and push a CRF-only docker image
on:
workflow_dispatch:
inputs:
suffix:
custom_tag:
type: string
description: Docker image suffix (e.g. develop, crf, full)
required: false
description: Docker image tag
required: true
default: "latest-crf"

jobs:
build:
@@ -42,6 +43,6 @@ jobs:
registry: docker.io
pushImage: true
tags: |
latest-develop, latest-crf${{ github.event.inputs.suffix != '' && '-' || '' }}${{ github.event.inputs.suffix }}
latest-develop, ${{ github.event.inputs.custom_tag}}
- name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}
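The tag change in this workflow can be sketched outside GitHub Actions. The snippet below is only an illustration of how the removed `suffix` expression and the new `custom_tag` input resolve to image tags; the function names are hypothetical and the logic is a reading of the two `tags:` lines above, not code from the repository.

```python
def old_crf_tags(suffix: str = "") -> list[str]:
    """Mimic the removed expression: 'latest-crf', plus '-<suffix>' when a suffix is given."""
    dash = "-" if suffix != "" else ""
    return ["latest-develop", f"latest-crf{dash}{suffix}"]


def new_crf_tags(custom_tag: str = "latest-crf") -> list[str]:
    """Mimic the new expression: the custom_tag input is used verbatim (default 'latest-crf')."""
    return ["latest-develop", custom_tag]


if __name__ == "__main__":
    print(old_crf_tags("develop"))    # suffix appended to the fixed 'latest-crf' prefix
    print(new_crf_tags())             # default value of the new input
    print(new_crf_tags("my-test"))    # 'my-test' is a made-up tag for illustration
```

With the new input the caller controls the whole tag rather than only a suffix, so the `latest-crf` prefix is no longer enforced by the workflow.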
12 changes: 9 additions & 3 deletions .github/workflows/ci-build-manual-full.yml
@@ -1,7 +1,13 @@
name: Build and push a full docker image

on: "workflow_dispatch"

on:
workflow_dispatch:
inputs:
custom_tag:
type: string
description: Docker image tag
required: true
default: "latest-full"

jobs:
build:
@@ -35,7 +41,7 @@ jobs:
image: lfoppiano/grobid
registry: docker.io
pushImage: true
tags: latest-full
tags: latest-full, ${{ github.event.inputs.custom_tag}}
dockerfile: Dockerfile.delft
- name: Image digest
run: echo ${{ steps.docker_build.outputs.digest }}
2 changes: 1 addition & 1 deletion .github/workflows/ci-build-unstable.yml
@@ -4,7 +4,7 @@ on: [ push ]

concurrency:
group: gradle
cancel-in-progress: true
cancel-in-progress: false


jobs:
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ Thumbs.db
.settings
.classpath
.idea
.vscode
.gradle
**/build
*/out/
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,22 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.8.2] - TBD

### Added
- New model specialisation/variants (flavors) mechanism #1151
- New specialised models for lightweight processing that cover other types of scientific articles that do not follow the general segmentation schema (e.g. corrections, editorial letters, etc.) #1202
- Additional training data covering edge cases where the Data Availability statements span multiple pages #1200
- Added a flag that allows outputting the raw copyright information in TEI #1181

### Changed

### Fixed
- Fix URL identification for certain edge cases #1190, #1191, #1185
- Fix fulltext model training data #1107
- Fix header model training data #1128
- Updated the docker image's packages to reduce the vulnerabilities #1173

## [0.8.1] - 2024-09-14

### Added
3 changes: 2 additions & 1 deletion Dockerfile.delft
@@ -87,6 +87,7 @@ ENTRYPOINT ["/tini", "-s", "--"]

# install JRE, python and other dependencies
RUN apt-get update && \
apt-mark hold libcudnn8 && \
apt-get -y upgrade && \
apt-get -y --no-install-recommends install apt-utils build-essential gcc libxml2 libfontconfig unzip curl \
openjdk-17-jre-headless ca-certificates-java \
@@ -141,7 +142,7 @@ RUN python3 preload_embeddings.py --registry ./resources-registry.json && \
RUN mkdir delft && \
cp ./resources-registry.json delft/

ENV GROBID_SERVICE_OPTS "--add-opens java.base/java.lang=ALL-UNNAMED"
ENV GROBID_SERVICE_OPTS "--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED"

CMD ["./grobid-service/bin/grobid-service"]

6 changes: 5 additions & 1 deletion Readme.md
@@ -78,7 +78,11 @@ For facilitating the usage GROBID service at scale, we provide clients written i
- <a href="https://github.com/kermitt2/grobid-client-java" target="_blank">Java GROBID client</a>
- <a href="https://github.com/kermitt2/grobid-client-node" target="_blank">Node.js GROBID client</a>

All these clients take advantage of multi-threading to scale the processing of large sets of PDFs. As a consequence, they are much more efficient than the [batch command lines](https://grobid.readthedocs.io/en/latest/Grobid-batch/) (which use only one thread) and should be preferred.
A third-party client for Go is also available, offering functionality similar to the Python client:

- <a href="https://github.com/miku/grobidclient" target="_blank">Go GROBID client</a>

All these clients take advantage of multi-threading to scale the processing of large sets of PDFs. As a consequence, they are much more efficient than the [batch command lines](https://grobid.readthedocs.io/en/latest/Grobid-batch/) (which use only one thread) and should be preferred.

For example, we have been able to run the complete full-text processing at around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client listed above during one week on one 16 CPU machine (16 threads, 32GB RAM, no SSD, articles from mainstream publishers), see [here](https://github.com/kermitt2/grobid/issues/443#issuecomment-505208132) (11.3M PDF were processed in 6 days by 2 servers without interruption).
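As a quick sanity check on these figures, the arithmetic can be reproduced in a few lines. This is only a sketch: the helper names are made up for illustration, and the second function assumes the load was split evenly between the two servers, which the text above does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pdfs_per_day(pdfs_per_second: float) -> float:
    """Convert a sustained per-second rate into a daily volume."""
    return pdfs_per_second * SECONDS_PER_DAY


def sustained_rate(total_pdfs: float, days: float, servers: int) -> float:
    """Average PDF per second per server, assuming an even split across servers."""
    return total_pdfs / (days * SECONDS_PER_DAY * servers)


if __name__ == "__main__":
    # 10.6 PDF/s over one day, to compare with "around 915,000 PDF per day"
    print(round(pdfs_per_day(10.6)))
    # 11.3M PDF in 6 days on 2 servers, to compare with "around 10.6 PDF per second"
    print(round(sustained_rate(11.3e6, 6, 2), 1))
```

Both results land in the same range as the quoted numbers, which are given as approximations.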

6 changes: 3 additions & 3 deletions doc/Deep-Learning-models.md
@@ -18,7 +18,7 @@ Current neural models can be up to 50 times slower than CRF, depending on the ar

## Recommended Deep Learning models

By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing.
By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](Configuration.md#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing.

For current GROBID version 0.8.1, we recommend considering the usage of the following Deep Learning models:

@@ -46,7 +46,7 @@ However, if you need a "local" library installation and build, prepare a lot of

#### Classic python and Virtualenv

<span>0.</span> Install GROBID as indicated [here](https://grobid.readthedocs.io/en/latest/Install-Grobid/).
<span>0.</span> Install GROBID as indicated [here](Install-Grobid.md).

The following was tested with Java version up to 17.

@@ -130,7 +130,7 @@ INFO [2020-10-30 23:04:07,756] org.grobid.core.jni.DeLFTModel: Loading DeLFT mo
INFO [2020-10-30 23:04:07,758] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 44
```

It is then possible to [benchmark end-to-end](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) the selected Deep Learning models as in any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and not too memory-hungry. In addition, note that, currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This should certainly change in the future!
It is then possible to [benchmark end-to-end](End-to-end-evaluation.md) the selected Deep Learning models as in any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and not too memory-hungry. In addition, note that, currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This should certainly change in the future!

#### Anaconda

2 changes: 1 addition & 1 deletion doc/End-to-end-evaluation.md
@@ -12,7 +12,7 @@ For actual benchmarks, see the [Benchmarking page](Benchmarking.md). We describe

## Datasets

The corpora used for the end-to-end evaluation of Grobid are all available in a single place on Zenodo: https://zenodo.org/record/7708580. Some of these datasets have been further annotated to make the evaluation of certain sub-structures possible (in particular code and data availability sections & funding sections).
The corpora used for the end-to-end evaluation of Grobid are all available in a single place on Zenodo: [https://zenodo.org/record/7708580](https://zenodo.org/record/7708580). Some of these datasets have been further annotated to make the evaluation of certain sub-structures possible (in particular code and data availability sections & funding sections).

These resources are originally published under CC-BY license. Our additional annotations are similarly under CC-BY. We thank NIH, bioRxiv, PLOS and eLife for making these resources Open Access and reusable.

2 changes: 1 addition & 1 deletion doc/Grobid-batch.md
@@ -1,6 +1,6 @@
<h1>GROBID batch mode</h1>

We do **not** recommend using the batch mode. For the best performance, benchmarking and for exploiting multithreading, we recommend using the service mode, see [Use GROBID as a service](Grobid-service.md), and not the batch mode. Clients for GROBID services are provided in [Python](https://github.com/kermitt2/grobid-client-python), [Java](https://github.com/kermitt2/grobid-client-java) and [node.js](https://github.com/kermitt2/grobid-client-node).
We do **not** recommend using the batch mode. For the best performance, benchmarking and for exploiting multithreading, we recommend using the service mode, see [Use GROBID as a service](Grobid-service.md), and not the batch mode. Clients for GROBID services are provided in [Python](https://github.com/kermitt2/grobid-client-python), [Java](https://github.com/kermitt2/grobid-client-java), [node.js](https://github.com/kermitt2/grobid-client-node) and [Go](https://github.com/miku/grobidclient).

Using the batch mode is only necessary to create pre-annotated training data. If you do not need good runtime and just need to casually process some inputs, the batch mode is available for convenience.

4 changes: 2 additions & 2 deletions doc/Grobid-docker.md
@@ -57,7 +57,7 @@ Access the service:
- open the browser at the address `http://localhost:8080`
- the health check will be accessible at the address `http://localhost:8081`

Grobid web services are then available as described in the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/).
Grobid web services are then available as described in the [service documentation](Grobid-service.md).

By default, this image runs Deep Learning models for:

@@ -113,7 +113,7 @@ Access the service:
- open the browser at the address `http://localhost:8080`
- the health check will be accessible at the address `http://localhost:8081`

Grobid web services are then available as described in the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/).
Grobid web services are then available as described in the [service documentation](Grobid-service.md).


## Configure using the yaml config file