Alternative articles processing flavors #1202

Merged 88 commits on Jan 10, 2025
Changes from 81 commits

Commits
32ef9c4
add fulltext extraction without structure
lfoppiano Jun 26, 2024
1041222
add flavoured model selection
lfoppiano Jul 18, 2024
2f506fd
add documentation
lfoppiano Jul 25, 2024
1696743
add segmentation trainer for the light model flavour
lfoppiano Jul 28, 2024
ed18fb8
duplicate training data
lfoppiano Jul 28, 2024
bb57491
add light trainer for light header
lfoppiano Jul 28, 2024
6835fc4
First experiment for light segmentation model
lfoppiano Jul 28, 2024
e26d799
add light header model
lfoppiano Jul 28, 2024
38b7d72
set all body as a single paragraph
lfoppiano Jul 28, 2024
963b169
Update light segmentation model
lfoppiano Jul 28, 2024
2ece31e
assemble data
lfoppiano Jul 29, 2024
f30d6f6
add lightweight model training generation
lfoppiano Aug 2, 2024
8758c52
fix text handling
lfoppiano Aug 2, 2024
2599295
include references in the segmentation
lfoppiano Aug 3, 2024
8143cd8
add light segmentation model with references
lfoppiano Aug 3, 2024
17b16ce
set up training batch
lfoppiano Aug 3, 2024
df82c66
add model information in configuration
lfoppiano Aug 3, 2024
733ddd9
add doi and dates in the header light model
lfoppiano Aug 4, 2024
e9bd9ba
add new segmentation model that considers references
lfoppiano Aug 4, 2024
7dd1000
update header model to support also dates and doi
lfoppiano Aug 4, 2024
7ec083f
gitignore
lfoppiano Aug 9, 2024
2686b92
Merge branch 'flavor' into feature/segmentation-light
lfoppiano Aug 9, 2024
08cf47b
resolve conflicts, fix deprecations, improve flow, update model resol…
lfoppiano Aug 9, 2024
7935137
add light model that includes references
lfoppiano Aug 10, 2024
831e288
Merge branch 'release-0.8.1' into feature/segmentation-light
lfoppiano Aug 22, 2024
fe2c206
default
lfoppiano Aug 23, 2024
f321602
add DL models
lfoppiano Sep 14, 2024
7d9e088
Merge branch 'master' into feature/segmentation-light
lfoppiano Sep 14, 2024
8711200
Merge branch 'master' into feature/segmentation-light
lfoppiano Sep 20, 2024
29ad1b1
cosmetics
lfoppiano Sep 20, 2024
5fda319
Updated light model
lfoppiano Sep 20, 2024
21b738d
use standard charset class
lfoppiano Sep 21, 2024
fb611ae
add segmentation light with references in training
lfoppiano Sep 21, 2024
fa28ec7
update interface
lfoppiano Sep 21, 2024
92b80bd
update light model with ref
lfoppiano Sep 22, 2024
c195920
update segmentation and header trainers
lfoppiano Sep 22, 2024
20ce009
updated simpler light segmentation model
lfoppiano Sep 22, 2024
ee15d34
updated model for the segmentation light model with references
lfoppiano Sep 23, 2024
500b5e7
improve build
lfoppiano Sep 24, 2024
540dbd8
Merge branch 'refs/heads/master' into feature/segmentation-light
lfoppiano Oct 23, 2024
bbfa442
update documentation and configuration
lfoppiano Oct 23, 2024
a411499
add documentation
lfoppiano Oct 23, 2024
ae8b25c
try to add a link to avoid duplicating the training data
lfoppiano Oct 23, 2024
a7c8a50
refine documentation
lfoppiano Oct 23, 2024
8e26029
update index
lfoppiano Oct 23, 2024
694a2b3
update index
lfoppiano Oct 23, 2024
8ccbf21
apply a custom theme to fix the tables appearance
lfoppiano Oct 23, 2024
0266ad6
update links and doc config
lfoppiano Oct 23, 2024
d952985
fix links
lfoppiano Oct 23, 2024
625e309
update config
lfoppiano Oct 23, 2024
56677c4
fix links
lfoppiano Oct 23, 2024
4b10245
fix links
lfoppiano Oct 23, 2024
34d8b5d
uniform the parameters
lfoppiano Nov 21, 2024
4c935e9
add end2end evaluation for the lightweight flavors
lfoppiano Nov 21, 2024
916a57d
update documentation
lfoppiano Nov 21, 2024
f7cbf76
update documentation
lfoppiano Nov 21, 2024
2365fac
Merge branch 'master' into feature/segmentation-light
lfoppiano Nov 21, 2024
2d6e379
use different grobid names for the generated files when the testing t…
lfoppiano Nov 24, 2024
2049b36
fix arguments
lfoppiano Nov 25, 2024
05c5a5d
fix arguments
lfoppiano Nov 25, 2024
c09aba1
new end 2 end evaluation scores for the flavor models
lfoppiano Nov 25, 2024
b6d4756
Merge branch 'master' into feature/segmentation-light
lfoppiano Nov 25, 2024
6840da6
Merge branch 'flavor' into feature/segmentation-light
lfoppiano Nov 26, 2024
8484a95
more documentation
lfoppiano Nov 27, 2024
b71f4ba
cosmetics
lfoppiano Nov 27, 2024
833130d
cosmetics
lfoppiano Nov 28, 2024
3c36427
Merge branch 'new-training-data-das' into feature/segmentation-light_…
lfoppiano Nov 29, 2024
f33b4fa
preliminary merge of training data
lfoppiano Nov 29, 2024
e30cb58
add more documentation
lfoppiano Nov 29, 2024
bb6ff28
new flavor segmentation models
lfoppiano Nov 29, 2024
b8be5b2
Add description and use cases for the flavors
lfoppiano Nov 30, 2024
797f4d5
updated segmentation light models
lfoppiano Nov 30, 2024
fc31e34
update end2end evaluation for flavors
lfoppiano Dec 1, 2024
57fbe4d
update the training data generation process, remove wrongly hardcoded…
lfoppiano Dec 25, 2024
41784a2
Merge branch 'feature/segmentation-light' into feature/segmentation-l…
lfoppiano Dec 25, 2024
0c7a516
add lightweight flavour for fulltext
lfoppiano Jan 1, 2025
000088a
update build
lfoppiano Jan 1, 2025
9078dba
plug the trainer
lfoppiano Jan 1, 2025
2b034b4
create multiple directories if necessary
lfoppiano Jan 2, 2025
236e68e
fix tags
lfoppiano Jan 2, 2025
36f9aa8
Fulltext lightweight model for fulltext
lfoppiano Jan 4, 2025
0099897
plug in the lightweight fulltext parser
lfoppiano Jan 4, 2025
b90e134
cleanup
lfoppiano Jan 4, 2025
3c2c3d6
Merge branch 'master' into feature/segmentation-light
lfoppiano Jan 6, 2025
a45e2bf
Merge branch 'master' into feature/segmentation-light
lfoppiano Jan 6, 2025
164636b
fix mess from merge
lfoppiano Jan 6, 2025
59916c3
Merge branch 'feature/segmentation-light' into feature/segmentation-l…
lfoppiano Jan 10, 2025
b6584b2
update documentation
lfoppiano Jan 10, 2025
74 changes: 41 additions & 33 deletions build.gradle
@@ -186,11 +186,11 @@ subprojects {

test {
useJUnitPlatform()

testLogging.showStandardStreams = true
// enable for having separate test executor for different tests
forkEvery = 1
-maxHeapSize = "1024m"
+maxHeapSize = "1024m"

def libraries = ""
if (Os.isFamily(Os.FAMILY_MAC)) {
@@ -199,7 +199,7 @@ subprojects {
} else {
libraries = "${file("./grobid-home/lib/mac-64").absolutePath}"
}
-} else if (Os.isFamily(Os.FAMILY_UNIX)) {
+} else if (Os.isFamily(Os.FAMILY_UNIX)) {
def jepDir = rootProject.rootDir.getAbsolutePath() + "/grobid-home/lib/lin-64/jep"
libraries = jepDir
jepDir = rootProject.rootDir.getAbsolutePath() + "/grobid-home/lib/lin-64"
@@ -209,7 +209,7 @@ subprojects {
}

if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
-jvmArgs "--add-opens", "java.base/java.util.stream=ALL-UNNAMED",
+jvmArgs "--add-opens", "java.base/java.util.stream=ALL-UNNAMED",
"--add-opens", "java.base/java.io=ALL-UNNAMED", "--add-opens", "java.xml/jdk.xml.internal=ALL-UNNAMED"
}
systemProperty "java.library.path","${System.getProperty('java.library.path')}:" + libraries
@@ -351,7 +351,7 @@ project(":grobid-service") {
} else {
throw new RuntimeException("Unsupported platform!")
}

if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
jvmArgs "--add-opens", "java.base/java.lang=ALL-UNNAMED"
}
@@ -380,7 +380,7 @@ project(":grobid-service") {
distTar { duplicatesStrategy = DuplicatesStrategy.EXCLUDE }

dependencies {
-implementation project(':grobid-core')
+implementation project(':grobid-core')
implementation project(':grobid-trainer')

//Dropwizard
@@ -397,7 +397,7 @@ project(":grobid-service") {
implementation 'io.dropwizard.metrics:metrics-core:4.2.22'
implementation 'io.dropwizard.metrics:metrics-servlets:4.2.22'
implementation 'io.dropwizard:dropwizard-json-logging:4.0.0'

implementation "org.apache.pdfbox:pdfbox:2.0.3"
implementation "javax.activation:activation:1.1.1"
implementation "io.prometheus:simpleclient_dropwizard:0.16.0"
@@ -500,34 +500,38 @@ project(":grobid-trainer") {
}

def trainerTasks = [
-"train_name_header" : "org.grobid.trainer.NameHeaderTrainer",
-"train_name_citation" : "org.grobid.trainer.NameCitationTrainer",
-"train_affiliation_address" : "org.grobid.trainer.AffiliationAddressTrainer",
-// "train_header" : "org.grobid.trainer.HeaderTrainer",
-"train_fulltext" : "org.grobid.trainer.FulltextTrainer",
-"train_shorttext" : "org.grobid.trainer.ShorttextTrainer",
-"train_figure" : "org.grobid.trainer.FigureTrainer",
-"train_table" : "org.grobid.trainer.TableTrainer",
-"train_citation" : "org.grobid.trainer.CitationTrainer",
-"train_date" : "org.grobid.trainer.DateTrainer",
-// "train_segmentation" : "org.grobid.trainer.SegmentationTrainer",
-"train_reference_segmentation": "org.grobid.trainer.ReferenceSegmenterTrainer",
-"train_ebook_model" : "org.grobid.trainer.EbookTrainer",
-"train_patent_citation" : "org.grobid.trainer.PatentParserTrainer",
+"train_name_header" : "org.grobid.trainer.NameHeaderTrainer",
+"train_name_citation" : "org.grobid.trainer.NameCitationTrainer",
+"train_affiliation_address" : "org.grobid.trainer.AffiliationAddressTrainer",
+"train_shorttext" : "org.grobid.trainer.ShorttextTrainer",
+"train_figure" : "org.grobid.trainer.FigureTrainer",
+"train_table" : "org.grobid.trainer.TableTrainer",
+"train_citation" : "org.grobid.trainer.CitationTrainer",
+"train_date" : "org.grobid.trainer.DateTrainer",
+"train_reference_segmentation" : "org.grobid.trainer.ReferenceSegmenterTrainer",
+"train_ebook_model" : "org.grobid.trainer.EbookTrainer",
+"train_patent_citation" : "org.grobid.trainer.PatentParserTrainer",
"train_funding_acknowledgement" : "org.grobid.trainer.FundingAcknowledgementTrainer"
]

def complexTrainerTasks = [
-"train_header" : ["org.grobid.trainer.HeaderTrainer", ""],
-"train_header_ietf" : ["org.grobid.trainer.HeaderTrainer", "sdo/ietf"],
-"train_segmentation" : ["org.grobid.trainer.SegmentationTrainer", ""],
-"train_segmentation_ietf" : ["org.grobid.trainer.SegmentationTrainer", "sdo/ietf"]
+"train_header" : ["org.grobid.trainer.HeaderTrainer", ""],
+"train_header_article_light" : ["org.grobid.trainer.HeaderTrainer", "article/light"],
+"train_header_article_light_ref" : ["org.grobid.trainer.HeaderTrainer", "article/light-ref"],
+"train_header_ietf" : ["org.grobid.trainer.HeaderTrainer", "sdo/ietf"],
+"train_segmentation" : ["org.grobid.trainer.SegmentationTrainer", ""],
+"train_segmentation_article_light" : ["org.grobid.trainer.SegmentationTrainer", "article/light"],
+"train_segmentation_article_light_ref" : ["org.grobid.trainer.SegmentationTrainer", "article/light-ref"],
+"train_segmentation_ietf" : ["org.grobid.trainer.SegmentationTrainer", "sdo/ietf"],
+"train_fulltext" : ["org.grobid.trainer.FulltextTrainer", ""],
+"train_fulltext_article_light" : ["org.grobid.trainer.FulltextTrainer", "article/light"],
+"train_fulltext_article_light_ref" : ["org.grobid.trainer.FulltextTrainer", "article/light-ref"],
]

def libraries = ""
if (Os.isFamily(Os.FAMILY_MAC)) {
if (Os.OS_ARCH.equals("aarch64")) {
-libraries = "${file("../grobid-home/lib/mac_arm-64").absolutePath}"
+libraries = "${file("../grobid-home/lib/mac_arm-64").absolutePath}"
} else {
libraries = "${file("../grobid-home/lib/mac-64").absolutePath}"
}
@@ -537,13 +541,16 @@ project(":grobid-trainer") {
} else {
throw new RuntimeException("Unsupported platform!")
}

trainerTasks.each { taskName, mainClassName ->
tasks.create(name: taskName, type: JavaExec, group: 'modeltraining') {
main = mainClassName
classpath = sourceSets.main.runtimeClasspath
-if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0)
+if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
jvmArgs '-Xmx3072m', "--add-opens", "java.base/java.lang=ALL-UNNAMED"
+} else {
+jvmArgs '-Xmx3072m'
+}
systemProperty "java.library.path","${System.getProperty('java.library.path')}:" + libraries
}
}
@@ -552,10 +559,11 @@ project(":grobid-trainer") {
tasks.create(name: taskName, type: JavaExec, group: 'modeltraining') {
main = mainClassNameAndArgs[0]
classpath = sourceSets.main.runtimeClasspath
-if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0)
+if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
jvmArgs '-Xmx3072m', "--add-opens", "java.base/java.lang=ALL-UNNAMED"
-if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0)
-jvmArgs '-Xmx3072m', "--add-opens", "java.base/java.lang=ALL-UNNAMED"
+} else {
+jvmArgs '-Xmx3072m'
+}
args mainClassNameAndArgs[1]
}
}
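The `complexTrainerTasks` map pairs each Gradle task name with a trainer main class and a single flavor argument, and the loop above uses element `[0]` as `main` and element `[1]` as `args`. The following is only a minimal Java sketch of that lookup, not the build logic itself: the class `TrainerTaskSketch` and its `describe` method are hypothetical names introduced for illustration, and only a subset of the map entries is copied from the diff.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; it is not part of GROBID or of this pull request.
class TrainerTaskSketch {
    // Task name -> [main class, flavor argument]; a subset copied from complexTrainerTasks.
    static final Map<String, List<String>> COMPLEX_TRAINER_TASKS = new LinkedHashMap<>();

    static {
        COMPLEX_TRAINER_TASKS.put("train_header",
                List.of("org.grobid.trainer.HeaderTrainer", ""));
        COMPLEX_TRAINER_TASKS.put("train_header_article_light",
                List.of("org.grobid.trainer.HeaderTrainer", "article/light"));
        COMPLEX_TRAINER_TASKS.put("train_segmentation_article_light_ref",
                List.of("org.grobid.trainer.SegmentationTrainer", "article/light-ref"));
        COMPLEX_TRAINER_TASKS.put("train_fulltext_article_light",
                List.of("org.grobid.trainer.FulltextTrainer", "article/light"));
    }

    // Mirrors `main = mainClassNameAndArgs[0]` and `args mainClassNameAndArgs[1]`.
    static String describe(String taskName) {
        List<String> entry = COMPLEX_TRAINER_TASKS.get(taskName);
        if (entry == null) {
            throw new IllegalArgumentException("Unknown task: " + taskName);
        }
        String flavor = entry.get(1);
        return entry.get(0) + (flavor.isEmpty() ? " (empty flavor argument)" : " " + flavor);
    }

    public static void main(String[] args) {
        // Print the main class and flavor argument each task would run with.
        for (String name : COMPLEX_TRAINER_TASKS.keySet()) {
            System.out.println(name + " -> " + describe(name));
        }
    }
}
```

Assuming the project's Gradle wrapper is used, a task created this way would presumably be run by its name, for example `./gradlew train_header_article_light`.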
@@ -574,7 +582,7 @@ project(":grobid-trainer") {
task(jatsEval, dependsOn: 'classes', type: JavaExec, group: 'modelevaluation') {
main = 'org.grobid.trainer.evaluation.EndToEndEvaluation'
classpath = sourceSets.main.runtimeClasspath
-args 'nlm', getArg('p2t', '.'), getArg('run', '0'), getArg('fileRatio', '1.0')
+args 'nlm', getArg('p2t', '.'), getArg('run', '0'), getArg('fileRatio', '1.0'), getArg('flavor', '')
if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
jvmArgs '-Xmx3072m', "--add-opens", "java.base/java.lang=ALL-UNNAMED"
} else {
@@ -586,7 +594,7 @@ project(":grobid-trainer") {
task(teiEval, dependsOn: 'classes', type: JavaExec, group: 'modelevaluation') {
main = 'org.grobid.trainer.evaluation.EndToEndEvaluation'
classpath = sourceSets.main.runtimeClasspath
-args 'tei', getArg('p2t', '.'), getArg('run', '0'), getArg('fileRatio', '1.0')
+args 'tei', getArg('p2t', '.'), getArg('run', '0'), getArg('fileRatio', '1.0'), getArg('flavor', '')
if(JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
jvmArgs '-Xmx3072m', "--add-opens", "java.base/java.lang=ALL-UNNAMED"
} else {
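Both evaluation tasks now pass a fifth positional argument, `flavor`, which defaults to an empty string. The definition of `getArg` is not part of this diff, so the sketch below assumes it returns the named property when one is supplied and the default otherwise. `EvalArgsSketch` and its two methods are hypothetical names used only to illustrate how the argument list would be assembled under that assumption.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; the real getArg helper lives in build.gradle and is not shown here.
class EvalArgsSketch {
    // Assumed behaviour: use the supplied property if present, else the default.
    static String getArg(Map<String, String> props, String name, String defaultValue) {
        return props.getOrDefault(name, defaultValue);
    }

    // Builds the positional arguments in the order used by the jatsEval / teiEval tasks.
    static List<String> evalArgs(String goldType, Map<String, String> props) {
        return List.of(
                goldType,
                getArg(props, "p2t", "."),
                getArg(props, "run", "0"),
                getArg(props, "fileRatio", "1.0"),
                getArg(props, "flavor", ""));
    }

    public static void main(String[] args) {
        // The path below is a placeholder, not a value taken from the project.
        Map<String, String> props = Map.of("p2t", "/data/gold", "flavor", "article/light");
        System.out.println(evalArgs("nlm", props));
        // prints: [nlm, /data/gold, 0, 1.0, article/light]
    }
}
```

When no flavor is supplied, the last argument is the empty string, so the evaluation class receives the same first four arguments as before plus an empty fifth one.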
6 changes: 3 additions & 3 deletions doc/Benchmarking-biorxiv.md
@@ -24,7 +24,7 @@ Evaluation on 2000 PDF preprints out of 2000 (no failure).

Runtime for processing 2000 PDF: **1713** seconds (0.85 seconds per PDF file) on Ubuntu 22.04, 16 CPU (32 threads), 128GB RAM and with a GeForce GTX 1080 Ti GPU.

-Note: with CRF only models runtime is 622s (0.31 second per PDF) with 4GPU, 8 threads.
+Note: with CRF only models runtime is 622s (0.31 second per PDF) with 4 CPU, 8 threads.


## Header metadata
@@ -35,14 +35,14 @@ Evaluation on 2000 random PDF files out of 1998 PDF (ratio 1.0).

**Field-level results**

-| label | precision | recall | f1 | support |
+| label | precision | recall | f1 | support |
|--- |--- |--- |--- |--- |
| abstract | 2.36 | 2.31 | 2.34 | 1989 |
| authors | 84.3 | 83.58 | 83.94 | 1998 |
| first_author | 96.97 | 96.24 | 96.61 | 1996 |
| keywords | 58.9 | 59.95 | 59.42 | 839 |
| title | 77.77 | 76.99 | 77.38 | 1999 |
-| | | | | |
+| | | | | |
| **all fields (micro avg.)** | **64.95** | **64.38** | **64.66** | 8821 |
| all fields (macro avg.) | 64.06 | 63.82 | 63.94 | 8821 |
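
The macro-average row appears to be the unweighted mean of the per-label scores in the table above; the micro average is presumably computed from pooled counts, which cannot be rebuilt from this table. A minimal Java sketch that recomputes the macro precision and f1 from the listed values follows. `MacroAverageSketch` is a hypothetical name, and the last digit can differ slightly for some columns because the per-label values are already rounded.

```java
import java.util.Locale;

// Hypothetical illustration class; it only re-derives figures from the table above.
class MacroAverageSketch {
    // Unweighted mean of per-label scores (macro average).
    static double macroAverage(double[] perLabelScores) {
        double sum = 0.0;
        for (double score : perLabelScores) {
            sum += score;
        }
        return sum / perLabelScores.length;
    }

    public static void main(String[] args) {
        // Order: abstract, authors, first_author, keywords, title.
        double[] precision = {2.36, 84.3, 96.97, 58.9, 77.77};
        double[] f1 = {2.34, 83.94, 96.61, 59.42, 77.38};
        int[] support = {1989, 1998, 1996, 839, 1999};

        int totalSupport = 0;
        for (int s : support) {
            totalSupport += s;
        }

        System.out.println(String.format(Locale.ROOT, "macro precision: %.2f", macroAverage(precision)));
        System.out.println(String.format(Locale.ROOT, "macro f1: %.2f", macroAverage(f1)));
        System.out.println("total support: " + totalSupport);
    }
}
```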

2 changes: 1 addition & 1 deletion doc/Benchmarking-elife.md
@@ -24,7 +24,7 @@ Evaluation on 984 PDF preprints out of 984 (no failure).

Runtime for processing 984 PDF: **1131** seconds (1.15 seconds per PDF file) on Ubuntu 22.04, 16 CPU (32 threads), 128GB RAM and with a GeForce GTX 1080 Ti GPU.

-Note: with CRF only models runtime is 492s (0.50 seconds per PDF) with 4GPU, 8 threads.
+Note: with CRF only models runtime is 492s (0.50 seconds per PDF) with 4 CPU, 8 threads.



2 changes: 1 addition & 1 deletion doc/Benchmarking-plos.md
@@ -24,7 +24,7 @@ Evaluation on 1000 PDF preprints out of 1000 (no failure).

Runtime for processing 1000 PDF: **999** seconds, (0.99 seconds per PDF) on Ubuntu 22.04, 16 CPU (32 threads), 128GB RAM and with a GeForce GTX 1080 Ti GPU.

-Note: with CRF only models runtime is 304s (0.30 seconds per PDF) with 4GPU, 8 threads.
+Note: with CRF only models runtime is 304s (0.30 seconds per PDF) with 4 CPU, 8 threads.


## Header metadata
2 changes: 1 addition & 1 deletion doc/Benchmarking-pmc.md
@@ -24,7 +24,7 @@ Evaluation on 1943 random PDF PMC files out of 1943 PDF from 1943 different jour

Runtime for processing 1943 PDF: **1467** seconds, (0.75s per PDF) on Ubuntu 22.04, 16 CPU (32 threads), 128GB RAM and with a GeForce GTX 1080 Ti GPU.

-Note: with CRF only models, runtime is 470s (0.24 seconds per PDF) with 4GPU, 8 threads.
+Note: with CRF only models, runtime is 470s (0.24 seconds per PDF) with 4 CPU, 8 threads.


