Alternative articles processing flavors #1202

Merged: 88 commits, Jan 10, 2025

Collaborator

@lfoppiano lfoppiano commented Nov 21, 2024

This PR implements two alternative segmentation flavors for scientific articles:

  • article/light: Segments the document into header and body, and extracts only the title, authors, DOI, and publication date, if available. The body is segmented into paragraphs only.
  • article/light-ref: Segments the document into header, body, and references, and extracts the title, authors, DOI, and publication date, if available. The body is segmented into paragraphs only; references are processed as usual.
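
As a reading aid, the two flavors listed above can be summarized as a small lookup table. This is only an illustrative sketch in Python: the `Flavor` class, the `FLAVORS` dictionary, the field names, and the `processes_references` helper are hypothetical and are not part of the GROBID code base. Only the flavor names and what each one processes are taken from the description.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Flavor:
    """Hypothetical summary of what one segmentation flavor processes."""
    name: str
    segments: Tuple[str, ...]       # top-level zones the document is split into
    header_fields: Tuple[str, ...]  # header fields extracted when available
    body_units: Tuple[str, ...]     # granularity of the body segmentation


# Both flavors extract the same reduced set of header fields.
HEADER_FIELDS = ("title", "authors", "doi", "publication_date")

# Illustrative table only; the keys are the flavor names from the PR description.
FLAVORS: Dict[str, Flavor] = {
    "article/light": Flavor(
        name="article/light",
        segments=("header", "body"),
        header_fields=HEADER_FIELDS,
        body_units=("paragraph",),
    ),
    "article/light-ref": Flavor(
        name="article/light-ref",
        segments=("header", "body", "references"),
        header_fields=HEADER_FIELDS,
        body_units=("paragraph",),
    ),
}


def processes_references(flavor_name: str) -> bool:
    """Return True if the named flavor also segments the reference list."""
    return "references" in FLAVORS[flavor_name].segments
```

In this sketch the only difference between the two entries is the extra `references` segment, which mirrors the one difference between the two bullets.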

The article's body is then composed of two paragraphs:
The first paragraph contains leftovers from the header; since the header extraction may be sparse, this avoids losing data.
The second paragraph contains the full body.
The article body is segmented into head and paragraphs. Tables and figures are embedded into the paragraphs too.

PR #1151 was tested as part of this PR.

@lfoppiano
Collaborator Author

Here is the updated documentation page with all the evaluations.

Base automatically changed from flavor to master January 6, 2025 16:48
# Conflicts:
#	doc/Benchmarking-biorxiv.md
#	grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java
# Conflicts:
#	build.gradle
#	doc/Grobid-specialized-processes.md
#	grobid-core/src/main/java/org/grobid/core/GrobidModels.java
#	grobid-core/src/main/java/org/grobid/core/engines/Engine.java
#	grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java
#	grobid-core/src/main/java/org/grobid/core/main/batch/GrobidMain.java
#	grobid-service/src/main/java/org/grobid/service/GrobidRestService.java
#	grobid-trainer/src/main/java/org/grobid/trainer/HeaderTrainer.java
#	grobid-trainer/src/main/java/org/grobid/trainer/SegmentationTrainer.java
#	grobid-trainer/src/main/java/org/grobid/trainer/TrainerRunner.java
@lfoppiano lfoppiano merged commit a6bea43 into master Jan 10, 2025
4 of 5 checks passed
@lfoppiano lfoppiano deleted the feature/segmentation-light branch January 10, 2025 13:20