grobidclient

A Go client library and CLI for the GROBID document parsing service. To install the CLI:

$ go install github.com/miku/grobidclient/cmd/grobidcli@latest

The CLI and library include functionality:

  • to run parsing on a single PDF file
  • to run parsing recursively on files in a directory
  • to convert TEI XML to a JSON format, akin to grobid-tei-xml (Python, cf. #41)

Usage

The CLI allows you to access the various services, receive parsed XML or JSON results, or process a complete directory of PDF files (in parallel).

░░      ░░░       ░░░░      ░░░       ░░░        ░░       ░░...
▒  ▒▒▒▒▒▒▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒▒▒▒  ▒▒▒▒▒  ▒▒▒▒  ▒...
▓  ▓▓▓   ▓▓       ▓▓▓  ▓▓▓▓  ▓▓       ▓▓▓▓▓▓  ▓▓▓▓▓  ▓▓▓▓  ▓...
█  ████  ██  ███  ███  ████  ██  ████  █████  █████  ████  █...
██      ███  ████  ███      ███       ███        ██       ██...

grobidcli | valid service (-s) names:

  processFulltextDocument
  processHeaderDocument
  processReferences
  processCitationList
  processCitationPatentST36
  processCitationPatentPDF

Note: options passed to grobid API are prefixed with "g-", like "g-ira"

  -H	use sha1 of file contents as the filename
  -O string
    	output directory to write parsed files to
  -P	do a ping, then exit
  -S string
    	server URL (default "http://localhost:8070")
  -T duration
    	client timeout (default 1m0s)
  -W string
    	path to WARC file to extract PDFs and parse them (experimental)
  -c string
    	path to config file, often config.json
  -d string
    	input directory to scan for PDF, txt, or XML files
  -debug
    	use debug result writer, does not create any output files
  -f string
    	single input file to process
  -g-cc
    	grobid: consolidate citations
  -g-ch
    	grobid: consolidate header
  -g-force
    	grobid: force reprocess
  -g-gi
    	grobid: generate ids
  -g-ira
    	grobid: include raw affiliations
  -g-irc
    	grobid: include raw citations
  -g-ss
    	grobid: segment sentences
  -j	output json for a single file
  -n int
    	number of concurrent workers (default 12)
  -r int
    	max retries (default 10)
  -s string
    	a valid service name (default "processFulltextDocument")
  -v	be verbose
  -version
    	show version

Examples:

Process a single PDF file and get back TEI-XML

  $ grobidcli -S localhost:8070 -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf

Process a single PDF file and get back JSON

  $ grobidcli -j -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf

Process a directory of PDF files

  $ grobidcli -d fixtures

Process a single PDF and pretty-print the TEI XML with xmllint:

$ grobidcli -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf | xmllint --format - | head -10
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XML...
        <teiHeader xml:lang="en">
                <fileDesc>
                        <titleStmt>
                                <title level="a" type="main">Split Sex Ratios ...
                                <funder ref="#_ZXgvsGF">
                                        <orgName type="full">Belgian National ...
                                </funder>
                        </titleStmt>

...

Process a single PDF and convert to JSON:

$ grobidcli -j -S http://localhost:8070 -f testdata/pdf/1906.02444.pdf | jq .
{
  "grobid_version": "0.8.0",
  "grobid_ts": "2024-08-27T16:56+0000",
  "header": {
    "authors": [
      {
        "full_name": "Davor Kolar",
        "given_name": "Davor",
        "surname": "Kolar",
        "email": "[email protected]"
      },
      {
        "full_name": "Dragutin Lisjak",
        "given_name": "Dragutin",
        "surname": "Lisjak",
        "email": "[email protected]"
      },
      {
        "full_name": "Michał Paj Ąk",
        "given_name": "Michał",
        "surname": "Paj Ąk"
      },
      {
        "full_name": "Danijel Pavkovic",
        "given_name": "Danijel",
        "surname": "Pavkovic",
        "email": "[email protected]"
      }
    ],
    "date": "2019-06-06",
    "doi": "10.1177/ToBeAssigned",
    "arxiv_id": "1906.02444v1[cs.LG]"
  },
  "pdfmd5": "E04A100BC6A02EFBF791566D6CB62BC9",
  "lang": "en",
  "citations": [
    {
      "authors": [
        {
          "full_name": "O Abdeljaber",
          "given_name": "O",
          "surname": "Abdeljaber"
        },
        {
          "full_name": "O Avci",
          "given_name": "O",
          "surname": "Avci"
        },
        {
          "full_name": "S Kiranyaz",
          "given_name": "S",
          "surname": "Kiranyaz"
        },
        {
          "full_name": "M Gabbouj",
          "given_name": "M",
          "surname": "Gabbouj"
        },
        {
          "full_name": "D J Inman",
          "given_name": "D",
          "middle_name": "J",
          "surname": "Inman"
        }
      ],
      "id": "b0",
      "date": "2017",
      "title": "Real-time vibration-based stru...",
      "journal": "J. Sound Vib",
      "volume": "388",
      "pages": "154-170",
      "first_page": "154",
      "last_page": "170"
    },
    ...
  ],
  "abstract": "Recent trends focusing on Industry 4.0 conce...",
  "body": "Introduction Rotating machines in general consis..."
}

Process PDF files in a directory in parallel:

$ grobidcli -d testdata/pdf
2024/07/30 20:48:35 scanning testdata/pdf/
2024/07/30 20:48:37 got result [200]: testdata/pdf/62-Article Text-140-1-10-20190621.pdf
2024/07/30 20:48:39 got result [200]: testdata/pdf/062RoisinAronAmericanNaturalist03.pdf

By default, the parsed result for each PDF is written to a separate file with the .grobid.tei.xml extension.

Example library usage

Package documentation is available on pkg.go.dev. The example below is taken from the grobidcli tool.

import (
    ...
    "fmt"
    "json"
    "log"
    ...

    "github.com/miku/grobidclient"
    "github.com/miku/grobidclient/tei"
)
    ...
    opts := &grobidclient.Options{
        GenerateIDs:            *generateIDs,
        ConsolidateHeader:      *consolidateHeader,
        ConsolidateCitations:   *consolidateCitations,
        IncludeRawCitations:    *includeRawCitations,
        IncluseRawAffiliations: *includeRawAffiliations,
        TEICoordinates:         []string{
            "ref",
            "figure",
            "persName",
            "formula",
            "biblStruct",
        },
        SegmentSentences:       *segmentSentences,
        Force:                  *forceReprocess,
        Verbose:                *verbose,
        OutputDir:              *outputDir,
        CreateHashSymlinks:     *createHashSymlinks,
    }
    switch {
    case *inputFile != "":
        result, err := grobid.ProcessPDF("my.pdf",
            "processFulltextDocument", opts)
        if err != nil {
            log.Fatal(err)
        }
        switch {
        case *jsonFormat:
            doc, err := tei.ParseDocument(
                bytes.NewReader(result.Body))
            if err != nil {
                log.Fatal(err)
            }
            enc := json.NewEncoder(os.Stdout)
            if err := enc.Encode(doc); err != nil {
                log.Fatal(err)
            }
        case result.StatusCode == 200:
            fmt.Println(result.StringBody())
        default:
            log.Fatal(result)
        }
    ...
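
If TEI XML is already on disk, for example from an earlier grobidcli run, the tei package can be used on its own to convert it to JSON. A minimal sketch, assuming tei.ParseDocument accepts any io.Reader; the filename is only a placeholder.

package main

import (
    "encoding/json"
    "log"
    "os"

    "github.com/miku/grobidclient/tei"
)

func main() {
    // Open a TEI XML file, e.g. written by a previous grobidcli run.
    f, err := os.Open("example.grobid.tei.xml")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // Parse the TEI XML and emit the document as JSON,
    // similar to what "grobidcli -j" does for PDFs.
    doc, err := tei.ParseDocument(f)
    if err != nil {
        log.Fatal(err)
    }
    enc := json.NewEncoder(os.Stdout)
    if err := enc.Encode(doc); err != nil {
        log.Fatal(err)
    }
}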

Notes on server setup

TODO and IDEAS

  • allow to process WARC files
  • allow to group all output from one go into a single file (XML in JSON, really...)
  • try to cache processing; the cache may be keyed on content hash (see the sketch below)

It would be nice to be able to point to a WARC file and parse all PDFs found in that WARC file.

$ grobidcli -W https://is.gd/Jpz7OH -o parsed.json
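
A minimal sketch of the caching idea, keyed on the SHA-1 of the file contents (the same hash the -H flag uses for filenames); the helper below is hypothetical and not part of the current CLI.

package main

import (
    "crypto/sha1"
    "fmt"
    "io"
    "log"
    "os"
)

// cacheKey returns the hex SHA-1 of a file's contents; parsed results could
// be stored under this key so unchanged PDFs are not reprocessed.
func cacheKey(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := sha1.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    key, err := cacheKey("testdata/pdf/1906.02444.pdf")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(key) // look up or store the parsed result under this key
}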