grobidclient

A Go client library and CLI for the GROBID document parsing service. To install the CLI:

$ go install github.com/miku/grobidclient/cmd/grobidcli@latest

The CLI and library include functionality:

  • to run parsing on a single PDF file
  • to run parsing recursively on files in a directory
  • to convert TEI XML to a JSON format, akin to grobid-tei-xml (Python, cf. #41)

Usage

The CLI allows you to access the various services, receive parsed XML or JSON results, or process a complete directory of PDF files (in parallel).

░░      ░░░       ░░░░      ░░░       ░░░        ░░       ░░...
▒  ▒▒▒▒▒▒▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒▒▒▒  ▒▒▒▒▒  ▒▒▒▒  ▒...
▓  ▓▓▓   ▓▓       ▓▓▓  ▓▓▓▓  ▓▓       ▓▓▓▓▓▓  ▓▓▓▓▓  ▓▓▓▓  ▓...
█  ████  ██  ███  ███  ████  ██  ████  █████  █████  ████  █...
██      ███  ████  ███      ███       ███        ██       ██...

grobidcli | valid service (-s) names:

  processFulltextDocument
  processHeaderDocument
  processReferences
  processCitationList
  processCitationPatentST36
  processCitationPatentPDF

Note: options passed to grobid API are prefixed with "g-", like "g-ira"

  -H	use sha1 of file contents as the filename
  -O string
    	output directory to write parsed files to
  -P	do a ping, then exit
  -S string
    	server URL (default "http://localhost:8070")
  -T duration
    	client timeout (default 1m0s)
  -W string
    	path to WARC file to extract PDFs and parse them (experimental)
  -c string
    	path to config file, often config.json
  -d string
    	input directory to scan for PDF, txt, or XML files
  -debug
    	use debug result writer, does not create any output files
  -f string
    	single input file to process
  -g-cc
    	grobid: consolidate citations
  -g-ch
    	grobid: consolidate header
  -g-force
    	grobid: force reprocess
  -g-gi
    	grobid: generate ids
  -g-ira
    	grobid: include raw affiliations
  -g-irc
    	grobid: include raw citations
  -g-ss
    	grobid: segment sentences
  -j	output json for a single file
  -n int
    	number of concurrent workers (default 12)
  -r int
    	max retries (default 10)
  -s string
    	a valid service name (default "processFulltextDocument")
  -v	be verbose
  -version
    	show version

Examples:

Process a single PDF file and get back TEI-XML

  $ grobidcli -S localhost:8070 -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf

Process a single PDF file and get back JSON

  $ grobidcli -j -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf

Process a directory of PDF files

  $ grobidcli -d fixtures

Process a single PDF and pretty-print the TEI XML with xmllint:

$ grobidcli -f testdata/pdf/062RoisinAronAmericanNaturalist03.pdf | xmllint --format - | head -10
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XML...
        <teiHeader xml:lang="en">
                <fileDesc>
                        <titleStmt>
                                <title level="a" type="main">Split Sex Ratios ...
                                <funder ref="#_ZXgvsGF">
                                        <orgName type="full">Belgian National ...
                                </funder>
                        </titleStmt>

...

Process a single PDF and convert to JSON:

$ grobidcli -j -S http://localhost:8070 -f testdata/pdf/1906.02444.pdf | jq .
{
  "grobid_version": "0.8.0",
  "grobid_ts": "2024-08-27T16:56+0000",
  "header": {
    "authors": [
      {
        "full_name": "Davor Kolar",
        "given_name": "Davor",
        "surname": "Kolar",
        "email": "[email protected]"
      },
      {
        "full_name": "Dragutin Lisjak",
        "given_name": "Dragutin",
        "surname": "Lisjak",
        "email": "[email protected]"
      },
      {
        "full_name": "Michał Paj Ąk",
        "given_name": "Michał",
        "surname": "Paj Ąk"
      },
      {
        "full_name": "Danijel Pavkovic",
        "given_name": "Danijel",
        "surname": "Pavkovic",
        "email": "[email protected]"
      }
    ],
    "date": "2019-06-06",
    "doi": "10.1177/ToBeAssigned",
    "arxiv_id": "1906.02444v1[cs.LG]"
  },
  "pdfmd5": "E04A100BC6A02EFBF791566D6CB62BC9",
  "lang": "en",
  "citations": [
    {
      "authors": [
        {
          "full_name": "O Abdeljaber",
          "given_name": "O",
          "surname": "Abdeljaber"
        },
        {
          "full_name": "O Avci",
          "given_name": "O",
          "surname": "Avci"
        },
        {
          "full_name": "S Kiranyaz",
          "given_name": "S",
          "surname": "Kiranyaz"
        },
        {
          "full_name": "M Gabbouj",
          "given_name": "M",
          "surname": "Gabbouj"
        },
        {
          "full_name": "D J Inman",
          "given_name": "D",
          "middle_name": "J",
          "surname": "Inman"
        }
      ],
      "id": "b0",
      "date": "2017",
      "title": "Real-time vibration-based stru...",
      "journal": "J. Sound Vib",
      "volume": "388",
      "pages": "154-170",
      "first_page": "154",
      "last_page": "170"
    },
    ...
  ],
  "abstract": "Recent trends focusing on Industry 4.0 conce...",
  "body": "Introduction Rotating machines in general consis..."
}

Process PDF files in a directory in parallel:

$ grobidcli -d testdata/pdf
2024/07/30 20:48:35 scanning testdata/pdf/
2024/07/30 20:48:37 got result [200]: testdata/pdf/62-Article Text-140-1-10-20190621.pdf
2024/07/30 20:48:39 got result [200]: testdata/pdf/062RoisinAronAmericanNaturalist03.pdf

By default, the parsed result for each PDF is written to a separate file with the .grobid.tei.xml extension.

Example library usage

Package documentation is available on pkg.go.dev. The example below is taken from the grobidcli tool.

import (
    ...
    "fmt"
    "json"
    "log"
    ...

    "github.com/miku/grobidclient"
    "github.com/miku/grobidclient/tei"
)
    ...
    opts := &grobidclient.Options{
        GenerateIDs:            *generateIDs,
        ConsolidateHeader:      *consolidateHeader,
        ConsolidateCitations:   *consolidateCitations,
        IncludeRawCitations:    *includeRawCitations,
        IncluseRawAffiliations: *includeRawAffiliations,
        TEICoordinates:         []string{
            "ref",
            "figure",
            "persName",
            "formula",
            "biblStruct",
        },
        SegmentSentences:       *segmentSentences,
        Force:                  *forceReprocess,
        Verbose:                *verbose,
        OutputDir:              *outputDir,
        CreateHashSymlinks:     *createHashSymlinks,
    }
    switch {
    case *inputFile != "":
        result, err := grobid.ProcessPDF("my.pdf",
            "processFulltextDocument", opts)
        if err != nil {
            log.Fatal(err)
        }
        switch {
        case *jsonFormat:
            doc, err := tei.ParseDocument(
                bytes.NewReader(result.Body))
            if err != nil {
                log.Fatal(err)
            }
            enc := json.NewEncoder(os.Stdout)
            if err := enc.Encode(doc); err != nil {
                log.Fatal(err)
            }
        case result.StatusCode == 200:
            fmt.Println(result.StringBody())
        default:
            log.Fatal(result)
        }
    ...
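
If TEI XML is already on disk, for example from an earlier grobidcli run, the tei package can be used on its own to convert it to JSON. A minimal sketch, assuming tei.ParseDocument accepts any io.Reader; the filename is only a placeholder.

package main

import (
    "encoding/json"
    "log"
    "os"

    "github.com/miku/grobidclient/tei"
)

func main() {
    // Open a TEI XML file, e.g. written by a previous grobidcli run.
    f, err := os.Open("example.grobid.tei.xml")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // Parse the TEI XML and emit the document as JSON,
    // similar to what "grobidcli -j" does for PDFs.
    doc, err := tei.ParseDocument(f)
    if err != nil {
        log.Fatal(err)
    }
    enc := json.NewEncoder(os.Stdout)
    if err := enc.Encode(doc); err != nil {
        log.Fatal(err)
    }
}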

Notes on server setup

TODO and IDEAS

  • allow to process WARC files
  • allow to group all output from one go into a single file (XML in JSON, really...)
  • try to cache processing; the cache may be keyed on content hash (see the sketch below)

It would be nice to be able to point to a WARC file and parse all PDFs found in that WARC file.

$ grobidcli -W https://is.gd/Jpz7OH -o parsed.json
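
A minimal sketch of the caching idea, keyed on the SHA-1 of the file contents (the same hash the -H flag uses for filenames); the helper below is hypothetical and not part of the current CLI.

package main

import (
    "crypto/sha1"
    "fmt"
    "io"
    "log"
    "os"
)

// cacheKey returns the hex SHA-1 of a file's contents; parsed results could
// be stored under this key so unchanged PDFs are not reprocessed.
func cacheKey(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := sha1.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    key, err := cacheKey("testdata/pdf/1906.02444.pdf")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(key) // look up or store the parsed result under this key
}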