# PAWLS CLI

The PAWLS CLI helps manage annotation tasks based on PDFs.

## Installation

  1. Install dependencies

    cd pawls/cli
    python setup.py install
  2. (Optional) Install poppler, the PDF renderer used to convert PDF pages to images when exporting annotations as a COCO-format dataset. Please follow the instructions here.

  3. (Optional) Install Tesseract, the OCR engine used to process scanned documents. Please follow the instructions here.

## Usage

  1. [add] Place or download PDFs into skiff_files/apps/pawls/papers as described below. If you work at AI2, see the internal usage script for doing this here.

Otherwise, you can add PDFs using the command:

pawls add <pdf-or-directory>

By default, pawls creates a unique ID for each PDF by hashing it, and uses that hash to refer to the PDF in the UI. To retain the original PDF name instead, pass the --no-hash flag to pawls add.
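For example, adding a single PDF (the file name here is illustrative):

    pawls add paper.pdf

This stores the PDF under its hash, e.g. skiff_files/apps/pawls/papers/&lt;hash&gt;/&lt;hash&gt;.pdf, following the dataset structure described at the bottom of this document.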

  2. [preprocess] Process the token information for each PDF with the given preprocessor.

    pawls preprocess <preprocessor-name> skiff_files/apps/pawls/papers

    Currently we support the following preprocessors:

    1. pdfplumber
    2. grobid. Note: to use the grobid preprocessor, you need to run docker-compose up in a separate shell, because grobid must be running as a service.
    3. ocr. Note: you may need to install tesseract-ocr to use this preprocessor.
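
    To give a rough sense of what token preprocessing computes, here is a minimal sketch using pdfplumber directly (the library behind the pdfplumber preprocessor). The file name is hypothetical, and the actual on-disk output format is produced by the pawls command above, not by this snippet:

      import pdfplumber

      with pdfplumber.open("paper.pdf") as pdf:
          for page_index, page in enumerate(pdf.pages):
              # extract_words() yields one dict per token,
              # including its bounding box on the page
              for word in page.extract_words():
                  print(page_index, word["text"],
                        (word["x0"], word["top"], word["x1"], word["bottom"]))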
  3. [assign] Assign annotation tasks (<PDF_SHA>s) to specific users:

    pawls assign ./skiff_files/apps/pawls/papers <user> <PDF_SHA>

    Optionally, at this stage you can pass a --name-file argument to pawls assign, which lets you specify a display name for a given PDF (for example, the title of a paper). This should be a JSON file containing sha:name mappings, as in the sketch below.
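
    For instance, a minimal name file might look like this (the SHA placeholders and titles are made up for illustration):

      {
          "<sha-of-pdf-1>": "Title of the First Paper",
          "<sha-of-pdf-2>": "Title of the Second Paper"
      }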

  4. (Optional) [preannotate] Create pre-annotations for the PDFs based on model predictions in anno.json:

    pawls preannotate <labeling_folder> <labeling_config> anno.json -u <user>

    You can find an example of generating pre-annotations in scripts/generate_pdf_layouts.py, loosely sketched below.
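
    As a loose sketch of what such a script does (the model choice, file names, and output schema below are hypothetical; the authoritative example, including the exact anno.json format, is scripts/generate_pdf_layouts.py):

      import json
      import layoutparser as lp
      from pdf2image import convert_from_path  # requires poppler

      # An off-the-shelf layout detection model (example choice)
      model = lp.Detectron2LayoutModel(
          "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")

      records = []
      for page_index, image in enumerate(convert_from_path("paper.pdf")):
          for block in model.detect(image):
              x1, y1, x2, y2 = block.coordinates
              records.append({"page": page_index,
                              "label": block.type,  # e.g. "Title", "Text"
                              "bounds": [float(x1), float(y1),
                                         float(x2), float(y2)]})

      # hypothetical output schema; see the script above for the real one
      with open("anno.json", "w") as f:
          json.dump(records, f)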

  5. [status] Check annotation status for the <labeling_folder>:

    pawls status <labeling_folder>
    1. Save the labeling record table:
      pawls status <labeling_folder> --output record.csv
  6. [metric] Check Inter-Annotator Agreement (IAA):

    pawls metric <labeling_folder> <config_file> \
        --textual-categories cat1,cat2 --non-textual-categories cat3,cat4

    For blocks, we measure consistency using mAP scores, a common metric in object detection tasks that evaluates block category agreement at different overlap levels.

    For textual regions, we measure consistency based on token categories: each PDF token is assigned the category of the block that contains it, and the labels of the same token are compared across annotators. The agreement level is measured via token accuracy.

    The command prints a matrix whose (i, j)-th element is computed by treating the annotations from annotator i as the ground truth and those from annotator j as the predictions, as in the sketch below.
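
    A minimal sketch of the token-accuracy computation described above, using hypothetical per-annotator token labels (the real implementation lives inside the pawls CLI):

      def token_accuracy(truth, pred):
          # fraction of tokens whose labels agree
          return sum(t == p for t, p in zip(truth, pred)) / len(truth)

      # hypothetical labels for the same token sequence from two annotators
      labels = {"alice": ["Title", "Text", "Text", "Figure"],
                "bob":   ["Title", "Text", "Table", "Figure"]}

      # entry (i, j): i's annotations as ground truth, j's as predictions
      matrix = {(i, j): token_accuracy(labels[i], labels[j])
                for i in labels for j in labels}
      print(matrix[("alice", "bob")])  # 0.75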

    1. Save the IAA report to <save-folder>:

      pawls metric <labeling_folder> <config_file> \
          --textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
          --save <save-folder>

      It will create block-eval.csv and textual-eval.csv in that folder, containing the block and textual-region IAA respectively.

    2. Specify annotators for calculating IAA:

      pawls metric <labeling_folder> <config_file> \
          --textual-categories cat1,cat2 --non-textual-categories cat3,cat4 \
          -u <annotator1> -u <annotator2>
  7. [export] Export the annotated dataset to the specified format. Currently we support exporting to the COCO format and the token table format.

    1. Export all annotations from all annotators in a project:

      pawls export <labeling_folder> <labeling_config> <output_path> <format>
    2. Export only finished annotations of a given annotator, e.g. markn:

      pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn
    3. Export all annotations (including unfinished ones) from a given annotator:

      pawls export <labeling_folder> <labeling_config> <output_path> <format> -u markn --include-unfinished
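
    For example, a COCO-format export for all annotators might look like the following (the paths are illustrative, and coco is assumed as the format name based on the formats listed above):

      pawls export skiff_files/apps/pawls/papers configuration.json ./export coco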

## Dataset structure

PDFs are expected to be in a directory structure with a single PDF per folder, where each folder's name is a unique ID corresponding to that PDF. For example:

    top_level/
    ├───pdf1/
    │     └───pdf1.pdf
    └───pdf2/
          └───pdf2.pdf

Adding PDFs exclusively via pawls add maintains this structure by default.