***** Updated on Feb 6th, 2021: Train Ticket dataset is now available for academic research. You can download from Google Drive or OneDrive. It contains 1,530 synthetic images and 320 real images for training, and 80 real images for testing. Please refer to our paper for more details about how to sample training/testing set from EATEN and generate the corresponding annotations.*****
***** Updated on Sep 17th, 2020: A training example on the large-scale document understanding dataset, DocBank, is now available. Please refer to examples/DocBank/README.md for more details. Thanks TengQi Ye for this contribution.*****
PyTorch reimplementation of "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020). This project is different from our original implementation.
- Contents
PICK is a framework that is effective and robust in handling complex documents layout for Key Information Extraction (KIE) by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Overall architecture shown follows.
- python = 3.6
- torchvision = 0.6.1
- tabulate = 0.8.7
- overrides = 3.0.0
- opencv_python = 4.3.0.36
- numpy = 1.16.4
- pandas = 1.0.5
- allennlp = 1.0.0
- torchtext = 0.6.0
- tqdm = 4.47.0
- torch = 1.5.1
pip install -r requirements.txt
Modify the configurations in config.json
and dist_train.sh
files, then run:
bash dist_train.sh
The application will be launched via launch.py
on a 4 GPU node with one process per GPU (recommend).
This is equivalent to
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=4 \
--master_addr=127.0.0.1 --master_port=5555 \
train.py -c config.json -d 1,2,3,4 --local_world_size 4
and is equivalent to specify indices of available GPUs by CUDA_VISIBLE_DEVICES
instead of -d
args
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=4 \
--master_addr=127.0.0.1 --master_port=5555 \
train.py -c config.json --local_world_size 4
Similarly, it can be launched with a single process that spans all 4 GPUs (if node has 4 available GPUs) using (don't recommend):
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=1 \
--master_addr=127.0.0.1 --master_port=5555 \
train.py -c config.json --local_world_size 1
You can enable multi-node multi-GPU training by setting nnodes
and node_rank
args of the commandline line on every node.
e.g., 2 nodes 4 gpus run as follows
Node 1, ip: 192.168.0.10, then run on node 1 as follows
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=4 \
--master_addr=192.168.0.10 --master_port=5555 \
train.py -c config.json --local_world_size 4
Node 2, ip: 192.168.0.15, then run on node 2 as follows
CUDA_VISIBLE_DEVICES=2,4,6,7 python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=4 \
--master_addr=192.168.0.10 --master_port=5555 \
train.py -c config.json --local_world_size 4
You can resume from a previously saved checkpoint by:
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=4 \
--master_addr=127.0.0.1 --master_port=5555 \
train.py -d 1,2,3,4 --local_world_size 4 --resume path/to/checkpoint
This option of training mode can debug code without distributed way. -dist
must set to false
to
turn off distributed mode. -d
specify which one gpu will be used.
python train.py -c config.json -d 1 -dist false
You can test from a previously saved checkpoint by:
python test.py --checkpoint path/to/checkpoint --boxes_transcripts path/to/boxes_transcripts \
--images_path path/to/images_path --output_folder path/to/output_folder \
--gpu 0 --batch_size 2
You can train your own datasets following the steps outlined below.
- Prepare the correct format of files as provided in
data
folder.- Please see data/README.md an instruction how to prepare the data in required format for PICK.
- Modify
train_dataset
andvalidation_dataset
args inconfig.json
file, includingfiles_name
,images_folder
,boxes_and_transcripts_folder
,entities_folder
,iob_tagging_type
andresized_image_size
. - Modify
Entities_list
inutils/entities_list.py
file according to the entity type of your dataset. - Modify
keys.txt
inutils/keys.txt
file if needed according to the vocabulary of your dataset. - Modify
MAX_BOXES_NUM
andMAX_TRANSCRIPT_LEN
indata_tuils/documents.py
file if needed.
Note: The self-build datasets our paper used cannot be shared for patient privacy and proprietary issues.
You can specify the name of the training session in config.json
files:
"name": "PICK_Default",
"run_id": "test"
The checkpoints will be saved in save_dir/name/run_id_timestamp/checkpoint_epoch_n
, with timestamp in mmdd_HHMMSS format.
A copy of config.json
file will be saved in the same folder.
Note: checkpoints contain:
{
'arch': arch,
'epoch': epoch,
'state_dict': self.model.state_dict(),
'optimizer': self.optimizer.state_dict(),
'monitor_best': self.monitor_best,
'config': self.config
}
This project supports Tensorboard visualization by using either torch.utils.tensorboard
or TensorboardX.
-
Install
If you are using pytorch 1.1 or higher, install tensorboard by 'pip install tensorboard>=1.14.0'.
Otherwise, you should install tensorboardx. Follow installation guide in TensorboardX.
-
Run training
Make sure that
tensorboard
option in the config file is turned on."tensorboard" : true
-
Open Tensorboard server
Type
tensorboard --logdir saved/log/
at the project root, then server will open athttp://localhost:6006
By default, values of loss will be logged. If you need more visualizations, use add_scalar('tag', data)
, add_image('tag', image)
, etc in the trainer._train_epoch
method.
add_something()
methods in this project are basically wrappers for those of tensorboardX.SummaryWriter
and torch.utils.tensorboard.SummaryWriter
modules.
Note: You don't have to specify current steps, since WriterTensorboard
class defined at logger/visualization.py
will track current steps.
-
Dataset cache mechanism to speed up training loop - Multi-node multi-gpu setup (DistributedDataParallel)
If you find this code useful please cite our paper:
@inproceedings{Yu2020PICKPK,
title={{PICK}: Processing Key Information Extraction from Documents using
Improved Graph Learning-Convolutional Networks},
author={Wenwen Yu and Ning Lu and Xianbiao Qi and Ping Gong and Rong Xiao},
booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
year={2020}
}
This project is licensed under the MIT License. See LICENSE for more details.
This project structure takes example by PyTorch Template Project.