Reading order and multi-column #27

lgeo3 · 2021-04-28T14:48:30Z

Hi, thank you very much for your publication and this GitHub repository.
I tried to reproduce some of the paper's results by first training a BERT network on the DocBank dataset, but I failed to reach performance similar to that reported in the paper. One of my hypotheses concerns the order of the words in the input that I provide to BERT.

Looking at the data, it appears that in some examples the words are not in reading order but in left-to-right order across the page. For example, in the file 10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt, the text jumps from one column to the second one.

As I understand it, reading order is really important for using or fine-tuning BERT.

Moreover, your publication suggests that you used the dataset in reading order (see the section "We organize the DocBank dataset using the reading order").

Here are my questions:

Can you confirm that the words in the .txt files are not necessarily in reading order?
Do you provide the dataset in reading order?

Thank you again, and I hope that my questions make sense.

liminghao1630 · 2021-05-17T06:02:45Z

We provide the words in the order in which PDFPlumber extracts them.
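
For readers who need column-wise reading order, here is a minimal sketch of one possible re-sorting heuristic. It is not the method used to build DocBank. It assumes each word is available as a `(token, x0, y0, x1, y1)` tuple, that the page is a plain two-column layout split at the horizontal midpoint, and that coordinates run from 0 to `page_width`. The function name `reading_order` and the parameters `page_width` and `line_tol` are hypothetical.

```python
from typing import List, Tuple

# Assumed word representation: (token, x0, y0, x1, y1)
Word = Tuple[str, float, float, float, float]


def reading_order(
    words: List[Word],
    page_width: float = 1000.0,  # assumption: coordinate range of the page
    line_tol: float = 5.0,       # assumption: vertical bucket size for one text line
) -> List[Word]:
    """Reorder words for a two-column page: left column first, then right column."""
    mid = page_width / 2.0
    left: List[Word] = []
    right: List[Word] = []
    for w in words:
        # Assign each word to a column by the horizontal center of its box.
        center_x = (w[1] + w[3]) / 2.0
        (left if center_x < mid else right).append(w)

    def sort_column(col: List[Word]) -> List[Word]:
        # Bucket words into lines by quantized y0, then order by x0 within a line.
        return sorted(col, key=lambda w: (round(w[2] / line_tol), w[1]))

    return sort_column(left) + sort_column(right)


if __name__ == "__main__":
    # Words listed left-to-right across both columns, as in the issue.
    page = [
        ("A", 100, 10, 150, 20),
        ("C", 600, 10, 650, 20),
        ("B", 100, 30, 150, 40),
        ("D", 600, 30, 650, 40),
    ]
    print([w[0] for w in reading_order(page)])  # ['A', 'B', 'C', 'D']
```

This heuristic ignores full-width elements such as titles, abstracts, and wide figures or tables, so pages with mixed layouts would need a more careful column detection step.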
