Reading order and multi-column #27

Open
lgeo3 opened this issue Apr 28, 2021 · 1 comment

@lgeo3

lgeo3 commented Apr 28, 2021

Hi, thank you very much for your publication and this GitHub repository.
I tried to reproduce some of the paper's results by first training a BERT network on the DocBank dataset, but I failed to reach performance similar to that reported in the paper. One of my hypotheses concerns the order of the words in the input that I provide to BERT.

When looking at the data, it appears to me that, in some examples, the words are not in reading order but in left-to-right order. For example, in the file 10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt, the word sequence jumps from the first column to the second.

[Screenshot 2021-04-28 at 15 27 20]

[Screenshot 2021-04-28 at 15 26 10]
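
To make the distinction concrete, here is a minimal sketch of re-sorting words from left-to-right order into column-wise reading order. It is only an illustration under loud assumptions, not the method used by the DocBank authors: it assumes each word comes with a bounding box on a 0 to 1000 grid, that the page has exactly two columns split at the horizontal midpoint, and that words on the same line share the same top coordinate. The names `Token`, `sort_two_column`, and `line_tol` are hypothetical and chosen just for this example.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One word with its bounding box (hypothetical structure for this sketch)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def sort_two_column(tokens: List[Token],
                    page_width: int = 1000,
                    line_tol: int = 5) -> List[Token]:
    """Return tokens ordered left column first (top to bottom), then right column.

    Assumption: exactly two columns separated at page_width / 2.
    """
    mid = page_width / 2

    def key(t: Token):
        # Assign the token to a column by the horizontal center of its box.
        column = 0 if (t.x0 + t.x1) / 2 < mid else 1
        # Bucket the top coordinate so words on one line compare by x0.
        line = t.y0 // line_tol
        return (column, line, t.x0)

    return sorted(tokens, key=key)


# Words as they might appear in left-to-right order across both columns.
words = [
    Token("A", 100, 100, 150, 110),  # left column, line 1
    Token("C", 600, 100, 650, 110),  # right column, line 1
    Token("B", 100, 120, 150, 130),  # left column, line 2
    Token("D", 600, 120, 650, 130),  # right column, line 2
]

print(" ".join(t.text for t in sort_two_column(words)))  # prints "A B C D"
```

This heuristic breaks on full-width elements such as titles, abstracts, or wide figures and tables, and the fixed bucketing can split a line whose words have slightly different top coordinates, so real pages would need a more careful layout analysis.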

As I understand it, reading order is really important for using or fine-tuning BERT.

Moreover, your publication indicates that you used the dataset in reading order ("We organize the DocBank dataset using the reading order" section).

Here are my questions:

  • Can you confirm that the words in the .txt files are not necessarily in reading order?
  • Do you provide the dataset in reading order?

Thank you again, and I hope that my questions make sense.

@liminghao1630
Collaborator

We provide the words in the order in which PDFPlumber extracts them.
