Hi, thank you very much for your publication and this GitHub repository.
I tried to reproduce some of the paper's results by first training a BERT network on the DocBank dataset, but I fail to reach performance similar to what is reported in the paper. One of my hypotheses concerns the order of the words in the input that I provide to BERT.
Looking at the data, it appears that in some examples the words are not in reading order but in plain left-to-right order across the page. For example, in the file 10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt, the text jumps from one column to the second one.
As I understand it, reading order is really important for using or fine-tuning BERT.
Moreover, your publication suggests that you used the dataset in reading order (the "We organize the DocBank dataset using the reading order" section).
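A minimal sketch of how such a column jump could be detected, in case it helps to reproduce the observation. It rests on assumptions that are not confirmed here: that each line of the .txt file is tab-separated and starts with the token followed by x0, y0, x1, y1, and that coordinates are normalized to a 0-1000 range (hence the page middle at 500). The function names, tolerances, and thresholds are hypothetical, and the heuristic only targets two-column pages.

```python
from typing import List, Tuple

# (text, x0, y0, x1, y1); this field layout is an assumption about the .txt files.
Token = Tuple[str, int, int, int, int]


def parse_tokens(text: str) -> List[Token]:
    """Parse lines assumed to look like: token<TAB>x0<TAB>y0<TAB>x1<TAB>y1<TAB>..."""
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        word = fields[0]
        x0, y0, x1, y1 = (int(v) for v in fields[1:5])
        tokens.append((word, x0, y0, x1, y1))
    return tokens


def count_same_line_column_jumps(
    tokens: List[Token],
    page_mid: int = 500,  # assumed middle of a 0-1000 coordinate range
    y_tol: int = 5,       # hypothetical tolerance for "same text line"
    min_gap: int = 20,    # hypothetical minimum width of the column gutter
) -> int:
    """Count consecutive token pairs that stay on one text line but cross
    from the left half of the page to the right half over a wide gap.

    On a two-column page in reading order this should be close to zero;
    a large count suggests the columns are interleaved line by line.
    """
    jumps = 0
    for (_, _, py0, px1, _), (_, cx0, cy0, _, _) in zip(tokens, tokens[1:]):
        same_line = abs(cy0 - py0) <= y_tol
        crosses_mid = px1 <= page_mid <= cx0
        wide_gap = cx0 - px1 >= min_gap
        if same_line and crosses_mid and wide_gap:
            jumps += 1
    return jumps
```

Applied to the parsed contents of one file, a count that grows with the number of text lines would be consistent with the interleaved order described above, while a count near zero would not. Full-width elements such as titles are not handled specially, so the result is only an indication.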
Here are my questions:
Can you confirm that the words in the .txt files are not necessarily in reading order?
Do you provide the dataset in reading order?
Thank you again, and I hope that my questions make sense.