Not extracting when PDF is large #1162

victorcasignia · 2024-09-05T14:43:29Z

Using Docker to run the service.

I ran it on a 21-page PDF. It only extracts up to page 9, then jumps to the bibliography. How do I resolve this?

lfoppiano · 2024-09-05T18:50:59Z

Hi @victorcasignia,
Which document are you processing? Is it a scientific article?

Could you provide the document so I can make some tests?

victorcasignia · 2024-09-12T09:11:07Z

@lfoppiano Yes. I used this document https://arxiv.org/pdf/2307.01952

lfoppiano · 2024-09-13T18:18:15Z

Hi @victorcasignia, I went through the document and the body seems to be processed correctly. You can double-check that the whole body is in the output. Even the section headings are numbered correctly.

Now, the issues are all in the Appendix, which is larger than the body of the article. The first part is correctly handled until "blue" (where the document ends). After that, the model decides that the text is body, so all the content after page 21 is actually appended after the "Future work" section.
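
For anyone who wants to check this on their own output, here is a minimal sketch that lists the section headings found in the body and in the back matter of a TEI result. It assumes the usual layout of the TEI output (TEI namespace, `text/body` and `text/back`, sections as `div` elements with a `head`); the function name `list_section_heads` and the tiny `SAMPLE_TEI` document are made up for illustration and are not taken from the real output of this paper.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written sample; replace it with the XML returned by the service.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head n="1">Introduction</head><p>Text.</p></div>
      <div><head n="2">Future work</head><p>Text.</p></div>
      <div><head>Appendix text misplaced in the body</head><p>Text.</p></div>
    </body>
    <back>
      <div type="annex"><div><head n="A">Appendix</head><p>Text.</p></div></div>
    </back>
  </text>
</TEI>"""


def list_section_heads(tei_xml: str) -> dict:
    """Return the section headings found under <body> and <back>, in document order."""
    root = ET.fromstring(tei_xml)
    result = {}
    for part in ("body", "back"):
        heads = []
        for container in root.iterfind(f".//tei:text/tei:{part}", TEI_NS):
            for head in container.iterfind(".//tei:div/tei:head", TEI_NS):
                number = head.get("n", "")  # section number, when present
                title = "".join(head.itertext()).strip()
                heads.append(f"{number} {title}".strip())
        result[part] = heads
    return result


if __name__ == "__main__":
    for part, heads in list_section_heads(SAMPLE_TEI).items():
        print(part)
        for h in heads:
            print("  ", h)
```

Headings that show up under `body` after the last real section (here, after "Future work") are a hint that appendix content was labelled as body.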

From our end, we could flag this issue so that we can use the document as training data for the fulltext model, but that will be for version 0.8.2.

lfoppiano · 2024-11-29T09:02:50Z

I added this document as training data in #1200 and the result looks better; however, the fulltext model misclassifies part of the text as figures/tables.

Repository owner deleted a comment from zhoukaigo Oct 25, 2024

lfoppiano added the error cases and models:segmentation labels Nov 28, 2024

lfoppiano added the models:fulltext label Nov 29, 2024
