Document too long at transforms #1852

Open · 1 task done
itogaston opened this issue Jan 16, 2025 · 1 comment
Labels: answered 🤖 (The question has been answered. Will be closed automatically if no new comments) · bug (Something isn't working) · question (Further information is requested)

Comments

@itogaston

  • I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
I have a really long PDF (475 pages), and it raises this error: "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
I think the problem is not that the document is too short but the opposite.

The code that classifies documents by size:

# from the transform: count how many documents fall into each token-length bin
bin_ranges = [(0, 100), (101, 500), (501, 100000)]
result = count_doc_length_bins(documents, bin_ranges)
result = {k: v / len(documents) for k, v in result.items()}

Debugging my code, I found the size of my document is 390,000. That exceeds the upper limit of the last bin, so there is no bin to place it in, and it falls through to the default condition, which raises the exception above.
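
A minimal sketch of how I understand the binning logic (the function body and token counting here are my assumptions, not the actual ragas source) reproduces the fall-through:

# Sketch only: assumed shape of count_doc_length_bins, not the real implementation.
def count_doc_length_bins(documents, bin_ranges):
    counts = {r: 0 for r in bin_ranges}
    for doc in documents:
        n_tokens = len(doc.page_content.split())  # rough token count for illustration
        for low, high in bin_ranges:
            if low <= n_tokens <= high:
                counts[(low, high)] += 1
                break
    # a 390,000-token document matches no (low, high) pair, so it is never
    # counted, and the default condition treats the input as "too short"
    return counts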

I think the document should be processed like those in the last bin, or an exception should be raised saying the document is too big.
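
For example, one way to express the first option (a sketch, assuming the range check is inclusive) is to leave the last bin open-ended so oversized documents still land somewhere:

import math
# Sketch of the suggested fix: an unbounded last bin means very long
# documents are counted there instead of hitting the default error path.
bin_ranges = [(0, 100), (101, 500), (501, math.inf)]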

@itogaston itogaston added the question Further information is requested label Jan 16, 2025
@dosubot dosubot bot added the bug Something isn't working label Jan 16, 2025
@shahules786 (Member)

Hey, you're right about the error; it should have been the other way around. That said, feeding in a document that is 475 pages long would be very expensive on the extraction side of things: for example, it will try to extract headings from every page of the document. The smart thing to do is to split the document into several smaller documents and then process a subset of them for test generation, as sketched below. We understand this is a hack; we will take it into consideration when we roll out the next big iteration of test generation.
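
A rough sketch of that workaround (the Document class and chunk size here are assumptions for illustration; adapt them to however your documents are loaded):

from langchain_core.documents import Document

def split_document(doc: Document, max_chars: int = 20_000) -> list[Document]:
    # Slice one long document into several smaller ones, carrying
    # the original metadata along with each chunk.
    text = doc.page_content
    return [
        Document(page_content=text[i:i + max_chars], metadata=dict(doc.metadata))
        for i in range(0, len(text), max_chars)
    ]

# Then run test generation on a subset of the chunks, e.g.:
# small_docs = split_document(long_doc)[:20]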

@shahules786 shahules786 added the answered 🤖 The question has been answered. Will be closed automatically if no new comments label Jan 18, 2025