Split documents by page #3074
aleksitukiainen started this conversation in Ideas
-
Hi @aleksitukiainen! I think this feature would be a nice extension of #2932, where we added the page number as metadata to Documents. Could you raise a feature request issue for it? Also, would you be interested in contributing this feature to Haystack yourself? Otherwise, we will put it in our backlog and work on it ourselves.
-
Hi, I think one natural extension of the PreProcessor.split(split_by: str) method would be to also enable splitting a document by page. A single page often covers one subtopic, so splitting by page would tend to produce coherent chunks. I'm not sure how many others need this, but since document splitting is one of the key ways of producing suitably sized chunks for retrievers and readers, I think it would be a useful addition.
I need this right now, but will probably build a manual workaround in the meantime.
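A manual workaround could be sketched roughly like this. It is plain Python and not Haystack's API: it assumes the extracted text marks page breaks with a form-feed character ("\f"), as many PDF-to-text tools do, and the function name `split_by_page` and the `page` metadata key are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def split_by_page(text: str, page_sep: str = "\f", keep_empty: bool = False) -> List[Dict]:
    """Split text into one chunk per page and record a 1-based page number.

    Assumption: pages are delimited by `page_sep` (form feed by default).
    """
    chunks = []
    for page_number, page in enumerate(text.split(page_sep), start=1):
        content = page.strip()
        if not content and not keep_empty:
            # Skip blank pages; later pages keep their original numbers.
            continue
        chunks.append({"content": content, "meta": {"page": page_number}})
    return chunks


# Four pages, the third one blank.
pages = split_by_page("Intro text\fMethods text\f\fResults text")
for p in pages:
    print(p["meta"]["page"], p["content"])
```

Each resulting dict could then be turned into a separate document before indexing, with the page number carried along as metadata.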
Thanks!