Is your feature request related to a problem? Please describe.
This is related to issue #8783 on making it easier to work with CSV-style documents in Haystack.
We've been working with more clients who have large and sometimes complicated Excel and CSV files that often contain multiple tables within one spreadsheet.
We've found that keeping the document size manageable is necessary in RAG use cases, so ideally we would be able to split these spreadsheets into their separate tables. Otherwise, the single massive table is too large to be retrieved effectively and often takes up too much space in the LLM context window.
Describe the solution you'd like
Therefore, it would be great to have a component that could split these single massive tables into multiple smaller tables. I think it would make the most sense to create a separate CSV document splitter to handle this rather than expand our existing DocumentSplitter, but I'm open to discussion.
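To make the idea concrete, here is a minimal sketch of the simplest case: tables stacked vertically and separated by blank rows. It uses only the Python standard library, and `split_on_blank_rows` is a hypothetical helper name for illustration, not an existing Haystack API. It assumes a row counts as a separator when it is empty or all of its cells are empty.

```python
import csv
import io


def split_on_blank_rows(csv_text: str) -> list[list[list[str]]]:
    """Split CSV text into tables separated by one or more blank rows.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    tables: list[list[list[str]]] = []
    current: list[list[str]] = []
    for row in csv.reader(io.StringIO(csv_text)):
        if any(cell.strip() for cell in row):
            # Row has content, so it belongs to the table being collected.
            current.append(row)
        elif current:
            # Blank row closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


sample = "a,b\n1,2\n,\n,\nx,y,z\n3,4,5\n"
for table in split_on_blank_rows(sample):
    print(table)
```

Each returned table is a list of rows, which could then be serialized back to CSV and wrapped in its own document. Whether blank rows are the right separator will depend on the files in question.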
Additional context
Here is an example CSV I created that has two tables combined into a single large table.
@sjrl did you also encounter situations with side-by-side CSVs? For example
Your example would only need a vertical split, but a side-by-side split is more complex.
Hey @alex-stoica, yes, I've also run into side-by-side CSVs, and I agree the example you show would require more complexity. If you have some example CSV files with a variety of structures, please link them here! My initial post wasn't meant to be fully comprehensive with examples, but just to start the conversation.
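As a rough illustration of the side-by-side case discussed above, one possible approach is to treat fully empty columns as separators, mirroring the blank-row idea. This is only a sketch under that assumption; `split_on_blank_columns` is a hypothetical name and not an existing Haystack API, and it does not handle layouts that mix vertical and horizontal stacking.

```python
import csv
import io


def split_on_blank_columns(csv_text: str) -> list[list[list[str]]]:
    """Split side-by-side tables separated by one or more empty columns.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max((len(row) for row in rows), default=0)
    # Pad ragged rows so every row can be indexed by column.
    grid = [row + [""] * (width - len(row)) for row in rows]

    tables: list[list[list[str]]] = []
    start = None  # first column of the table currently being collected
    for col in range(width + 1):
        # The position past the last column acts as a final separator.
        blank = col == width or not any(row[col].strip() for row in grid)
        if not blank and start is None:
            start = col
        elif blank and start is not None:
            tables.append([row[start:col] for row in grid])
            start = None
    return tables


sample = "a,b,,x,y\n1,2,,3,4\n"
for table in split_on_blank_columns(sample):
    print(table)
```

Files that combine both layouts would need something more involved, for example applying the row and column splits recursively until no separator remains.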
Attachment (from the original post): two-tables-in-one.csv