Create a CSV Document splitter #8784

Open
sjrl opened this issue Jan 29, 2025 · 2 comments · May be fixed by #8795 or #8815
Labels
P2 (Medium priority, add to the next sprint if no P1 available), type:feature (New feature or request)

Comments

@sjrl
Contributor

sjrl commented Jan 29, 2025

Is your feature request related to a problem? Please describe.
This is related to issue #8783, which aims to make it easier to work with CSV-style documents in Haystack.

We've been working with more clients who have large and sometimes complicated Excel and CSV files that often contain multiple tables within one spreadsheet.

We've found that keeping the document size manageable is necessary in RAG use cases, so ideally we would be able to split these spreadsheets into their separate tables. Otherwise, the single massive table is too large to be retrieved effectively and often takes up too much space in the LLM context window.

Describe the solution you'd like
Therefore, it would be great to have a component that could split these single massive tables into multiple smaller tables. I think it would make the most sense to create a separate CSV Document splitter to handle this rather than expand our existing DocumentSplitter, but I'm open to discussion.

Additional context
Here is an example CSV I created that has two tables combined into a single large table.

two-tables-in-one.csv
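
To make the request concrete, here is a minimal sketch of the simplest case: tables stacked on top of each other in one CSV. It assumes the tables are separated by at least one fully empty row, which may not hold for every file. The function name `split_on_empty_rows` is hypothetical; this is not an existing Haystack API and not necessarily how the linked PRs approach it.

```python
import csv
import io


def split_on_empty_rows(csv_text: str) -> list[list[list[str]]]:
    """Split CSV text into tables, treating fully empty rows as separators.

    Hypothetical helper for illustration only. Assumes stacked tables are
    separated by one or more rows whose cells are all empty or whitespace.
    """
    tables: list[list[list[str]]] = []
    current: list[list[str]] = []
    for row in csv.reader(io.StringIO(csv_text)):
        if any(cell.strip() for cell in row):
            # Row has content, so it belongs to the table being collected.
            current.append(row)
        elif current:
            # Empty row closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Two stacked tables separated by two empty rows.
sample = "a,b\n1,2\n,\n,\nx,y,z\n3,4,5\n"
tables = split_on_empty_rows(sample)
```

Each returned table is a list of rows, so it could be written back out as its own CSV string and wrapped in a separate document. A real component would probably also need a threshold for how many empty rows count as a separator.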

@sjrl sjrl added the type:feature New feature or request label Jan 29, 2025
@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Jan 31, 2025
@alex-stoica

@sjrl did you also encounter situations with side-by-side CSVs? For example:
Image
Your example would only need a vertical split, but a side-by-side split requires more complexity.
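
For the side-by-side case, one possible starting point is to look for fully empty columns instead of empty rows. The sketch below assumes the tables are separated by at least one column that is empty in every row, and the name `split_on_empty_columns` is hypothetical. It does not handle side-by-side tables of different heights that also need a row split, which is part of the extra complexity mentioned above.

```python
import csv
import io


def split_on_empty_columns(csv_text: str) -> list[list[list[str]]]:
    """Split CSV text into tables, treating fully empty columns as separators.

    Hypothetical helper for illustration only. Assumes side-by-side tables
    are separated by one or more columns that are empty in every row.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max((len(row) for row in rows), default=0)
    # Pad ragged rows so every row has the same number of cells.
    grid = [row + [""] * (width - len(row)) for row in rows]

    tables: list[list[list[str]]] = []
    start = None  # index of the first column of the table being collected
    for col in range(width + 1):
        # The position one past the last column acts as a final separator.
        is_empty = col == width or not any(row[col].strip() for row in grid)
        if not is_empty and start is None:
            start = col
        elif is_empty and start is not None:
            tables.append([row[start:col] for row in grid])
            start = None
    return tables


# Two tables next to each other, separated by one empty column.
side_by_side = "a,b,,x,y\n1,2,,3,4\n5,6,,7,8\n"
left, right = split_on_empty_columns(side_by_side)
```

Applying a row split and a column split recursively to each resulting block would be one way to cover mixed layouts, though that is an assumption about the design rather than something settled in this thread.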

@sjrl
Contributor Author

sjrl commented Feb 3, 2025

Hey @alex-stoica, yes, I've also run into side-by-side CSVs, and I agree the example you show would require more complexity. If you have some example CSV files with a variety of structures, please link them here! My initial post wasn't meant to be fully comprehensive with examples, but just to start the conversation.

3 participants