Investigate the possibility of removing dataframe field from Document #8627

Closed
anakin87 opened this issue Dec 11, 2024 · 3 comments
Labels
P2 Medium priority, add to the next sprint if no P1 available

Comments

@anakin87
Member

I've been thinking about dropping the `dataframe` field from the Haystack `Document` dataclass, for a few reasons:

  • Users already use text representations (CSV, Markdown), which LLMs handle well, even for data that was originally tabular.
  • Pandas DataFrames create serialization headaches (e.g. in Hayhooks).
  • Pandas is a heavy dependency that complicates deployment, especially in serverless environments like Lambda. We could make it optional.
  • Supporting dataframes across different Document Stores requires complex workarounds.
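
To illustrate the first point, here is a minimal sketch of keeping tabular data as text instead of a DataFrame. It uses only the standard library. `SimpleDocument`, `rows_to_csv`, and `rows_to_markdown` are hypothetical names for illustration, not Haystack APIs, and the table contents are toy data.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document that only carries text content."""

    content: str
    meta: dict = field(default_factory=dict)


def rows_to_csv(header, rows):
    """Render a table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_markdown(header, rows):
    """Render a table as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Toy table, made up for this example.
header = ["item", "qty"]
rows = [["apple", 3], ["pear", 5]]

csv_doc = SimpleDocument(content=rows_to_csv(header, rows), meta={"format": "csv"})
md_doc = SimpleDocument(content=rows_to_markdown(header, rows), meta={"format": "markdown"})
```

Either string can be placed in a plain text field and passed to an LLM prompt as-is, so no DataFrame-specific serialization is needed.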

I will reach out to internal and external users to validate my assumptions.
We should also investigate how impactful this change would be.

@anakin87
Member Author

@EdAbati, the author of dataframes-haystack, confirmed that this idea makes sense to him.
@sjrl too.

@anakin87
Member Author

anakin87 commented Jan 8, 2025

Possible plan

  • Gather more feedback on the idea from the community with a GitHub discussion/Discord announcement (📢 Potential removal of `dataframe` field in Haystack `Document` - We need your feedback! #8688)
  • Deprecate dataframe field and ExtractedTableAnswer (in 2.10 release?)
  • Remove dataframe support from Document Stores
    • Remove dataframe support from InMemoryDocumentStore (after 2.10)
    • Remove dataframe support from Document Stores in core-integrations (see footnote 1)
  • Refactor AzureOCRDocumentConverter to remove internal usage of dataframe (this can be done at any time).
  • Remove dataframe field and ExtractedTableAnswer (in 2.11 release?)
  • Make pandas dependency optional (in 2.11 release?)

Footnotes

  1. This may be best addressed after deprecating dataframe, but could also be done earlier. Support across integrations is highly varied and often inconsistent, making this the most significant task.
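
The deprecation and optional-dependency steps in the plan could look roughly like the sketch below. This is only an illustration under stated assumptions: `SketchDocument` and `import_optional` are hypothetical names, not the actual Haystack implementation, and the warning text is invented for the example.

```python
import importlib
import warnings
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in, not the real Haystack Document."""

    content: Optional[str] = None
    dataframe: Optional[Any] = None  # deprecated in this sketch

    def __post_init__(self):
        # Warn only when the deprecated field is actually used.
        if self.dataframe is not None:
            warnings.warn(
                "The `dataframe` field is deprecated and will be removed in a "
                "future release; store tabular data as text in `content` instead.",
                DeprecationWarning,
                stacklevel=3,
            )


def import_optional(module_name: str):
    """Import a module only when needed, with a clear error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install {module_name}"
        ) from exc
```

With this pattern, code paths that still need pandas would call `import_optional("pandas")` at the point of use, so a plain install without pandas keeps working for everything else.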

@anakin87
Member Author

Investigation done.
I created another issue to keep track of the plan: #8738.
