feat: Add component CSVDocumentCleaner for removing empty rows and columns #8816
Pull Request Test Coverage Report for Build 13157297160 (Coveralls)
from .document_cleaner import DocumentCleaner
from .document_splitter import DocumentSplitter
from .recursive_splitter import RecursiveDocumentSplitter
from .text_cleaner import TextCleaner

- __all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner"]
+ __all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner", "CSVDocumentCleaner"]
(nit) suggestion: let's keep this list ordered alphabetically. As it grows, that makes it easier to locate components.
from io import StringIO
from typing import Dict, List

import pandas as pd
Since we are probably removing pandas as a hard dependency, we should start importing it lazily.
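For illustration, here is a minimal sketch of what a lazy import could look like using only the standard library. The names `lazy_import` and `parse_csv` are hypothetical and not part of this PR; the project may well have its own helper for this, in which case that helper should be preferred.

```python
import importlib
from io import StringIO
from types import ModuleType


def lazy_import(module_name: str, install_hint: str) -> ModuleType:
    """Import `module_name` on demand; raise a helpful error if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(f"Could not import '{module_name}'. {install_hint}") from err


def parse_csv(text: str):
    """Parse CSV text into a DataFrame without importing pandas at module load."""
    # pandas is imported only when this function is called, so the module that
    # defines it can still be imported when pandas is not installed.
    pd = lazy_import("pandas", "Run 'pip install pandas'.")
    return pd.read_csv(StringIO(text), header=None, dtype=object)
```

With this shape, a missing pandas installation only surfaces as an error when the CSV code path is actually used, not when the package is imported.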
---
features:
  - |
    Introduced CSVDocumentCleaner component for cleaning CSV documents.
- Introduced CSVDocumentCleaner component for cleaning CSV documents.
+ Introduced `CSVDocumentCleaner` component for cleaning CSV documents.
cleaned_documents.append(document)
continue

# Save ignored rows
I would refactor the "save ignored rows" (and also "save ignored columns") logic into helper functions, to keep this for loop concise.
def _handle_ignored_rows(data_frame) -> Tuple[df_saved_rows | None, bool]:
It would return the saved rows/columns (or None) and True/False, depending on whether the number of rows/columns to ignore is smaller or bigger than the total number of rows/columns.
Both use cases (rows and columns) can even be handled by a single function.
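A rough sketch of such a single helper follows, under some assumptions. It uses plain lists of cells as a stand-in for the DataFrame to stay dependency-free, and the name `handle_ignored`, the `axis` parameter, and the exact meaning of the boolean (False means "more to ignore than exists, so skip cleaning this document") are illustrative guesses, not the PR's code.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]  # stand-in for a DataFrame: a list of rows of cells


def handle_ignored(table: Table, n_ignore: int, axis: str) -> Tuple[Optional[Table], bool]:
    """Return (saved rows/columns or None, whether cleaning can proceed).

    Hypothetical helper covering both the row and the column case.
    """
    if axis not in ("rows", "columns"):
        raise ValueError("axis must be 'rows' or 'columns'")

    if axis == "rows":
        total = len(table)
    else:
        total = len(table[0]) if table else 0

    # Asked to ignore more than exists: signal the caller to skip this document.
    if n_ignore > total:
        return None, False
    # Nothing to ignore: nothing to save, cleaning can proceed.
    if n_ignore == 0:
        return None, True

    if axis == "rows":
        saved = [list(row) for row in table[:n_ignore]]
    else:
        saved = [list(row[:n_ignore]) for row in table]
    return saved, True
```

The loop body could then call the helper once per axis and `continue` as soon as either call returns False.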
Related Issues
Proposed Changes:
Adds a new component called `CSVDocumentCleaner`. This component cleans CSV documents by removing empty rows and columns. Removing empty rows and columns can help reduce unnecessary token usage when sending the CSV document to an LLM in a RAG pipeline.
It allows for the optional ignoring of a specified number of rows and columns before performing the cleaning operation. This is relevant if you want to ignore header rows or columns (e.g. the header of an Excel file) before removing empty rows or columns.
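To make the described behavior concrete, here is a standalone sketch using only the standard library `csv` module. It is not the component's implementation. The function name `clean_csv`, its parameters, and the specific rules (ignored rows and columns are always kept; a cell counts as empty if it is blank after stripping whitespace; the input is returned unchanged when nothing would remain to inspect) are assumptions made for illustration.

```python
import csv
from io import StringIO
from typing import List


def clean_csv(content: str, ignore_rows: int = 0, ignore_columns: int = 0) -> str:
    """Drop fully empty rows and columns, leaving the first N rows/columns alone."""
    rows: List[List[str]] = list(csv.reader(StringIO(content)))
    if not rows:
        return content

    # Pad ragged rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    # Assumption: if the ignored region covers everything, there is nothing to clean.
    if ignore_rows >= len(rows) or ignore_columns >= width:
        return content

    def is_empty(cell: str) -> bool:
        return cell.strip() == ""

    # Keep ignored rows unconditionally; keep other rows if any cell outside
    # the ignored columns has content.
    keep_rows = [
        i for i, row in enumerate(rows)
        if i < ignore_rows or not all(is_empty(c) for c in row[ignore_columns:])
    ]
    # Keep ignored columns unconditionally; keep other columns if any cell
    # outside the ignored rows has content.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns
        or not all(is_empty(rows[i][j]) for i in range(ignore_rows, len(rows)))
    ]

    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([rows[i][j] for j in keep_cols])
    return out.getvalue()
```

In this sketch, passing `ignore_rows=1` keeps a first row even when it is entirely empty, while the columns of that row are still trimmed to match the columns that survive in the rest of the table.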
How did you test it?
Added tests.
Notes for the reviewer
Checklist
`fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:` and added `!` in case the PR includes breaking changes.