feat: Add component CSVDocumentCleaner for removing empty rows and columns #8816

Open
wants to merge 3 commits into
base: main
from

Conversation

sjrl
Contributor

@sjrl sjrl commented Feb 5, 2025

Related Issues

Proposed Changes:

Adds a new component called CSVDocumentCleaner.

This component cleans CSV documents by removing empty rows and columns, which can reduce unnecessary token usage when the CSV document is sent to an LLM in a RAG pipeline.

It can optionally ignore a specified number of rows and columns before cleaning. This is useful if you want to skip header rows or columns (e.g. the header of an Excel file) before removing empty rows or columns.
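
To make the described behavior concrete, here is a minimal sketch using only the standard library `csv` module rather than pandas. It is not the code from this PR: the function name `clean_csv` and the parameters `ignore_rows` and `ignore_columns` are hypothetical, and it assumes that ignored rows and columns are always kept and are left out of the emptiness check.

```python
import csv
from io import StringIO
from typing import List


def clean_csv(content: str, ignore_rows: int = 0, ignore_columns: int = 0) -> str:
    """Drop fully empty rows and columns from CSV text.

    Hypothetical helper: the first `ignore_rows` rows and the first
    `ignore_columns` columns are always kept and are not inspected
    when deciding whether a row or column is empty.
    """
    rows: List[List[str]] = list(csv.reader(StringIO(content)))
    if not rows:
        return content

    # Pad ragged rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    def is_empty(cells: List[str]) -> bool:
        return all(cell.strip() == "" for cell in cells)

    # Keep a row if it is protected or has a value outside the ignored columns.
    keep_rows = [
        i for i, row in enumerate(rows)
        if i < ignore_rows or not is_empty(row[ignore_columns:])
    ]
    # Keep a column if it is protected or has a value outside the ignored rows.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns or not is_empty([row[j] for row in rows[ignore_rows:]])
    ]

    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([rows[i][j] for j in keep_cols])
    return out.getvalue()
```

With `ignore_rows=1`, a title row that has only one filled cell survives even though the empty column it spans is removed from the rest of the table.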

How did you test it?

Added tests.

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@sjrl sjrl requested review from a team as code owners February 5, 2025 12:15
@sjrl sjrl requested review from dfokina and vblagoje and removed request for a team February 5, 2025 12:15
@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Feb 5, 2025
@coveralls
Collaborator

coveralls commented Feb 5, 2025

Pull Request Test Coverage Report for Build 13157297160

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 91.299%

Totals Coverage Status
Change from base Build 13141979385: 0.0%
Covered Lines: 8993
Relevant Lines: 9850

💛 - Coveralls

from .document_cleaner import DocumentCleaner
from .document_splitter import DocumentSplitter
from .recursive_splitter import RecursiveDocumentSplitter
from .text_cleaner import TextCleaner

- __all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner"]
+ __all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner", "CSVDocumentCleaner"]
Contributor

(nit) suggestion: let's keep this list ordered alphabetically; as it grows, that makes it easier to locate components.

from io import StringIO
from typing import Dict, List

import pandas as pd
Contributor

Since we are probably removing pandas as a hard dependency, we should start importing it lazily.
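
For illustration, one generic way to defer an optional import until first use is sketched below with the standard library only. This is not the project's own mechanism: the class `LazyModule` and its `install_hint` argument are hypothetical names, and the project may prefer a different helper.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Hypothetical proxy that imports a module on first attribute access."""

    def __init__(self, name: str, install_hint: str = "") -> None:
        self._name = name
        self._install_hint = install_hint
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        # Import once and cache; raise a helpful error if the package is missing.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise ImportError(
                    f"'{self._name}' is required for this component. {self._install_hint}"
                ) from exc
        return self._module

    def __getattr__(self, attr: str) -> Any:
        # Only called for attributes not found on the proxy itself.
        return getattr(self._load(), attr)


# Nothing is imported here; the import happens when `pd.<something>` is first used.
pd = LazyModule("pandas", install_hint="Run 'pip install pandas'.")
```

With this shape, importing the module that defines the component does not fail when pandas is absent; the error only appears once the component actually touches `pd`.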

---
features:
- |
Introduced CSVDocumentCleaner component for cleaning CSV documents.
Contributor

Suggested change
- Introduced CSVDocumentCleaner component for cleaning CSV documents.
+ Introduced `CSVDocumentCleaner` component for cleaning CSV documents.

cleaned_documents.append(document)
continue

# Save ignored rows
Contributor

@davidsbatista davidsbatista Feb 5, 2025

I would refactor the "save ignored rows" logic (and also the "save ignored columns" logic) into helper functions, to keep this for loop concise.

def _handle_ignored_rows(data_frame: pd.DataFrame) -> Tuple[Optional[pd.DataFrame], bool]: ...

It would return the saved rows/columns (or None), plus True/False depending on whether the number of rows/columns to ignore is smaller or larger than the total number of rows/columns.

Both use cases (row and columns) can even be handled by a single function.
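
A rough sketch of such a single helper follows. To stay dependency-free it works on a list of rows instead of a pandas DataFrame, so it only illustrates the shape of the suggestion. The name `handle_ignored`, the `axis` argument, and the exact meaning of the boolean (True when the requested count fits within the table) are assumptions, not part of the PR.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]


def handle_ignored(table: Table, n: int, axis: int) -> Tuple[Optional[Table], bool]:
    """Return (saved, ok) for the first `n` rows (axis=0) or columns (axis=1).

    Hypothetical helper: `saved` is None when nothing needs saving or when
    `n` exceeds the table size along `axis`; `ok` is False only in the
    latter case, so the caller can skip cleaning that document.
    """
    if axis == 0:
        size = len(table)
    else:
        size = len(table[0]) if table else 0

    if n > size:
        return None, False  # more rows/columns to ignore than exist
    if n == 0:
        return None, True   # nothing to save, cleaning can proceed
    if axis == 0:
        return [row[:] for row in table[:n]], True
    return [row[:n] for row in table], True
```

The caller's loop then reduces to two calls, one per axis, followed by a check of the returned flags before cleaning.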

Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a CSV Document cleaner component
3 participants