feat: Add component CSVDocumentCleaner for removing empty rows and columns #8816

sjrl · 2025-02-05T12:15:09Z

Related Issues

fixes Create a CSV Document cleaner component #8783

Proposed Changes:

Adds a new component called CSVDocumentCleaner.

This component is for cleaning CSV documents by removing empty rows and columns. Removing empty rows and columns can help reduce unnecessary token usage when sending this CSV document to an LLM in a RAG pipeline.

It allows for the optional ignoring of a specified number of rows and columns before performing the cleaning operation. This is relevant if you want to ignore header rows or columns (e.g. the header of an Excel file) before removing empty rows or columns.
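As a rough illustration of the behaviour described above, here is a minimal sketch using only the standard library. It is not the component's implementation (the PR itself uses pandas). The function name `clean_csv`, its parameters `ignore_rows` and `ignore_columns`, and the choice to always keep the ignored rows and columns are assumptions made for this example.

```python
import csv
from io import StringIO
from typing import List


def clean_csv(text: str, ignore_rows: int = 0, ignore_columns: int = 0) -> str:
    """Drop fully empty rows and columns from CSV text (hypothetical helper).

    The first `ignore_rows` rows and `ignore_columns` columns are always kept
    and are not consulted when deciding whether a row or column is empty.
    """
    rows: List[List[str]] = list(csv.reader(StringIO(text)))
    if not rows:
        return text
    width = max(len(row) for row in rows)
    # Pad ragged rows so every row has the same number of cells.
    grid = [row + [""] * (width - len(row)) for row in rows]

    def is_empty(cells: List[str]) -> bool:
        return all(cell.strip() == "" for cell in cells)

    # Keep a row if it is protected or has content outside the ignored columns.
    keep_rows = [
        i for i, row in enumerate(grid)
        if i < ignore_rows or not is_empty(row[ignore_columns:])
    ]
    # Keep a column if it is protected or has content outside the ignored rows.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns or not is_empty([row[j] for row in grid[ignore_rows:]])
    ]

    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([grid[i][j] for j in keep_cols])
    return out.getvalue()
```

With `ignore_rows=1`, a title row that only fills its first cell is kept as-is, while the emptiness check for columns only looks at the rows below it.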

How did you test it?

Added tests.

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2025-02-05T12:20:46Z

Pull Request Test Coverage Report for Build 13157297160

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 91.299%

Totals
Change from base Build 13141979385:	0.0%
Covered Lines:	8993
Relevant Lines:	9850

💛 - Coveralls

davidsbatista · 2025-02-05T17:40:09Z

haystack/components/preprocessors/__init__.py

 from .document_cleaner import DocumentCleaner
 from .document_splitter import DocumentSplitter
 from .recursive_splitter import RecursiveDocumentSplitter
 from .text_cleaner import TextCleaner

-__all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner"]
+__all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner", "CSVDocumentCleaner"]


(nit) suggestion: let's keep this list ordered alphabetically; as it grows, that makes it easier to locate components.

davidsbatista · 2025-02-05T17:41:27Z

haystack/components/preprocessors/csv_document_cleaner.py

+from io import StringIO
+from typing import Dict, List
+
+import pandas as pd


Since we are probably removing pandas as a hard dependency, we should start importing it lazily.
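For illustration, one generic way to defer an optional import is sketched below using only `importlib`. The `LazyModule` class and its `load` method are hypothetical names for this example and do not reproduce any Haystack utility. The point is only that the module is imported on first use, and a missing dependency produces an actionable error at that moment instead of at import time of the package.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyModule:
    """Defer importing a module until it is first needed (hypothetical helper)."""

    def __init__(self, name: str, install_hint: str) -> None:
        self._name = name
        self._install_hint = install_hint
        self._module: Optional[ModuleType] = None

    def load(self) -> ModuleType:
        # Import on first access only; later calls reuse the cached module.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise ImportError(
                    f"'{self._name}' is required here. {self._install_hint}"
                ) from exc
        return self._module


# Creating the wrapper does not import pandas; only load() would.
pandas_lazy = LazyModule("pandas", "Run 'pip install pandas'.")
```

A component could then call `pandas_lazy.load()` inside its `__init__` or `run` method, so users who never touch the component do not need pandas installed.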

davidsbatista · 2025-02-05T17:55:22Z

releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml

+---
+features:
+  - |
+    Introduced CSVDocumentCleaner component for cleaning CSV documents.


Suggested change

-    Introduced CSVDocumentCleaner component for cleaning CSV documents.
+    Introduced `CSVDocumentCleaner` component for cleaning CSV documents.

davidsbatista · 2025-02-05T18:10:00Z

haystack/components/preprocessors/csv_document_cleaner.py

+                cleaned_documents.append(document)
+                continue
+
+            # Save ignored rows


I would refactor the "save ignored rows" logic (and also the "save ignored columns" logic) into helper functions, to keep this for loop concise.

def _handle_ignored_rows(data_frame) -> Tuple[df_saved_rows | None, bool]:

It would return the saved rows/columns (or None) together with True/False, depending on whether the number of rows/columns to ignore is smaller or bigger than the total number of rows/columns.

Both use cases (row and columns) can even be handled by a single function.
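A minimal sketch of what such a single helper could look like is given below. To stay self-contained it works on a plain list of rows instead of a pandas DataFrame. The name `handle_ignored`, the `axis` argument, and the exact return convention are assumptions for illustration, not the code that was merged.

```python
from typing import List, Optional, Tuple

Grid = List[List[str]]


def handle_ignored(grid: Grid, n: int, axis: str) -> Tuple[Optional[Grid], bool]:
    """Return (saved, ok) for the first `n` rows or columns (hypothetical helper).

    `ok` is False when `n` exceeds the size of the grid along `axis`, in which
    case the caller can skip cleaning. `saved` is None when nothing is saved.
    """
    if axis not in ("rows", "columns"):
        raise ValueError("axis must be 'rows' or 'columns'")
    if axis == "rows":
        size = len(grid)
    else:
        size = len(grid[0]) if grid else 0
    if n > size:
        return None, False
    if n == 0:
        return None, True
    if axis == "rows":
        # Copy the first n rows so later edits to the grid do not affect them.
        return [row[:] for row in grid[:n]], True
    # Take the first n cells of every row.
    return [row[:n] for row in grid], True
```

The loop body then reduces to two calls, one per axis, followed by an early `continue` when either call reports `ok` as False.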

sjrl added 2 commits February 5, 2025 12:49

Initial commit for csv cleaner

bc9cde2

Add release notes

db4154c

sjrl requested review from a team as code owners February 5, 2025 12:15

sjrl requested review from dfokina and vblagoje and removed request for a team February 5, 2025 12:15

github-actions bot added topic:tests type:documentation Improvements on the docs labels Feb 5, 2025

Update lineterminator

30573c8

davidsbatista reviewed Feb 5, 2025
