fix: warn on empty/whitespace content in InMemoryDocumentStore.write_documents #11542

Closed
michaelxer wants to merge 1 commit into deepset-ai:main from michaelxer:michaelxer/warn-empty-content-docs-20260607
@michaelxer

Related Issues

Proposed Changes

InMemoryDocumentStore.write_documents() now logs a warning when a document has empty or whitespace-only content. Previously, these documents were silently stored and would appear in BM25 retrieval results with meaningless scores, since they produce zero tokens during BM25 tokenization.
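To illustrate why such documents produce zero tokens, here is a minimal sketch of regex-based tokenization. The pattern below is an assumption (it is believed to match the store's default BM25 tokenization regex, but the PR does not show it), and `tokenize` is a hypothetical helper, not a Haystack function.

```python
import re

# Assumed pattern: word tokens of two or more characters.
# The store's actual regex is configurable and not shown in this PR.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> list:
    """Hypothetical helper: lowercase the text and extract word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Empty and whitespace-only strings yield no tokens at all,
# so no query term can ever match them.
for content in ["", "   ", "\n\t", "BM25 ranks documents"]:
    print(repr(content), "->", tokenize(content))
```

A document with an empty token list contributes nothing a query term can match, which is why any score it receives carries no relevance signal.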

The warning is non-breaking — documents are still stored as before. This helps users identify the issue early without changing existing behavior.
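The warn-but-still-store behavior described above could look roughly like the following. This is a standalone sketch, not the PR's diff: `Doc`, `has_blank_content`, the dict-backed store, and the warning text are all hypothetical. Whether the real change also warns when `content` is `None` is not stated, so this sketch only checks string content.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("in_memory_store_sketch")


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (id and content only)."""

    id: str
    content: Optional[str] = None


def has_blank_content(doc: Doc) -> bool:
    """True when content is a string that is empty or whitespace-only."""
    return doc.content is not None and doc.content.strip() == ""


def write_documents(store: dict, documents: list) -> int:
    """Store every document; warn on blank content but never reject it."""
    for doc in documents:
        if has_blank_content(doc):
            logger.warning(
                "Document '%s' has empty or whitespace-only content; "
                "it will not be meaningfully retrievable with BM25.",
                doc.id,
            )
        store[doc.id] = doc  # stored regardless of the warning
    return len(documents)
```

Because the check only logs, callers that rely on metadata-only documents see the same return value and the same stored state as before.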

How did you test it?

  • Ran the full InMemoryDocumentStore test suite: hatch run test:unit test/document_stores/test_in_memory.py — 148 passed, 4 skipped
  • Added a new test test_write_documents_warns_on_empty_content that verifies the warning is logged for empty and whitespace-only content, and not logged for valid content
  • Ran pre-commit hooks (ruff check, ruff format, codespell) — all pass
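The PR's actual test is not shown here. As a rough idea of what a test like `test_write_documents_warns_on_empty_content` checks, the sketch below uses the standard library's `unittest` against a hypothetical stub (`write_content`) rather than the real store or the project's pytest setup.

```python
import logging
import unittest

logger = logging.getLogger("store_stub")


def write_content(content: str) -> None:
    """Hypothetical stub mirroring the described behavior: warn on blank content."""
    if content.strip() == "":
        logger.warning("Document has empty or whitespace-only content")


class TestWarnsOnEmptyContent(unittest.TestCase):
    def test_warns_for_empty_and_whitespace(self):
        for content in ("", "   ", "\n\t"):
            # assertLogs fails unless at least one WARNING record is emitted
            with self.assertLogs(logger, level="WARNING") as captured:
                write_content(content)
            self.assertEqual(len(captured.records), 1)

    def test_no_warning_for_valid_content(self):
        # Capture records with a plain handler to show that nothing is logged
        records = []
        handler = logging.Handler()
        handler.emit = records.append
        logger.addHandler(handler)
        try:
            write_content("hello world")
        finally:
            logger.removeHandler(handler)
        self.assertEqual(records, [])
```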

Notes for the reviewer

The warning is logged at WARNING level using the existing logger instance. Documents are still stored regardless of content — this is intentional to avoid breaking existing pipelines that might use metadata-only documents.

The async path (write_documents_async) delegates to write_documents, so it's covered automatically.
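The delegation argument can be sketched as follows. `TinyStore` is a hypothetical class, and running the sync method via `asyncio.to_thread` is an assumption about the mechanism; the point is only that a check placed in the sync method also fires on the async path.

```python
import asyncio
import logging

logger = logging.getLogger("async_store_sketch")


class TinyStore:
    """Hypothetical store: maps document id to content."""

    def __init__(self) -> None:
        self.storage = {}

    def write_documents(self, documents: dict) -> int:
        for doc_id, content in documents.items():
            if isinstance(content, str) and not content.strip():
                logger.warning(
                    "Document '%s' has empty or whitespace-only content", doc_id
                )
            self.storage[doc_id] = content  # stored either way
        return len(documents)

    async def write_documents_async(self, documents: dict) -> int:
        # Delegates to the sync method, so the warning logic is shared.
        return await asyncio.to_thread(self.write_documents, documents)
```

Calling `asyncio.run(TinyStore().write_documents_async({"a": " "}))` goes through the same check as a direct `write_documents` call.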

…documents

Documents with empty or whitespace-only content are stored but never
retrieved meaningfully by BM25 retrieval, since they produce zero tokens.
Add a warning log when such documents are written to help users identify
the issue early.

Fixes deepset-ai#11541
@michaelxer michaelxer requested a review from a team as a code owner June 7, 2026 14:50
@michaelxer michaelxer requested review from davidsbatista and removed request for a team June 7, 2026 14:50
@vercel

vercel Bot commented Jun 7, 2026

@michaelxer is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Jun 7, 2026

CLA assistant check
All committers have signed the CLA.

@davidsbatista

Contributor

See #11541

Development

Successfully merging this pull request may close these issues.

InMemoryDocumentStore silently accepts empty/whitespace-only content documents which pollute BM25 retrieval results

3 participants