This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

Bug: Duplicate Text (Chunks) in Payloads Due to Shared Metadata Reference in build_payloads Function #88

Open
suryadevarapranav opened this issue Nov 3, 2024 · 1 comment


Description of the Issue:

The build_payloads function is intended to generate a unique payload for each document chunk. Currently, however, every payload contains the same text (the last chunk in doc.chunks) even though the IDs are unique. The cause is that doc.metadata is referenced directly and updated on each iteration, so all payloads share the same modified metadata dictionary.

See the example in the attached screenshot.

Steps to Reproduce:

1. Call the build_payloads function with a Document object containing multiple chunks.
2. Observe that the payloads list contains different IDs but identical text (matching the last chunk).
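
The steps above can be sketched as a minimal, self-contained reproduction. The Document class and build_payloads_buggy function below are hypothetical stand-ins, not the repository's actual code; they only mirror the pattern described in this issue (assigning doc.metadata to payload and updating it inside the loop).

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for the repository's Document class.
    id: str
    chunks: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def build_payloads_buggy(doc):
    # Assumed shape of the buggy loop, reconstructed from the issue text.
    payloads, ids = [], []
    for i, chunk in enumerate(doc.chunks):
        payload = doc.metadata            # alias to the same dict, not a copy
        payload.update({"text": chunk})   # overwrites "text" on the shared dict
        payloads.append(payload)
        ids.append(f"{doc.id}-{i}")       # illustrative ID scheme
    return payloads, ids


doc = Document(
    id="news-1",
    chunks=["first chunk", "second chunk", "third chunk"],
    metadata={"date": "2024-01-01T13:15:32+00:00"},
)
payloads, ids = build_payloads_buggy(doc)
texts = [p["text"] for p in payloads]
print(ids)    # three distinct IDs
print(texts)  # the last chunk's text, repeated three times
```

Every element of payloads is the same dictionary object as doc.metadata, which is why only the text written last survives.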

Expected Behavior: Each payload should contain the unique text for its corresponding chunk, along with the associated metadata.

Actual Behavior: All payloads contain the same text, resulting in incorrect data.

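Additional Context: Python dictionaries are mutable, and the assignment payload = doc.metadata does not copy the dictionary; it binds payload to the same object. Every subsequent update to payload therefore modifies doc.metadata in place, and every entry in the payloads list points to that one dictionary.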
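
One possible fix, sketched under assumptions, is to copy the metadata for each chunk before adding the text. The function name build_payloads_fixed and its signature are hypothetical and may differ from the change proposed in the PR; copy.deepcopy is used here so that nested metadata values are not shared either, although a shallow copy such as dict(metadata) would be enough for flat metadata.

```python
import copy


def build_payloads_fixed(chunks, metadata):
    # Hypothetical helper: builds one independent payload per chunk.
    payloads = []
    for chunk in chunks:
        payload = copy.deepcopy(metadata)  # fresh dict for every chunk
        payload["text"] = chunk
        payloads.append(payload)
    return payloads


metadata = {"date": "2024-01-01T13:15:32+00:00"}
payloads = build_payloads_fixed(["first chunk", "second chunk"], metadata)
fixed_texts = [p["text"] for p in payloads]
print(fixed_texts)         # each payload keeps its own chunk text
print("text" in metadata)  # the original metadata is left unmodified
```

With this approach the original metadata dictionary is never mutated, so it can be reused safely after the payloads are built.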

Example

News Document from Alpaca

image

Payload Uploaded to Qdrant

image

Qdrant query used to identify the issue:

POST collections/alpaca_news/points/scroll
{
  "filter": {
    "must": [
      {
        "key": "date",
        "match": {
          "text": "2024-01-01T13:15:32+00:00"
        }
      }
    ]
  }
}

suryadevarapranav commented Nov 3, 2024

@iusztinpaul Hi Paul! I believe I've found a small bug that could be crucial during the final LLM prediction phase. Could you please take a look at my PR?
