Bug: Duplicate Text (Chunks) in Payloads Due to Shared Metadata Reference in build_payloads Function #88

suryadevarapranav · 2024-11-03T05:13:20Z

Description of the Issue:

The build_payloads function here is intended to generate a unique payload for each document chunk. Currently, however, every payload contains the same text (the last chunk in doc.chunks), even though the IDs are unique. The cause is that doc.metadata is referenced directly and updated on each iteration, so all payloads share the same mutated metadata dictionary.

See the example in the attached screenshot.

Steps to Reproduce:

1. Call the build_payloads function with a Document object containing multiple chunks.
2. Observe that the payloads list contains different IDs but identical text (matching the last chunk).
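
The steps above can be sketched as a minimal reproduction. The Document class and build_payloads_buggy below are hypothetical stand-ins reconstructed from this description, not the project's actual code; the MD5-based chunk IDs are likewise an assumption made only for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the project's Document class."""
    chunks: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def build_payloads_buggy(doc):
    """Assumed shape of the buggy loop: every payload aliases doc.metadata."""
    ids, payloads = [], []
    for chunk in doc.chunks:
        payload = doc.metadata            # no copy: same dict object each iteration
        payload.update({"text": chunk})   # overwrites "text" on the shared dict
        ids.append(hashlib.md5(chunk.encode()).hexdigest())  # assumed ID scheme
        payloads.append(payload)
    return ids, payloads


doc = Document(
    chunks=["first chunk", "second chunk", "third chunk"],
    metadata={"date": "2024-01-01T13:15:32+00:00"},
)
ids, payloads = build_payloads_buggy(doc)

print(len(set(ids)))                  # number of distinct IDs
print({p["text"] for p in payloads})  # set of distinct texts across payloads
```

With this sketch, the IDs differ per chunk while every entry in payloads is the same dictionary object, so only the last chunk's text survives.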

Expected Behavior: Each payload should contain the unique text for its corresponding chunk, along with the associated metadata.

Actual Behavior: All payloads contain the same text, resulting in incorrect data.

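Additional Context: Python dictionaries are mutable, and the assignment payload = doc.metadata does not create a copy; it binds payload to the same dictionary object. Each update to payload therefore modifies doc.metadata in place, and every entry in the payloads list refers to that one dictionary, which ends up holding the text of the last chunk.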
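
One possible fix, in line with the explanation above, is to copy the metadata for each chunk before adding the text. This is a minimal sketch, not the actual patch: build_payloads_fixed, the SimpleNamespace stand-in for Document, and the MD5-based IDs are all assumptions made for illustration.

```python
import copy
import hashlib
from types import SimpleNamespace


def build_payloads_fixed(doc):
    """Sketch of a fix: each payload gets its own copy of the metadata."""
    ids, payloads = [], []
    for chunk in doc.chunks:
        payload = copy.deepcopy(doc.metadata)  # independent dict per chunk
        payload["text"] = chunk                # doc.metadata is left untouched
        ids.append(hashlib.md5(chunk.encode()).hexdigest())  # assumed ID scheme
        payloads.append(payload)
    return ids, payloads


# Hypothetical stand-in for a Document with two chunks.
fixed_doc = SimpleNamespace(
    chunks=["first chunk", "second chunk"],
    metadata={"date": "2024-01-01T13:15:32+00:00"},
)
fixed_ids, fixed_payloads = build_payloads_fixed(fixed_doc)

for chunk_id, payload in zip(fixed_ids, fixed_payloads):
    print(chunk_id, payload["text"])
```

A shallow copy (dict(doc.metadata) or doc.metadata.copy()) would be enough if only the top-level "text" key is added; copy.deepcopy is the more defensive choice if the metadata holds nested mutable values that might be modified later.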

Example

News Document from Alpaca

Payload Uploaded to Qdrant

Qdrant Query used to identify the issue.

POST collections/alpaca_news/points/scroll
{
  "filter": {
    "must": [
      {
        "key": "date",
        "match": {
          "text": "2024-01-01T13:15:32+00:00"
        }
      }
    ]
  }
}


suryadevarapranav · 2024-11-03T05:34:37Z

@iusztinpaul Hi Paul! I believe I've found a small bug that could be crucial during the final LLM prediction phase. Could you please take a look at my PR?

suryadevarapranav mentioned this issue Nov 3, 2024

Fix Payload Duplication Bug in build_payloads Function by Copying Metadata for Each Chunk #89

Open
