Upsert Compaction Executor: CRC mismatch robustness #13491

tibrewalpratik17 · 2024-06-27T05:51:29Z

During the execution of the Upsert Compaction task, we perform a three-way equality check of CRCs from three sources of truth (Ref):

Segment ZK Metadata CRC: This is pushed as a task-config from generator to executor.
ValidDocID Bitmap CRC: We fetch the validDocID bitmap for the segment from one of the replica servers. In the response, we also get the segment CRC for that node.
Segment CRC (deepstore/server): The minion task execution fetches the segment from deepstore first, and if that fails, it fetches from one of the replica servers. The zipped segment also contains the CRC info.

Recently, we found segments that were not compacted after being queued by the task generator, even though the task was marked as successful. On further investigation, we found WARN entries like this in the log:

CRC mismatch for segment: {tablename}__4__308__20240621T2134Z, expected: 276873240, actual crc from server: 1952168594
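The three-way check described above can be sketched roughly as follows. This is a minimal illustration, not the executor's actual code: the class, enum, and method names are hypothetical, and the order of the two comparisons is an assumption.

```java
// Hypothetical sketch: names are illustrative and not taken from the Pinot codebase.
final class CrcCheckSketch {
  enum Outcome { CONSISTENT, BITMAP_CRC_MISMATCH, SEGMENT_COPY_CRC_MISMATCH }

  // zkCrc:         CRC from segment ZK metadata (passed in the task config).
  // bitmapCrc:     segment CRC returned with the validDocID bitmap by a replica server.
  // downloadedCrc: CRC read from the segment fetched from deepstore (or a replica server).
  static Outcome classify(long zkCrc, long bitmapCrc, long downloadedCrc) {
    if (zkCrc != downloadedCrc) {
      // The downloaded segment copy disagrees with ZK metadata.
      return Outcome.SEGMENT_COPY_CRC_MISMATCH;
    }
    if (zkCrc != bitmapCrc) {
      // The replica that served the bitmap holds a different build of the segment.
      return Outcome.BITMAP_CRC_MISMATCH;
    }
    return Outcome.CONSISTENT;
  }
}
```

For example, `CrcCheckSketch.classify(276873240L, 1952168594L, 276873240L)` reports a bitmap mismatch, using the two CRC values from the log line purely as sample inputs.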

There are several scenarios that can lead to this situation, all involving replicas having different CRCs. If the replicas didn't have different CRCs, this issue would not arise at all.

Scenario 1: Segment ZK Metadata CRC = Segment CRC deepstore != ValidDocID Bitmap CRC

The leader server uploads to ZK metadata and deepstore, but the minion fetches the ValidDocID bitmap from a different server.
In this scenario, the ZK metadata CRC and the deepstore CRC match. During minion task execution, we fetch the validDocID bitmap from one of the replica servers. If that server was not the one that uploaded to ZK and deepstore during segment commit, and its CRC differs, the check fails with an inequality.
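One way to tolerate this case is to look for a replica whose CRC matches ZK metadata instead of relying on a single arbitrary replica. The sketch below assumes the minion can obtain the segment CRC reported by each replica; the class and method names are hypothetical, and this is not a description of how any Pinot fix is implemented.

```java
// Hypothetical sketch: assumes per-replica CRCs are already available as an array.
final class ReplicaPickSketch {
  // Returns the index of the first replica whose reported segment CRC equals
  // the ZK metadata CRC, or -1 when no replica matches.
  static int firstMatchingReplica(long zkCrc, long[] replicaCrcs) {
    for (int i = 0; i < replicaCrcs.length; i++) {
      if (replicaCrcs[i] == zkCrc) {
        return i;
      }
    }
    return -1;
  }
}
```

A return value of -1 means no replica agrees with ZK metadata, in which case the task would still have to skip or fail for that segment.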

Scenario 2: Segment ZK Metadata CRC != Segment CRC deepstore

I'm not entirely sure about all the cases where this scenario would occur, but thinking out loud, it seems this might happen during a deepstore-upload-retry task. In the deepstore-upload-retry task, we randomly choose a replica server to upload to deepstore. If the chosen replica server has a different CRC from the one in segment ZK metadata, we may encounter this issue.
This can also happen when both replicas update ZK and upload to deepstore during segment-commit. Say one replica updates ZK first but is slower in uploading to deepstore, while the other replica updates ZK afterwards and finishes its deepstore upload first. In this scenario, ZK will hold the second replica's CRC, but deepstore will hold the first replica's segment.
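The interleaving in the segment-commit case can be replayed deterministically. This sketch assumes that both the ZK metadata update and the deepstore upload behave as last-writer-wins; the class and its methods are hypothetical stand-ins, not Pinot code.

```java
// Hypothetical sketch: models ZK metadata and deepstore as two last-writer-wins slots.
final class CommitRaceSketch {
  long zkCrc;        // last CRC written to segment ZK metadata
  long deepstoreCrc; // CRC of the last segment copy uploaded to deepstore

  void updateZk(long replicaCrc) {
    zkCrc = replicaCrc;
  }

  void uploadToDeepstore(long replicaCrc) {
    deepstoreCrc = replicaCrc;
  }

  // Replays the interleaving described above and returns the final state.
  static CommitRaceSketch replay(long firstReplicaCrc, long secondReplicaCrc) {
    CommitRaceSketch state = new CommitRaceSketch();
    state.updateZk(firstReplicaCrc);           // first replica updates ZK
    state.updateZk(secondReplicaCrc);          // second replica updates ZK, overwriting it
    state.uploadToDeepstore(secondReplicaCrc); // second replica finishes its upload first
    state.uploadToDeepstore(firstReplicaCrc);  // slower first replica uploads last
    return state;
  }
}
```

After `replay(a, b)`, `zkCrc` equals `b` and `deepstoreCrc` equals `a`, which is exactly the Segment ZK Metadata CRC != Segment CRC deepstore condition whenever the two replicas built segments with different CRCs.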


tibrewalpratik17 · 2024-07-12T15:51:15Z

Scenario 1 is resolved by #13489.
For scenario 2, we should solve it fundamentally so that Segment ZK Metadata CRC != Segment CRC deepstore cannot happen at all. I will look into the cases where we encounter this. If the solutions turn out to be intrusive, then specifically for upsert-compaction, I am planning to introduce a skipCrcMismatch config as a temporary workaround.
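A flag of this kind could gate the check roughly as shown below. The semantics are assumed for illustration only: the flag is read as a raw string from the task config, it defaults to strict checking when absent, and the class and method names are hypothetical.

```java
// Hypothetical sketch: how a skip/ignore-CRC-mismatch flag might gate compaction.
final class CrcGateSketch {
  // flagValue: raw string from the task config; null or anything other than
  // "true" (case-insensitive) keeps the strict check.
  static boolean shouldCompact(String flagValue, long expectedCrc, long actualCrc) {
    boolean ignoreMismatch = Boolean.parseBoolean(flagValue); // null parses to false
    return ignoreMismatch || expectedCrc == actualCrc;
  }
}
```

Defaulting to the strict check keeps existing behavior unchanged for tables that never set the flag, which fits the stated intent of using it only as a temporary workaround.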

dang-stripe · 2025-01-09T19:40:24Z

@tibrewalpratik17 we hit a case for scenario 2 here: #14786

tibrewalpratik17 added the enhancement, upsert, and minion labels on Jun 27, 2024

Referenced on Jun 27, 2024:
Upsert Compaction: CRC mismatch robustness #13493 (Open)
Make upsert compaction task more robust to crc mismatch #13489 (Merged)

Mentioned by tibrewalpratik17 on Oct 8, 2024:
add logs to debug why crc values are different upon same input data #14188 (Merged)

Referenced on Nov 15, 2024:
Minion Task to support automatic Segment Refresh #14300 (Merged)
Fix crc mismatch during deepstore upload retry task #14506 (Merged)

Mentioned by tibrewalpratik17 on Dec 5, 2024:
Upsert small segment merger task in minions #14477 (Merged)

Mentioned by tibrewalpratik17 on Dec 16, 2024:
Add config for ignoreCrcMismatch for upsert-compaction task #14668 (Merged)
