Fix handling for empty record batches #1107

Conversation

SoMuchForSubtlety
Contributor

@SoMuchForSubtlety commented Mar 31, 2023

With magic v2 and above, it's possible to have empty record batches. This was previously not considered and could lead to panics because the code relied on the record count being non-zero in multiple places.

This PR contains lots of refactoring to make the code easier to follow and avoid unnecessary readHeader calls. Please let me know if you are unhappy with any of the changes I made.

I didn't manage to create a reliable test case, but I verified the fix manually. I'm open to suggestions for how to test this.

Fixes #874
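
For illustration only, here is a minimal sketch of the kind of guard this change needs; the recordBatch struct and nextOffset function below are placeholders and do not reflect kafka-go's actual internals.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder type standing in for the reader's view of a v2 record batch;
// the real kafka-go structures differ, this only illustrates the control flow.
type recordBatch struct {
	baseOffset  int64
	lastOffset  int64
	recordCount int32 // with magic v2, this can legitimately be 0
}

var errEmptyBatch = errors.New("empty record batch")

// nextOffset returns the offset to resume from after consuming a batch.
// Code that assumes recordCount > 0 and derives the next offset from the
// last record panics on an empty batch; the guard skips past it instead.
func nextOffset(b recordBatch) (int64, error) {
	if b.recordCount == 0 {
		// An empty batch still owns its offset range, so advance past it.
		return b.lastOffset + 1, errEmptyBatch
	}
	return b.lastOffset + 1, nil
}

func main() {
	empty := recordBatch{baseOffset: 3, lastOffset: 5, recordCount: 0}
	next, err := nextOffset(empty)
	fmt.Println(next, err) // 6 empty record batch: skip ahead rather than panic
}
```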

@SoMuchForSubtlety force-pushed the fix-empty-batch-panic branch 2 times, most recently from 957669f to e6ac6fa on April 3, 2023
@SoMuchForSubtlety
Contributor Author

SoMuchForSubtlety commented Apr 14, 2023

Any chance of getting a review? This addresses a serious bug.

@rhansen2
Collaborator

rhansen2 commented May 5, 2023

Hi @SoMuchForSubtlety, thanks for the contribution! It would be great if we could introduce a test case, potentially via a packet capture and the test fixtures?

https://github.com/segmentio/kafka-go/blob/main/message_test.go#L568
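
For illustration, a rough sketch of what such a fixture-driven test could look like; the fixture path and the decodeRecordBatches stub below are hypothetical placeholders, not existing files or functions in the repository.

```go
package kafka

import (
	"os"
	"testing"
)

// decodeRecordBatches is a hypothetical stand-in for whatever internal decode
// path the fixture should exercise; wire it up to the real reader in practice.
var decodeRecordBatches = func(raw []byte) ([][]byte, error) { return nil, nil }

// TestReadEmptyRecordBatch sketches the idea: feed a captured Fetch response
// containing an empty v2 batch into the decoder and assert that it neither
// panics nor yields any records.
func TestReadEmptyRecordBatch(t *testing.T) {
	raw, err := os.ReadFile("fixtures/fetch-empty-batch.bin") // hypothetical fixture
	if err != nil {
		t.Skipf("fixture not available: %v", err)
	}

	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("decoding an empty record batch panicked: %v", r)
		}
	}()

	records, err := decodeRecordBatches(raw)
	if err != nil {
		t.Fatalf("decode failed: %v", err)
	}
	if len(records) != 0 {
		t.Fatalf("expected no records from an empty batch, got %d", len(records))
	}
}
```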

@rhansen2 self-assigned this May 5, 2023
@SoMuchForSubtlety
Contributor Author

I tried to get Kafka to send me empty batches in a test by producing lots of duplicate keys and waiting for them to be compacted, but I did not manage to get it working.

I will try again when I have some time.
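
For reference, the reproduction attempt described above roughly corresponds to the sketch below, using kafka-go's Writer against a topic configured with cleanup.policy=compact. The broker address and topic name are placeholders, and whether the log cleaner ever leaves an empty batch behind is entirely up to the broker, which is what makes a reliable test hard.

```go
package main

import (
	"context"
	"fmt"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Placeholder broker and topic; the topic must use cleanup.policy=compact
	// for the log cleaner to remove duplicate keys at all.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "compacted-topic",
	}
	defer w.Close()

	// Produce the same keys over and over so the cleaner has duplicates to
	// drop. When and whether entire batches end up empty after cleaning is
	// decided by the broker, so this cannot be asserted on deterministically.
	for i := 0; i < 1000; i++ {
		err := w.WriteMessages(context.Background(),
			kafka.Message{Key: []byte("1"), Value: []byte(fmt.Sprintf("data-%d", i))},
			kafka.Message{Key: []byte("2"), Value: []byte(fmt.Sprintf("data-%d", i))},
		)
		if err != nil {
			panic(err)
		}
	}
}
```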

@SoMuchForSubtlety
Contributor Author

Turns out that empty batches are only possible if they were sent by an idempotent producer (which kafka-go currently doesn't support).

On compaction: unlike the older message formats, magic v2 and above preserves the first and last offset/sequence numbers from the original batch when the log is cleaned. [...] The base sequence number must be preserved for duplicate checking (the broker checks incoming Produce requests for duplicates by verifying that the first and last sequence numbers of the incoming batch match the last from that producer). As a result, it is possible to have empty batches in the log when all the records in the batch are cleaned but batch is still retained in order to preserve a producer's last sequence number. 1

Here is the content of a test log created with kafka-go. The batch sequence numbers are not set (-1), so there was no need to preserve empty batches. Therefore, we have gaps between offset 2 and 6, and between 8 and 54 without any batches.

kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /kafka/kafka-logs-e1e7d1bd93aa/kafka-go-69b4a7695fa8c9c1-0/00000000000000000000.log
Dumping /kafka/kafka-logs-e1e7d1bd93aa/kafka-go-69b4a7695fa8c9c1-0/00000000000000000000.log
Starting offset: 0
baseOffset: 0 lastOffset: 2 count: 3 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 0 CreateTime: 1683670701120 size: 97 magic: 2 compresscodec: NONE crc: 1985786587 isvalid: true
| offset: 0 CreateTime: 1683670701120 keysize: 1 valuesize: 4 sequence: -1 headerKeys: [] key: 1 payload: data
| offset: 1 CreateTime: 1683670701120 keysize: 1 valuesize: 4 sequence: -1 headerKeys: [] key: 2 payload: data
| offset: 2 CreateTime: 1683670701120 keysize: 1 valuesize: 4 sequence: -1 headerKeys: [] key: 3 payload: data
baseOffset: 6 lastOffset: 8 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 97 CreateTime: 1683670702141 size: 73 magic: 2 compresscodec: NONE crc: 2738021477 isvalid: true
| offset: 8 CreateTime: 1683670702141 keysize: 1 valuesize: 4 sequence: -1 headerKeys: [] key: 9 payload: data
baseOffset: 54 lastOffset: 56 count: 3 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 170 CreateTime: 1683670711244 size: 100 magic: 2 compresscodec: NONE crc: 3357349968 isvalid: true
| offset: 54 CreateTime: 1683670711244 keysize: 1 valuesize: 5 sequence: -1 headerKeys: [] key: 4 payload: data2
| offset: 55 CreateTime: 1683670711244 keysize: 1 valuesize: 5 sequence: -1 headerKeys: [] key: 5 payload: data2
| offset: 56 CreateTime: 1683670711244 keysize: 1 valuesize: 5 sequence: -1 headerKeys: [] key: 6 payload: data2

I will create some test data with an idempotent producer and write a test case.

Footnotes

  1. https://kafka.apache.org/documentation/#recordbatch
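
For context, the fields involved are part of the v2 RecordBatch header described in the Kafka protocol documentation cited above; the Go struct below only illustrates that layout and is not a type from kafka-go. With an idempotent producer, producerId and baseSequence are set (>= 0), which is why the batch shell has to survive compaction even when its record count drops to zero.

```go
// Layout of the v2 RecordBatch header, following the Kafka protocol
// documentation (https://kafka.apache.org/documentation/#recordbatch).
// Illustrative only; not kafka-go code.
type recordBatchHeader struct {
	BaseOffset           int64
	BatchLength          int32
	PartitionLeaderEpoch int32
	Magic                int8  // 2 for the format discussed here
	CRC                  int32
	Attributes           int16
	LastOffsetDelta      int32
	BaseTimestamp        int64
	MaxTimestamp         int64
	ProducerID           int64 // -1 in the dump above (no idempotent producer)
	ProducerEpoch        int16
	BaseSequence         int32 // preserved by compaction when >= 0
	NumRecords           int32 // length of the records array; 0 for an empty batch
}
```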

@SoMuchForSubtlety
Contributor Author

I'm not sure if this is related to empty batches or not, but I'm seeing data corruption with this patch applied.

2023-05-19T14:30:36+02:00	error	failed to unmarshal revision: proto: Order: illegal tag 0 (wire type 2)
2023-05-19T14:31:17+02:00	error	failed to unmarshal revision: proto: Order: wiretype end group for non-group
2023-05-19T14:31:25+02:00	error	failed to unmarshal revision: proto: wrong wireType = 0 for field ContractId
2023-05-19T14:31:25+02:00	error	failed to unmarshal revision: unexpected EOF
2023-05-19T14:31:27+02:00	error	failed to unmarshal revision: proto: Order: wiretype end group for non-group
2023-05-19T14:31:29+02:00	error	failed to unmarshal revision: proto: Order: illegal tag 0 (wire type 2)

The message headers also seem to be affected, because the committed offsets for one of my partitions get reset:

2023-05-19T14:31:17+02:00	info	committed offsets for group timescale-persist: 
	topic: order-revisions
		partition 9: 56783030
		partition 3: 56721878
		partition 4: 56756076
		partition 6: 56727487
		partition 5: 56756520
		partition 8: 55962711
		partition 7: 56800235
		partition 1: 56813820
		partition 0: 56843322
2023-05-19T14:31:17+02:00	info	committed offsets for group timescale-persist: 
	topic: order-revisions
		partition 3: 56724823
		partition 6: 56727690
		partition 1: 56814771
		partition 7: 56801481
		partition 0: 2
		partition 4: 56756085
		partition 9: 56783034
		partition 8: 55962711
		partition 5: 56761862
2023-05-19T14:31:17+02:00	info	committed offsets for group timescale-persist: 
	topic: order-revisions
		partition 4: 56756086
		partition 8: 55962711
		partition 5: 56765388
		partition 3: 56727107
		partition 7: 56804923
		partition 9: 56785845
		partition 6: 56727694
		partition 1: 56814807
		partition 0: 3

@steevehook

Are contributions to this one allowed? I was wondering what the progress on this issue is; we keep getting these panics frequently. Let me know if help is needed to speed this up ✌️

@niafly

niafly commented Jul 7, 2023

Same question as @steevehook above: I wonder if there's a checklist we can fill out to speed up the process of fixing this. We're maintaining a private fork of the library just to fix this issue, and we really hope to get back in sync with the public releases soon.

@SoMuchForSubtlety
Contributor Author

I ended up switching to a different library since I still experienced strange bugs even with my patch applied. Feel free to pick up my changes and open a new PR.

@rhansen2
Collaborator

rhansen2 commented Jul 7, 2023

Thanks for following up! Any contributions related to this issue are welcome!

@niafly

niafly commented Jul 14, 2023

Hi @rhansen2, I just opened another PR to fix this issue, which our team keeps hitting in production: #1164

It's been a while, so I'm trying to reproduce the situation and provide some additional test cases. Feel free to leave comments on the PR. Thanks.
