Add Cython BinaryEncoder for Avro block encoding#3303
Conversation
Mirrors the existing CythonBinaryDecoder. The pure-Python BinaryEncoder emits each varint byte via bytes([x]) and a stream write per primitive; the Cython implementation writes into a growable char* buffer with inlined zigzag encoding and memcpy, then materialises once via getvalue(). AvroOutputFile.write_block now uses new_memory_encoder() which returns the Cython implementation when the extension is built and falls back to a MemoryBinaryEncoder wrapper otherwise (same pattern as new_decoder()). Encoding 50k realistic ManifestEntry records (14 columns with full stats) goes from 1.64s to 0.36s (4.5x), byte-identical output. Tests are parametrised over both implementations and include int64-boundary round-trips and a byte-equivalence check.
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
| "v", | ||
| [0, 1, -1, 63, 64, -64, -65, 127, 128, -128, 2**31 - 1, -(2**31), 2**62, -(2**62), 2**63 - 1, -(2**63)], | ||
| ) | ||
| def test_int_round_trip(encoder: Any, v: int) -> None: |
There was a problem hiding this comment.
Nice! This is actually stricter than the python encoder method: python currently accepts 2**63 and -(2**63) - 1 even though those aren’t valid Avro longs. Should we align them / add out-of-range tests so behavior doesn’t depend on which encoder is used?
| self._size += n | ||
|
|
||
| def write_utf8(self, s) -> None: | ||
| self.write_bytes(s.encode("utf-8")) |
There was a problem hiding this comment.
minor: should this use UTF8 from pyiceberg.typedef?
Summary
Mirrors the existing
CythonBinaryDecoder(decoder_fast.pyx). The pure-PythonBinaryEncoderemits each varint byte as a freshbytes([x])allocation plus a stream-write call; the Cython implementation writes into a growablechar*buffer with inlined zigzag encoding andmemcpy, then materialises once viagetvalue().Integration
AvroOutputFile.write_blocknow constructs its in-memory block encoder via a newnew_memory_encoder()factory (same pattern asnew_decoder()): returnsCythonBinaryEncoderwhen the extension is built, otherwise a thinMemoryBinaryEncoderwrapper around the existingBinaryEncoder+BytesIO. The header/framing encoder (self.encoder) is unchanged — it writes directly to the output stream and is low-volume.Benchmark
Encoding 50k
ManifestEntryrecords (14 columns with full column stats —column_sizes,value_counts,null_value_counts,lower_bounds,upper_bounds), through the realconstruct_writertree:~4.5× at the encoder-leaf level; the remaining time is the Python
Writertree dispatch, which is unchanged.Testing
tests/avro/test_encoder.pyis parametrised over both implementations so every primitive assertion runs against each.test_int_round_tripcovers zigzag edge cases includingint64min/max via encode→new_decoder→assert.test_encoders_byte_identicalasserts both implementations produce identical bytes for a mixed payload.tests/avro/(171 tests) andtests/utils/test_manifest.py(manifest write/read round-trip) pass.Notes
write_utf8/write_bytesaccept untyped args (matching the pure-Python duck-typed behaviour) since callers passstr-enum values likeFileFormat.PARQUET.write_float/write_doubleuseSTRUCT_FLOAT.pack(explicit little-endian) rather than rawmemcpy, same as the decoder — they're not on the hot path.uint64_tto avoid signed-shift UB.