Add Cython BinaryEncoder for Avro block encoding by rynewang · Pull Request #3303 · apache/iceberg-python

rynewang · 2026-04-30T21:37:01Z

Summary

Mirrors the existing CythonBinaryDecoder (decoder_fast.pyx). The pure-Python BinaryEncoder emits each varint byte as a fresh bytes([x]) allocation plus a stream-write call; the Cython implementation writes into a growable char* buffer with inlined zigzag encoding and memcpy, then materialises once via getvalue().
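
For readers unfamiliar with the encoding, here is a minimal pure-Python sketch of the buffer-based approach described above. It is not the PR's Cython code: `zigzag_encode` and `write_long` are illustrative names, and a `bytearray` stands in for the growable `char*` buffer.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to its unsigned zigzag form."""
    # The mask keeps the result inside 64 bits, as a C uint64_t would.
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def write_long(buf: bytearray, n: int) -> None:
    """Append the zigzag varint encoding of n to a shared buffer."""
    datum = zigzag_encode(n)
    while datum & ~0x7F:
        # Low 7 bits plus a continuation flag in the high bit.
        buf.append((datum & 0x7F) | 0x80)
        datum >>= 7
    buf.append(datum)


buf = bytearray()  # one growable buffer for the whole block
for value in (0, -1, 1, 64):
    write_long(buf, value)

payload = bytes(buf)  # materialise once, analogous to getvalue()
```

The point of the design is that nothing is allocated per byte: each primitive appends to the same buffer and the bytes object is created a single time at the end.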

Integration

AvroOutputFile.write_block now constructs its in-memory block encoder via a new new_memory_encoder() factory (same pattern as new_decoder()): returns CythonBinaryEncoder when the extension is built, otherwise a thin MemoryBinaryEncoder wrapper around the existing BinaryEncoder + BytesIO. The header/framing encoder (self.encoder) is unchanged — it writes directly to the output stream and is low-volume.
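
The factory-with-fallback pattern could look roughly like the sketch below. The module name `_hypothetical_encoder_fast` and the bodies of both classes are assumptions made for illustration; only the names `new_memory_encoder`, `CythonBinaryEncoder`, `MemoryBinaryEncoder` and `getvalue` come from the description above.

```python
import io

try:
    # Hypothetical module path standing in for the compiled extension.
    from _hypothetical_encoder_fast import CythonBinaryEncoder
except ImportError:
    CythonBinaryEncoder = None  # extension not built


class MemoryBinaryEncoder:
    """Fallback sketch: stream-style writes into a private BytesIO."""

    def __init__(self) -> None:
        self._stream = io.BytesIO()

    def write(self, b: bytes) -> None:
        self._stream.write(b)

    def getvalue(self) -> bytes:
        return self._stream.getvalue()


def new_memory_encoder():
    """Return the fast encoder when available, else the fallback."""
    if CythonBinaryEncoder is not None:
        return CythonBinaryEncoder()
    return MemoryBinaryEncoder()
```

Callers only depend on the shared interface, so which implementation they get is decided once at import time.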

Benchmark

Encoding 50k ManifestEntry records (14 columns with full column stats — column_sizes, value_counts, null_value_counts, lower_bounds, upper_bounds), through the real construct_writer tree:

| encoder | wall time | throughput (records/s) | output bytes |
|---|---|---|---|
| pure Python | 1.64 s | 30.5 k/s | 18,492,808 |
| Cython | 0.36 s | 138.0 k/s | 18,492,808 |

~4.5× at the encoder-leaf level; the remaining time is the Python Writer tree dispatch, which is unchanged.
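
As a quick sanity check, the speedup can be recomputed from the reported wall times. The snippet only reuses the numbers quoted above; the variable names are illustrative.

```python
records = 50_000
wall_seconds = {"pure Python": 1.64, "Cython": 0.36}

# Records per second for each encoder, derived from the wall times.
throughput = {name: records / secs for name, secs in wall_seconds.items()}

# Ratio of wall times, i.e. the claimed ~4.5x.
speedup = wall_seconds["pure Python"] / wall_seconds["Cython"]
```

Recomputing from the rounded 0.36 s gives a Cython throughput slightly above the reported 138.0 k/s, presumably because the displayed wall times are rounded.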

Testing

tests/avro/test_encoder.py is parametrised over both implementations so every primitive assertion runs against each.
New test_int_round_trip covers zigzag edge cases including int64 min/max via encode→new_decoder→assert.
New test_encoders_byte_identical asserts both implementations produce identical bytes for a mixed payload.
Existing tests/avro/ (171 tests) and tests/utils/test_manifest.py (manifest write/read round-trip) pass.
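
The shape of these checks can be sketched without the real test suite. Below, two stand-in encoders (one writing a byte at a time to a stream, one appending to a buffer) are run over int64 edge cases, decoded again, and compared byte for byte. All function names are hypothetical and none of this is the code in tests/avro/test_encoder.py.

```python
import io

EDGE_CASES = [0, 1, -1, 63, 64, -64, -65, 2**31 - 1, -(2**31), 2**63 - 1, -(2**63)]


def encode_streaming(n: int) -> bytes:
    """Stand-in for the per-byte, stream-write style."""
    out = io.BytesIO()
    datum = (n << 1) ^ (n >> 63)
    while datum & ~0x7F:
        out.write(bytes([(datum & 0x7F) | 0x80]))
        datum >>= 7
    out.write(bytes([datum]))
    return out.getvalue()


def encode_buffered(n: int) -> bytes:
    """Stand-in for the single-buffer style."""
    buf = bytearray()
    datum = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    while datum > 0x7F:
        buf.append((datum & 0x7F) | 0x80)
        datum >>= 7
    buf.append(datum)
    return bytes(buf)


def decode_long(data: bytes) -> int:
    """Decode one zigzag varint from the start of data."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return (result >> 1) ^ -(result & 1)


for v in EDGE_CASES:
    a, b = encode_streaming(v), encode_buffered(v)
    assert a == b, f"encoders disagree for {v}"
    assert decode_long(a) == v, f"round trip failed for {v}"
```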

Notes

write_utf8 / write_bytes accept untyped args (matching the pure-Python duck-typed behaviour) since callers pass str-enum values like FileFormat.PARQUET.
write_float / write_double use STRUCT_FLOAT.pack (explicit little-endian) rather than raw memcpy, same as the decoder — they're not on the hot path.
Zigzag is done on uint64_t to avoid signed-shift UB.
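
Two of these notes can be illustrated in plain Python. The sketch assumes `STRUCT_FLOAT` is a `struct.Struct("<f")` (the `<` forces little-endian regardless of host byte order) and emulates `uint64_t` arithmetic with a 64-bit mask; `zigzag_unsigned` is a hypothetical name, not a function from the PR.

```python
import struct

# Assumed definitions: explicit little-endian IEEE 754 packers.
STRUCT_FLOAT = struct.Struct("<f")
STRUCT_DOUBLE = struct.Struct("<d")

float_bytes = STRUCT_FLOAT.pack(1.0)    # 4 bytes, least significant first
double_bytes = STRUCT_DOUBLE.pack(1.0)  # 8 bytes, least significant first

MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_unsigned(n: int) -> int:
    """Zigzag computed on the unsigned 64-bit reinterpretation of n."""
    u = n & MASK64                           # like a cast to uint64_t
    return ((u << 1) ^ -(u >> 63)) & MASK64  # no signed left shift involved
```

In C, left-shifting a negative signed value is undefined behaviour under older standards, which is why the unsigned formulation is the safer one to write in Cython.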

github-actions · 2026-05-31T00:52:12Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

abnobdoss · 2026-06-01T01:05:55Z

+    "v",
+    [0, 1, -1, 63, 64, -64, -65, 127, 128, -128, 2**31 - 1, -(2**31), 2**62, -(2**62), 2**63 - 1, -(2**63)],
+)
+def test_int_round_trip(encoder: Any, v: int) -> None:


Nice! This is actually stricter than the pure-Python encoder: Python currently accepts 2**63 and -(2**63) - 1 even though those aren't valid Avro longs. Should we align the two and add out-of-range tests so behavior doesn't depend on which encoder is used?
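
One way to align the two, sketched under assumptions: a shared range check that both encoders would call before writing a long. `check_avro_long` is a hypothetical helper, and raising `OverflowError` is a choice made for this illustration rather than something the PR specifies.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def check_avro_long(n: int) -> int:
    """Return n unchanged, or raise if it does not fit an Avro long."""
    if not INT64_MIN <= n <= INT64_MAX:
        raise OverflowError(f"{n} is out of range for an Avro long")
    return n


def is_rejected(n: int) -> bool:
    """Helper for out-of-range tests: True when the check raises."""
    try:
        check_avro_long(n)
    except OverflowError:
        return True
    return False
```

An out-of-range test parametrised over both encoders could then assert rejection for 2**63 and -(2**63) - 1 and acceptance at the exact boundaries.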

abnobdoss · 2026-06-01T01:11:32Z

+        self._size += n
+
+    def write_utf8(self, s) -> None:
+        self.write_bytes(s.encode("utf-8"))


minor: should this use UTF8 from pyiceberg.typedef?

rynewang mentioned this pull request Apr 30, 2026

Roll added manifests at commit.manifest.target-size-bytes in fast-append #3304

Closed

github-actions Bot added the stale label May 31, 2026

abnobdoss approved these changes Jun 1, 2026

rynewang wants to merge 1 commit into apache:main from rynewang:perf/cython-avro-encoder