gh-138122: Allow tachyon to write and read binary output #142730
Conversation
Defines the API and data structures for a high-performance binary format for profiling data. The format uses string/frame deduplication, varint encoding, and delta compression to achieve 10-50x size reduction compared to text formats. Optional zstd compression provides additional savings. The header includes inline varint encode/decode functions since these are called in tight loops during both writing and reading. Structures for both writer (BinaryWriter) and reader (BinaryReader) are defined here to allow the module.c bindings to allocate them.
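For intuition, the varint scheme is presumably the usual LEB128-style encoding (seven payload bits per byte, high bit as a continuation flag). A minimal Python sketch of that idea follows; the real implementation is inline C in the header, and any wire-level details beyond the general scheme are assumptions:

def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, high bit = continue."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7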
Implements streaming binary output with delta compression. The writer tracks per-thread state to encode stack changes efficiently: identical stacks use RLE, similar stacks encode only the differing frames. String and frame deduplication uses Python's hashtable implementation for O(1) lookup during interning. The 512KB write buffer amortizes syscall overhead. When zstd is available, data streams through compression before hitting disk. Finalization writes the string/frame tables and footer, then seeks back to update the header with final counts and offsets.
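A rough Python model of the per-thread delta choice described above (the record names and prefix comparison are assumptions; interning, buffering, and thread bookkeeping are omitted):

def choose_record(prev: list[int], curr: list[int]):
    """Pick the cheapest encoding for curr given this thread's previous stack.

    Stacks are lists of interned frame IDs, assumed stored root-first so the
    shared portion is a common prefix.
    """
    if curr == prev:
        return ("repeat",)                # RLE: identical stack, bump a counter
    common = 0
    limit = min(len(prev), len(curr))
    while common < limit and prev[common] == curr[common]:
        common += 1
    if common == len(prev):               # prev is a prefix of curr
        return ("suffix", curr[common:])  # encode only the appended frames
    # General case: pop the divergent tail, then push the new frames.
    return ("pop-push", len(prev) - common, curr[common:])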
Implements binary file parsing with stack reconstruction. On Unix, uses mmap with MADV_SEQUENTIAL for efficient sequential access. Falls back to buffered I/O on Windows. The reader reconstructs full stacks from delta-encoded records by maintaining per-thread state. Each sample's stack is rebuilt by applying the encoded operation (repeat/suffix/pop-push) to the previous stack for that thread. Replay feeds reconstructed samples to any collector, enabling conversion between formats without re-profiling.
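Reconstruction is the inverse of the writer sketch above, applied per thread:

def apply_record(prev: list[int], record) -> list[int]:
    """Rebuild a sample's full stack from this thread's previous stack."""
    kind = record[0]
    if kind == "repeat":
        return list(prev)                  # identical to the previous sample
    if kind == "suffix":
        return prev + record[1]            # shared prefix plus appended frames
    _, pops, pushes = record               # "pop-push"
    return prev[:len(prev) - pops] + pushes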
Adds binary_io_writer.c and binary_io_reader.c to the _remote_debugging module compilation. Also hooks up optional zstd support: when libzstd is found by pkg-config, the module compiles with HAVE_ZSTD defined and links against libzstd. Without zstd, the module still builds but compression is unavailable.
Adds binary_io_writer.c, binary_io_reader.c, and binary_io.h to the Visual Studio project for _remote_debugging.
Exposes BinaryWriter and BinaryReader as Python types in the _remote_debugging module. BinaryWriter wraps the C writer with write_sample() and finalize() methods. BinaryReader provides replay() to feed samples through any collector. Also adds a zstd_available() function so Python code can check whether compression support was compiled in.
Thin wrapper around the C BinaryWriter. Implements the Collector interface so it can be used interchangeably with other collectors like FlamegraphCollector or GeckoCollector. Compression is configurable: 'auto' uses zstd when available, 'zstd' requires it, 'none' disables compression. The collector passes samples directly to C for encoding without building Python data structures.
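Usage would plausibly look like the following; the class name, module path, and constructor arguments here are guesses, and only the 'auto'/'zstd'/'none' compression values come from the description above:

# Hypothetical import path and constructor; not the PR's actual spelling.
from profiling.sampling.binary_collector import BinaryCollector

collector = BinaryCollector("profile.bin", compression="auto")
# ... the sampler invokes collector.collect(...) for each captured sample ...
collector.finalize()  # flush, write string/frame tables and footer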
Wrapper around the C BinaryReader providing file info access and replay functionality. The replay() method reconstructs samples from the binary file and feeds them to any collector, enabling format conversion without re-profiling. Includes get_info() for metadata access (sample count, thread count, compression type) and get_stats() for decoding statistics.
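replay(), get_info(), and get_stats() are the method names given above; the constructor arguments and import paths in this sketch are assumptions:

from profiling.sampling.binary_reader import BinaryReader      # hypothetical path
from profiling.sampling.flamegraph import FlamegraphCollector  # hypothetical path

reader = BinaryReader("profile.bin")
info = reader.get_info()              # sample count, thread count, compression type
reader.replay(FlamegraphCollector())  # feed stored samples through any collector
stats = reader.get_stats()            # decoding statistics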
Adds a --binary output format and a --compression option to the run/attach commands. The replay command converts binary profiles to other formats:
python -m profiling.sampling replay profile.bin
python -m profiling.sampling replay --flamegraph -o out.html profile.bin
This enables a record-and-replay workflow: capture in binary format during profiling (faster, smaller files), then convert to visualization formats later without re-profiling.
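Put together, a recording session and a later conversion might look like this (the run command's exact flags beyond --binary and --compression are guesses):

python -m profiling.sampling run --binary --compression auto -o profile.bin script.py
python -m profiling.sampling replay --flamegraph -o out.html profile.bin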
Adds optional timestamp_us parameter to Collector.collect() method. During live profiling this is None and collectors use their own timing. During binary replay the stored timestamp is passed through, allowing collectors to reconstruct the original timing. Also fixes gecko_collector to use time.monotonic() instead of time.time() for consistency with other collectors.
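In collector terms the contract might look like this sketch; the base interface and storage are illustrative, and only the timestamp_us parameter and its None-during-live-profiling semantics come from the description:

import time

class ExampleCollector:
    """Illustrative collector honoring the optional timestamp_us argument."""

    def __init__(self):
        self.samples = []

    def collect(self, stack_frames, timestamp_us=None):
        if timestamp_us is None:
            # Live profiling: no stored timestamp, use our own monotonic clock.
            timestamp_us = int(time.monotonic() * 1_000_000)
        # During binary replay, timestamp_us carries the original capture time.
        self.samples.append((timestamp_us, stack_frames))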
Tests cover the full write/read cycle, delta encoding (RLE, suffix, pop-push), compression modes, edge cases (empty files, deep stacks, many threads), and replay through different collectors. The mock-based tests verify encoding behavior without needing actual profiling, while integration tests exercise the complete pipeline.
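As one illustrative shape for the delta-encoding tests, here is a round trip over the choose_record/apply_record sketches from earlier (not the PR's actual test code):

import unittest

class DeltaRoundTrip(unittest.TestCase):
    # Assumes the choose_record/apply_record sketches above are in scope.

    def test_pop_push_round_trip(self):
        prev = [1, 2, 3, 4]
        curr = [1, 2, 7, 8, 9]  # shares a two-frame prefix with prev
        record = choose_record(prev, curr)
        self.assertEqual(record[0], "pop-push")
        self.assertEqual(apply_record(prev, record), curr)

if __name__ == "__main__":
    unittest.main()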
Documents the file layout, encoding schemes, and design rationale. Covers header/footer structure, delta encoding types (repeat, suffix, pop-push), string/frame deduplication, and compression integration. Intended for developers working on the profiler implementation.
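From the writer and reader descriptions above, the file regions are roughly as follows (field-level contents live in the design doc and are not reproduced here):

header          fixed size; counts and table offsets patched at finalize time
sample records  delta-encoded per thread (repeat / suffix / pop-push)
string table    each unique string stored once
frame table     frames reference string-table entries
footer          written last, locates the tables for the reader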
Adds user documentation for --binary output format and the replay command. Covers compression options, the record-and-replay workflow, and examples of converting between formats.
I ran some benchmarks to validate the binary format implementation. Here's what I found.

The test workload ran a number of tests from the test suite (test_list, test_dict, test_tokenize, test_exceptions, test_syntax, test_threading), taking approximately 28 seconds on Linux with my Intel hybrid CPU clocked at 4.9 GHz, using zstd level 5 streaming compression with a 2 MB window.

The binary writer hits 199,175 samples/second in this run, capturing 5.6 million samples. For reference, that's enough to profile 199 threads at once with 1 ms sampling. zstd compression is 0.19% of total CPU time, so the binary format overhead is essentially free. Within the profiler extension, the binary writing and compression functions don't even show up in the hot-function profile: they're below the 0.5% threshold. All the profiler overhead is in reading remote memory and unwinding stacks, not in the output format.

Compression gets a 159.6x ratio on profiling data, turning 74.51 bytes per sample into 0.47 bytes. A 1-hour profile at 1000 samples/sec that would normally take 268 MB on disk shrinks to just 1.7 MB.

The interning system stores each unique string once and references it an average of 1,658 times. Each unique frame gets referenced 653 times. Without interning, string data alone would be 79 MB; with interning, it's 42.6 KB.

The encoding stats show RLE (run-length encoding) is working well: 27% of records are RLE repeats, covering 5.5M samples. The frame efficiency shows we're saving 99.9% of frame writes through the encoding schemes.
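For anyone checking the arithmetic, the disk-size claim follows directly from the per-sample figures:

samples = 3600 * 1000           # one hour at 1000 samples/sec
raw_mb = samples * 74.51 / 1e6  # ~268 MB uncompressed
zstd_mb = samples * 0.47 / 1e6  # ~1.7 MB compressed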
Merged changes from upstream/main including:
- Subprocess enumeration functionality (get_child_pids, is_python_process)
- Various fixes and improvements

Combined with file-output branch features:
- Binary I/O writer and reader for profiling data
- Binary format export/replay support

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
📚 Documentation preview 📚: https://cpython-previews--142730.org.readthedocs.build/