Add `tracing` support to the compressor by connortsui20 · Pull Request #7385 · vortex-data/vortex

connortsui20 · 2026-04-10T14:24:06Z

Summary

Tracking issue: #7216

We have very little observability into the compressor. When debugging, we cannot tell which schemes the compressor tries, how accurate its estimates are, how reliable sampling is, or what the cascading paths look like.

This change adds tracing support to vortex-compressor. The compressor now emits structured tracing spans and events across four composable RUST_LOG targets (cascade, select, estimate, encode).

The most important event is scheme.compress_result, which reports before/after byte counts and the estimated vs. actual ratio. A new short_circuit { reason = "larger_output" } surfaces the previously silent case where a chosen scheme produced an output larger than the canonical input.
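To make the shape of that event concrete, here is a minimal std-only sketch. The field names (`scheme`, `before_nbytes`, `after_nbytes`, `estimated_ratio`, `accepted`) are the ones queried in the jq examples in this thread; the `CompressResult` struct, its methods, and the exact rule "rejected when the output is strictly larger" are assumptions for illustration, not the PR's actual code.

```rust
/// Hypothetical stand-in for the fields carried by `scheme.compress_result`.
#[derive(Debug, Clone, PartialEq)]
pub struct CompressResult {
    pub scheme: &'static str,
    pub before_nbytes: u64,
    pub after_nbytes: u64,
    /// Absent when the scheme produced no numeric estimate.
    pub estimated_ratio: Option<f64>,
}

impl CompressResult {
    /// Actual ratio = bytes before / bytes after. `None` for an empty output.
    pub fn actual_ratio(&self) -> Option<f64> {
        if self.after_nbytes == 0 {
            None
        } else {
            Some(self.before_nbytes as f64 / self.after_nbytes as f64)
        }
    }

    /// Assumed rule: the result is kept unless the output grew.
    pub fn accepted(&self) -> bool {
        self.after_nbytes <= self.before_nbytes
    }

    /// Assumed mapping to the short-circuit reason named in the description.
    pub fn short_circuit_reason(&self) -> Option<&'static str> {
        if self.accepted() {
            None
        } else {
            Some("larger_output")
        }
    }
}
```

With `before_nbytes = 2` and `after_nbytes = 640`, `actual_ratio()` is `2 / 640 = 0.003125` and the result is reported as not accepted with reason `larger_output`.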

All instrumentation lives in the orchestration layer (none of the ~23 individual Scheme impls were touched), field names are stable so tracing-perfetto/opentelemetry/timing subscribers work with no adapter code, and an integration test pins the event names against rename.

I still need to figure out how to keep this useful when the compressor emits a HUGE volume of logs (for example, when generating TPC-H partition files).

Testing

Some basic integration testing for tracing.

connortsui20 · 2026-04-10T15:43:59Z

Here is an example of the information we can get from the tracing JSON output.

This looks at the trace from generating TPC-H SF1 data, which produces 6837 log events.

RUST_LOG=vortex_compressor::encode=debug \
          cargo run --release --bin data-gen -- \
              --log-format json \
              --opt scale-factor=1.0 \
              --formats vortex \
              tpch \
          2> trace.jsonl

This example counts, per scheme, every time the compressor chose a scheme and the final result ended up larger than the original array:

jq -r 'select(.fields.message == "scheme.compress_result"
                and .fields.accepted == false)
         | .fields.scheme' trace.jsonl \
        | sort | uniq -c | sort -rn

 143 vortex.int.for
  27 vortex.bool.constant
  24 vortex.int.dict
  10 vortex.int.bitpacking
   6 vortex.int.constant

We can also see that the FoR estimator is far off for some reason:

❯ jq -n '
    [inputs
     | select(.fields.message == "scheme.compress_result")
     | .fields as $f
     | select(($f.after_nbytes // 0) > 0
              and ($f.before_nbytes // 0) > 0
              and ($f.estimated_ratio // null) != null)
     | ($f.before_nbytes / $f.after_nbytes) as $actual_ratio
     | {
         scheme:          $f.scheme,
         estimated_ratio: $f.estimated_ratio,
         actual_ratio:    $actual_ratio,
         before_nbytes:   $f.before_nbytes,
         after_nbytes:    $f.after_nbytes,
         accepted:        $f.accepted,
         relative_error:  (($f.estimated_ratio - $actual_ratio) / $actual_ratio)
       }
     | select(.relative_error > 0)
    ]
    | sort_by(.relative_error)
    | reverse
    | .[:15]
  ' trace.jsonl
[
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 1.6,
    "actual_ratio": 0.003125,
    "before_nbytes": 2,
    "after_nbytes": 640,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 4.571428571428571,
    "actual_ratio": 0.008928571428571428,
    "before_nbytes": 8,
    "after_nbytes": 896,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 4.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 8,
    "after_nbytes": 1024,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 1.6,
    "actual_ratio": 0.003125,
    "before_nbytes": 2,
    "after_nbytes": 640,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 8.0,
    "actual_ratio": 0.015625,
    "before_nbytes": 2,
    "after_nbytes": 128,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.56,
    "actual_ratio": 0.005,
    "before_nbytes": 16,
    "after_nbytes": 3200,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 5.818181818181818,
    "actual_ratio": 0.011363636363636364,
    "before_nbytes": 16,
    "after_nbytes": 1408,
    "accepted": false,
    "relative_error": 511
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.6666666666666665,
    "actual_ratio": 0.005208333333333333,
    "before_nbytes": 2,
    "after_nbytes": 384,
    "accepted": false,
    "relative_error": 510.99999999999994
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 5.333333333333333,
    "actual_ratio": 0.010416666666666666,
    "before_nbytes": 8,
    "after_nbytes": 768,
    "accepted": false,
    "relative_error": 510.99999999999994
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 4,
    "after_nbytes": 512,
    "accepted": false,
    "relative_error": 255
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 4,
    "after_nbytes": 512,
    "accepted": false,
    "relative_error": 255
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 4,
    "after_nbytes": 512,
    "accepted": false,
    "relative_error": 255
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 4,
    "after_nbytes": 512,
    "accepted": false,
    "relative_error": 255
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 4,
    "after_nbytes": 512,
    "accepted": false,
    "relative_error": 255
  },
  {
    "scheme": "vortex.int.for",
    "estimated_ratio": 2.0,
    "actual_ratio": 0.0078125,
    "before_nbytes": 4,
    "after_nbytes": 512,
    "accepted": false,
    "relative_error": 255
  }
]
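The `relative_error` field in the query above is `(estimated_ratio - actual_ratio) / actual_ratio`, with `actual_ratio = before_nbytes / after_nbytes`. A small std-only sketch of the same arithmetic (the function name is hypothetical; it is not part of the PR):

```rust
/// Relative error of an estimated compression ratio against the measured one.
///
/// Mirrors the jq expression: rows with zero `before_nbytes` or `after_nbytes`
/// are skipped, which is modelled here by returning `None`.
pub fn relative_error(estimated_ratio: f64, before_nbytes: u64, after_nbytes: u64) -> Option<f64> {
    if before_nbytes == 0 || after_nbytes == 0 {
        return None;
    }
    let actual_ratio = before_nbytes as f64 / after_nbytes as f64;
    Some((estimated_ratio - actual_ratio) / actual_ratio)
}
```

For the first row, `actual_ratio = 2 / 640 = 0.003125`, so the error is `(1.6 - 0.003125) / 0.003125 = 511`. In other words, the estimate is 512 times the measured ratio; the rows with an error of 255 are off by a factor of 256.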

Finally, here is how many bytes each accepted scheme saved across all of the TPC-H SF1 data.

❯ jq -c 'select(.fields.message == "scheme.compress_result"
                and .fields.accepted == true)
         | {scheme: .fields.scheme,
            saved: (.fields.before_nbytes - .fields.after_nbytes)}' \
          trace.jsonl \
        | jq -s 'group_by(.scheme)
             | map({scheme: .[0].scheme,
                    n: length,
                    total_saved: (map(.saved) | add)})
             | sort_by(-.total_saved)'

[
  {
    "scheme": "vortex.string.fsst",
    "n": 1330,
    "total_saved": 357089797
  },
  {
    "scheme": "vortex.int.bitpacking",
    "n": 3670,
    "total_saved": 195278268
  },
  {
    "scheme": "vortex.decimal.byte_parts",
    "n": 247,
    "total_saved": 172207404
  },
  {
    "scheme": "vortex.int.for",
    "n": 762,
    "total_saved": 82129120
  },
  {
    "scheme": "vortex.int.runend",
    "n": 74,
    "total_saved": 25347394
  },
  {
    "scheme": "vortex.int.constant",
    "n": 181,
    "total_saved": 6516208
  },
  {
    "scheme": "vortex.int.sequence",
    "n": 40,
    "total_saved": 5439040
  },
  {
    "scheme": "vortex.string.dict",
    "n": 17,
    "total_saved": 2188574
  },
  {
    "scheme": "vortex.string.constant",
    "n": 39,
    "total_saved": 128547
  },
  {
    "scheme": "vortex.int.rle",
    "n": 71,
    "total_saved": 62026
  },
  {
    "scheme": "vortex.int.dict",
    "n": 69,
    "total_saved": 19213
  },
  {
    "scheme": "vortex.bool.constant",
    "n": 120,
    "total_saved": 3290
  },
  {
    "scheme": "vortex.int.sparse",
    "n": 7,
    "total_saved": 1405
  }
]
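For anyone who would rather post-process the trace in code than in jq, the group-and-sum step can be sketched in std-only Rust as below. The `SchemeSavings` type and `total_saved_by_scheme` function are hypothetical, and the input is assumed to be already-parsed `(scheme, before_nbytes, after_nbytes)` tuples for accepted results; JSON parsing is left out.

```rust
use std::collections::BTreeMap;

/// One output row: scheme name, number of accepted results, total bytes saved.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SchemeSavings {
    pub scheme: String,
    pub n: usize,
    pub total_saved: i64,
}

/// Groups accepted results by scheme and sorts by total bytes saved, descending.
pub fn total_saved_by_scheme(records: &[(&str, u64, u64)]) -> Vec<SchemeSavings> {
    let mut by_scheme: BTreeMap<&str, (usize, i64)> = BTreeMap::new();
    for &(scheme, before_nbytes, after_nbytes) in records {
        let entry = by_scheme.entry(scheme).or_insert((0, 0));
        entry.0 += 1;
        // Signed arithmetic so a grown output would show up as negative savings.
        entry.1 += before_nbytes as i64 - after_nbytes as i64;
    }
    let mut rows: Vec<SchemeSavings> = by_scheme
        .into_iter()
        .map(|(scheme, (n, total_saved))| SchemeSavings {
            scheme: scheme.to_string(),
            n,
            total_saved,
        })
        .collect();
    // Stable sort: ties keep the alphabetical order from the BTreeMap.
    rows.sort_by(|a, b| b.total_saved.cmp(&a.total_saved));
    rows
}
```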

## Summary

Tracking issue: #7216

Makes the compressor types more robust (removes the possibility of invalid state), which additionally makes adding tracing easier (draft at #7385).

## API Changes

Changes some types:

```rust
/// Closure type for [`DeferredEstimate::Callback`].
///
/// The compressor calls this with the same arguments it would pass to sampling. The closure must
/// resolve directly to a terminal [`EstimateVerdict`].
#[rustfmt::skip]
pub type EstimateFn = dyn FnOnce(
        &CascadingCompressor,
        &mut ArrayAndStats,
        CompressorContext,
    ) -> VortexResult<EstimateVerdict>
    + Send
    + Sync;

/// The result of a [`Scheme`]'s compression ratio estimation.
///
/// This type is returned by [`Scheme::expected_compression_ratio`] to tell the compressor how
/// promising this scheme is for a given array without performing any expensive work.
///
/// [`CompressionEstimate::Verdict`] means the scheme already knows the terminal answer.
/// [`CompressionEstimate::Deferred`] means the compressor must do extra work before the scheme can
/// produce a terminal answer.
#[derive(Debug)]
pub enum CompressionEstimate {
    /// The scheme already knows the terminal estimation verdict.
    Verdict(EstimateVerdict),
    /// The compressor must perform deferred work to resolve the terminal estimation verdict.
    Deferred(DeferredEstimate),
}

/// The terminal answer to a compression estimate request.
#[derive(Debug)]
pub enum EstimateVerdict {
    /// Do not use this scheme for this array.
    Skip,
    /// Always use this scheme, as it is definitively the best choice.
    ///
    /// Some examples include constant detection, decimal byte parts, and temporal decomposition.
    ///
    /// The compressor will select this scheme immediately without evaluating further candidates.
    /// Schemes that return `AlwaysUse` must be mutually exclusive per canonical type (enforced by
    /// [`Scheme::matches`]), otherwise the winner depends silently on registration order.
    ///
    /// [`Scheme::matches`]: crate::scheme::Scheme::matches
    AlwaysUse,
    /// The estimated compression ratio. This must be greater than `1.0` to be considered by the
    /// compressor, otherwise it is worse than the canonical encoding.
    Ratio(f64),
}

/// Deferred work that can resolve to a terminal [`EstimateVerdict`].
pub enum DeferredEstimate {
    /// The scheme cannot cheaply estimate its ratio, so the compressor should compress a small
    /// sample to determine effectiveness.
    Sample,
    /// A fallible estimation requiring a custom expensive computation.
    ///
    /// Use this only when the scheme needs to perform trial encoding or other costly checks to
    /// determine its compression ratio. The callback returns an [`EstimateVerdict`] directly, so
    /// it cannot request more sampling or another deferred callback.
    Callback(Box<EstimateFn>),
}
```

This will also make some changes that we want to make in the future easier (tracing, better decision making for what things to try, etc.).

## Testing

Some new tests

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
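To illustrate how these types compose, here is a self-contained sketch of resolving an estimate to a terminal verdict and picking a winner. It is deliberately simplified and makes several assumptions: the callback takes no arguments and is infallible (the real `EstimateFn` takes the compressor, stats, and context and returns a `VortexResult`), `resolve` and `pick_winner` are hypothetical names, sampling is modelled as a plain closure, and "highest ratio wins" is assumed rather than taken from the compressor source.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EstimateVerdict {
    Skip,
    AlwaysUse,
    Ratio(f64),
}

/// Simplified callback type for this sketch only (no arguments, infallible).
pub type EstimateFn = dyn FnOnce() -> EstimateVerdict + Send + Sync;

pub enum DeferredEstimate {
    Sample,
    Callback(Box<EstimateFn>),
}

pub enum CompressionEstimate {
    Verdict(EstimateVerdict),
    Deferred(DeferredEstimate),
}

/// Resolves any estimate to a terminal verdict. Deferred work cannot defer again,
/// because both deferred paths return an `EstimateVerdict` directly.
pub fn resolve(estimate: CompressionEstimate, sample: impl FnOnce() -> f64) -> EstimateVerdict {
    match estimate {
        CompressionEstimate::Verdict(verdict) => verdict,
        CompressionEstimate::Deferred(DeferredEstimate::Sample) => EstimateVerdict::Ratio(sample()),
        CompressionEstimate::Deferred(DeferredEstimate::Callback(callback)) => callback(),
    }
}

/// Picks a scheme name: `AlwaysUse` wins immediately, ratios at or below 1.0 are ignored,
/// and otherwise the highest ratio wins (an assumption of this sketch).
pub fn pick_winner(
    candidates: Vec<(&'static str, CompressionEstimate)>,
    mut sample: impl FnMut(&str) -> f64,
) -> Option<&'static str> {
    let mut best: Option<(&'static str, f64)> = None;
    for (name, estimate) in candidates {
        match resolve(estimate, || sample(name)) {
            EstimateVerdict::Skip => {}
            EstimateVerdict::AlwaysUse => return Some(name),
            EstimateVerdict::Ratio(ratio) => {
                if ratio > 1.0 && best.map_or(true, |(_, b)| ratio > b) {
                    best = Some((name, ratio));
                }
            }
        }
    }
    best.map(|(name, _)| name)
}
```

The point of the split is visible in `resolve`: every path ends in an `EstimateVerdict`, so there is no way to represent a callback that asks for more sampling.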

robert3005

I think this is reasonable but will wait for it to not be a draft

Instrument the cascading compressor with composable `tracing` spans and events so users can see what the compressor is doing, compare estimated and actual compression ratios, time individual phases, and surface previously-silent "compressed but the output grew" decisions.

Four targets let users select one aspect at a time via `RUST_LOG`:

- `vortex_compressor::cascade`: top-level + `compress_child` spans
- `vortex_compressor::select`: scheme eligibility, evaluation, winner, and short-circuit reasons
- `vortex_compressor::estimate`: sampling span and sample.collected / sample.result events
- `vortex_compressor::encode`: per-scheme encode span and the scheme.compress_result event with estimated vs actual ratio + accepted

Spans are at `trace` level so `tracing-perfetto` / `tracing-timing` / `tracing-opentelemetry` only materialize them on demand. Events are at `debug` for outcomes so `RUST_LOG=vortex_compressor::encode=debug` produces one readable summary line per leaf.

New `tests/tracing.rs` uses a custom capture layer (not `TestWriter`) to pin the names and stable fields of the emitted events so downstream observability tooling does not break under rename.

Instrumentation lives entirely in the orchestration layer (compressor.rs + estimate.rs); individual scheme implementations are untouched. The existing unstructured calls in estimate.rs and the stale commented-out line in compressor.rs are removed.

A new `# Observability` section in the crate docs carries the full target / span / event reference with `RUST_LOG` recipes.

Signed-off-by: Claude <noreply@anthropic.com>
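Because the targets are composable, several can be enabled at once by joining `target=level` directives with commas, which is the usual `RUST_LOG` filter syntax. A tiny std-only sketch of building such a value (the `rust_log` helper is hypothetical; the target names are the four listed in this commit message):

```rust
/// The four targets named in the commit message.
pub const TARGETS: [&str; 4] = [
    "vortex_compressor::cascade",
    "vortex_compressor::select",
    "vortex_compressor::estimate",
    "vortex_compressor::encode",
];

/// Joins `(target, level)` pairs into a comma-separated `RUST_LOG` value.
pub fn rust_log(selected: &[(&str, &str)]) -> String {
    selected
        .iter()
        .map(|(target, level)| format!("{target}={level}"))
        .collect::<Vec<_>>()
        .join(",")
}
```

For example, pairing `encode` at `debug` with `cascade` at `trace` yields `vortex_compressor::encode=debug,vortex_compressor::cascade=trace`, which would combine the per-leaf summary events with the top-level spans.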

Instrument `BtrBlocksCompressor::compress` with a `#[tracing::instrument]` on the `vortex_compressor::cascade` target so downstream trace consumers (tracing-perfetto, tracing-opentelemetry) get a distinct BtrBlocks entry frame nested above the generic `CascadingCompressor::compress` pipeline span.

Also delete the stray `tracing::debug!("zigzag output: {}", ...)` line in `schemes/integer.rs`: it predates the centralized `scheme.compress_result` event and is now redundant.

Add a short `# Observability` section to the crate docs pointing at `vortex_compressor`'s full reference, plus one recipe.

Signed-off-by: Claude <noreply@anthropic.com>

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 added the changelog/feature label Apr 10, 2026

connortsui20 requested review from a10y and robert3005 April 10, 2026 14:24

connortsui20 force-pushed the ct/compress-tracing branch from 7f4b5d7 to 3f1411c on April 13, 2026 15:08

connortsui20 mentioned this pull request Apr 13, 2026

More robust types in the compressor #7415

Merged

connortsui20 force-pushed the ct/compress-tracing branch 4 times, most recently from 56bdc36 to 414149e on April 13, 2026 20:58

connortsui20 marked this pull request as ready for review April 13, 2026 21:24

connortsui20 marked this pull request as draft April 14, 2026 01:40

robert3005 reviewed Apr 14, 2026

claude and others added 4 commits April 14, 2026 11:20

add json output

2d77ca8

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

clean up

43e333a

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

connortsui20 force-pushed the ct/compress-tracing branch from 414149e to 43e333a on April 14, 2026 17:28

connortsui20 wants to merge 4 commits into develop from ct/compress-tracing