Sampling somehow compresses into 0 bytes #7268
Description
We have this code in our sampling compressor (which has existed for a long time, even before the most recent changes):
let after = scheme
    .compress(compressor, &mut sample_data, sample_ctx)?
    .nbytes();
let before = sample_data.array().nbytes();
let ratio = before as f64 / after as f64;
tracing::debug!("estimate_compression_ratio_with_sampling(compressor={scheme:#?}) = {ratio}",);
Ok(ratio)

If I add some checks:
let mut sample_data = ArrayAndStats::new(sample_array, scheme.stats_options());
let cascade_history = ctx.cascade_history().to_vec();
let sample_ctx = ctx.with_sampling();
let before = sample_data.array().nbytes();
let after = scheme
    .compress(compressor, &mut sample_data, sample_ctx)?
    .nbytes();
if after == 0 {
    tracing::warn!(
        scheme = %scheme.id(),
        ?cascade_history,
        "sample compressed to 0 bytes, which should only happen for constant arrays",
    );
}

we get hundreds of warnings saying that bitpacking compresses to 0 bytes. This also means that the ratio ends up being infinity, which we interpret as invalid.
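To make the infinity concrete, here is a minimal standalone sketch of the division in isolation. Both `naive_ratio` and `checked_ratio` are hypothetical names for illustration and are not part of the vortex codebase. The first mirrors the `before as f64 / after as f64` expression above. The second shows one possible way to surface the degenerate case explicitly rather than letting it propagate as a float.

```rust
/// Hypothetical stand-in for the ratio expression in the estimator:
/// plain floating-point division with no guard.
fn naive_ratio(before: usize, after: usize) -> f64 {
    // IEEE 754: a positive value divided by 0.0 is +inf, and 0.0 / 0.0 is NaN.
    before as f64 / after as f64
}

/// Hypothetical guarded variant: returns None when the compressed size is
/// zero, so the caller has to decide what a 0-byte result means.
fn checked_ratio(before: usize, after: usize) -> Option<f64> {
    if after == 0 {
        return None;
    }
    Some(before as f64 / after as f64)
}
```

A guard like this would only make the symptom explicit. It does not explain why bitpacking reports 0 bytes for the sample in the first place.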
Here are some examples. These lists are in order of descent, so the first in the list is the parent scheme.
WARN vortex_compressor::estimate: vortex-compressor/src/estimate.rs:129: sample compressed to 0 bytes, which should only happen for constant arrays scheme=vortex.int.bitpacking
cascade_history=[
(SchemeId { name: "vortex.string.dict" }, 0),
(SchemeId { name: "vortex.string.fsst" }, 0),
(SchemeId { name: "vortex.int.dict" }, 1)]
That is: utf8 encoded as dict, with the values as fsst, the fsst lengths as dict, and the codes bitpacked.
WARN vortex_compressor::estimate: vortex-compressor/src/estimate.rs:129: sample compressed to 0 bytes, which should only happen for constant arrays scheme=vortex.int.bitpacking
cascade_history=[
(SchemeId { name: "vortex.string.fsst" }, 0),
(SchemeId { name: "vortex.int.rle" }, 1)]
That is: utf8 encoded as fsst, with the lengths as rle, and the rle indices bitpacked.