
[WIP] Minor compression ratio improvements for high compression modes on small data #2781

Draft · wants to merge 7 commits into base: dev
Conversation

@Cyan4973 (Contributor) commented on Sep 16, 2021:

This PR bundles 3 modifications for high compression modes, targeting higher compression ratio on "small" (one-block) data. The benefits may extend beyond the first block, but they vanish rapidly (and so does the cost).

  1. Replace the "default" initial statistics with a first pass using the greedy algorithm to collect initial statistics for btultra. This step is globally positive, though benefits vary widely depending on the file. It comes at a speed cost of around -10% on one-block data (the cost rapidly becomes insignificant for large inputs, like silesia.tar). For this reason, the new method is not used for btopt.
  2. Change the way statistics are transferred from the initial run. This modification impacts btultra2, which uses btultra as an initializer. The impact is small but consistently positive, and there is no significant cost.
  3. Chain greedy -> btultra -> btultra2 initialization. The impact is globally positive, but not always: there are wide variations depending on file type and input size. For this reason, the proposed heuristic enables the new method only for inputs >= 8 KB, as the impact is rather positive for larger blocks and rather negative for small ones. The speed cost, while not null, is not significant.

1. Using greedy to collect initial statistics for btultra

This step is globally beneficial, by non-negligible amounts. But it also carries a sizable speed cost, so the trade-off is debatable. I'm rather in favor of it, since speed is less of a concern at btultra levels, but the cost also explains why this technique is not extended to the faster variant btopt.

For benchmarking, I'm using multiple block sizes in order to observe the impact of initialization. The baseline is the previous PR #2771, which updated the initialization of btultra with slightly altered starting values, producing better compression ratios on small data.

2 KB blocks, btultra, level 14
| file | dev (bytes) | ultra_init (bytes) | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 14#mr | 4185078 | 4209573 | 24495 | -0.59% | 19.1 | 17 | -10.99% |
| 14#webster | 18304250 | 18261816 | -42434 | 0.23% | 10.8 | 10.19 | -5.65% |
| 14#nci | 5543695 | 5529530 | -14165 | 0.26% | 5.77 | 5.96 | 3.29% |
| 14#reymont | 2721845 | 2716265 | -5580 | 0.21% | 10.5 | 9.66 | -8.00% |
| 14#osdb | 7766458 | 7760004 | -6454 | 0.08% | 19.2 | 16.4 | -14.58% |
| 14#x-ray | 6505968 | 6463068 | -42900 | 0.66% | 22 | 19 | -13.64% |
| 14#mozilla | 22692372 | 22690769 | -1603 | 0.01% | 11 | 10.5 | -4.55% |
| 14#dickens | 5322380 | 5314059 | -8321 | 0.16% | 14.1 | 12.3 | -12.77% |
| 14#xml | 1565521 | 1565338 | -183 | 0.01% | 8.89 | 8.75 | -1.57% |
| 14#sao | 6073600 | 6070141 | -3459 | 0.06% | 19.8 | 16.5 | -16.67% |
| 14#ooffice | 3712546 | 3712520 | -26 | 0.00% | 13.9 | 12.5 | -10.07% |
| 14#samba | 8169234 | 8165127 | -4107 | 0.05% | 11.7 | 11.3 | -3.42% |
| total | 92562947 | 92458210 | -104737 | 0.11% | | | -8.22% |

At 2 KB block size, the impact is small but globally positive, saving ~100 KB.
But it comes at a cost: speed drops by -8.2% on average, due to the heavier initialization process using greedy instead of "blind" default values. The initialization cost varies very widely depending on the exact file. I'm unable to explain some of the most positive changes, such as the speed improvement for nci (+3.3%), but it was tested multiple times to confirm the timing, and all measurements are consistently accurate within ~0.2%, so it's not a fluke.

The best impact is achieved on x-ray (ratio +0.66%), but there is one negative impact, on mr, which is pretty large (ratio -0.59%). I haven't yet been able to find a clear explanation for this case.

Follow-up investigation: the large negative impact on mr seems strongly correlated with the difference in literals initialization. With the "default" literals initialization "from source", the resulting literals statistics are a lot flatter than when using the outcome of greedy (which seems to leave a lot of literals). For some reason, flatter literals statistics seem better for mr. It's unclear whether this is a general conclusion or really a specificity of mr.

One possible explanation is that greedy leaves too many literals, producing more skewed statistics that favor the most frequent literals, making them cheaper and hence preferable. These frequent literals might be over-represented, possibly because greedy is unable to find 3-byte matches and thus eliminate the corresponding literals. This impact is of course very variable depending on the file. It's also possible that in other cases the literals probabilities produced by greedy are accurate enough, maybe because there aren't many 3-byte matches worth selecting, and trying to artificially distort these statistics, typically by flattening them, ends up hurting compression ratio (ex: nci).

16 KB blocks, btultra, level 14
| file | dev (bytes) | ultra_init (bytes) | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 14#mr | 3701738 | 3694493 | -7245 | 0.20% | 14.4 | 12.9 | -10.42% |
| 14#webster | 13689710 | 13681871 | -7839 | 0.06% | 7.56 | 7.29 | -3.57% |
| 14#nci | 3340424 | 3326544 | -13880 | 0.42% | 5.24 | 5.41 | 3.24% |
| 14#reymont | 2070974 | 2072650 | 1676 | -0.08% | 7.26 | 6.82 | -6.06% |
| 14#osdb | 5477403 | 5482835 | 5432 | -0.10% | 12.2 | 11.4 | -6.56% |
| 14#x-ray | 5952779 | 5914123 | -38656 | 0.65% | 19.7 | 16.7 | -15.23% |
| 14#mozilla | 18802426 | 18756948 | -45478 | 0.24% | 9.15 | 8.87 | -3.06% |
| 14#dickens | 4301141 | 4302009 | 868 | -0.02% | 9.55 | 8.64 | -9.53% |
| 14#xml | 849959 | 849524 | -435 | 0.05% | 7.44 | 7.4 | -0.54% |
| 14#sao | 5401683 | 5396945 | -4738 | 0.09% | 15.8 | 13.4 | -15.19% |
| 14#ooffice | 3132794 | 3123988 | -8806 | 0.28% | 11.5 | 10.6 | -7.83% |
| 14#samba | 6013161 | 6002210 | -10951 | 0.18% | 9.51 | 9.35 | -1.68% |
| total | 72734192 | 72604140 | -130052 | 0.18% | | | -6.37% |

At 16 KB block size, there are a few improvements: the compression benefit increases to 130 KB, now representing a 0.18% ratio improvement, and the speed cost improves slightly, to -6.4%.

The best case remains x-ray, with a ratio gain of +0.65%, but there are many changes in the other files. mozilla, which used to be neutral at 2K, is now a big positive contributor at 16K. mr, which was negative at 2K, is now positive at 16K. A few files which were slightly positive at 2K become slightly negative at 16K. So it's all over the place, and difficult to draw conclusions from, other than it's "globally more positive" at 16K.

nci continues to defy expectation by running faster despite a heavier initialization process.

128 KB blocks, btultra, level 17
| file | dev (bytes) | ultra_init (bytes) | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 17#mr | 3488777 | 3458279 | -30498 | 0.87% | 9.46 | 7.02 | -25.79% |
| 17#webster | 11451515 | 11451879 | 364 | 0.00% | 5.24 | 4.64 | -11.45% |
| 17#nci | 2381888 | 2375040 | -6848 | 0.29% | 3.08 | 3.27 | 6.17% |
| 17#reymont | 1704086 | 1708679 | 4593 | -0.27% | 4.8 | 3.83 | -20.21% |
| 17#osdb | 3717307 | 3725549 | 8242 | -0.22% | 7.02 | 6.46 | -7.98% |
| 17#x-ray | 5604611 | 5562016 | -42595 | 0.76% | 12.9 | 10.6 | -17.83% |
| 17#mozilla | 17205502 | 17098659 | -106843 | 0.62% | 6.41 | 6.13 | -4.37% |
| 17#dickens | 3644846 | 3644167 | -679 | 0.02% | 5.95 | 4.74 | -20.34% |
| 17#xml | 616638 | 615911 | -727 | 0.12% | 3.71 | 3.81 | 2.70% |
| 17#sao | 5142328 | 5144684 | 2356 | -0.05% | 10.03 | 7.62 | -24.03% |
| 17#ooffice | 2884506 | 2874385 | -10121 | 0.35% | 8.05 | 7.17 | -10.93% |
| 17#samba | 4912365 | 4888594 | -23771 | 0.48% | 5.63 | 5.46 | -3.02% |
| total | 62754369 | 62547842 | -206527 | 0.33% | | | -11.42% |

At 128 KB block size, the compression benefits increase even more, to > 200 KB, now representing a +0.33% ratio improvement, which is quite sizable for just a difference in statistics initialization.

The majority of the gains come from mozilla, mr and x-ray. mr is the new champion of compression ratio gains, at +0.87%. It's all the more remarkable as it was previously the only file in negative territory at 2K blocks. There are a few files which lose compression ratio (compared to default initialization), though it's never by large amounts. Once again, it's difficult to pinpoint a single clearly reproducible reason.

The average speed cost increases to -11.4%, with large differences depending on the exact file. The worst speed costs are on mr and sao, with drops of almost -25%, while nci and xml end up running faster despite the much heavier initialization.
It might be possible to reduce the speed cost by making the initialization faster, either by searching less or by analyzing only a portion of the input, but this would likely impact compression ratio, so it would have to be carefully monitored.

Follow-up investigation: limiting the statistics analysis to the first 64 KB of the first block substantially reduces the speed cost (by more than half). However, it also loses part of the compression ratio gain in the process (between 1/4 and 1/3).

2. btultra2: modify statistics collection

btultra2 runs the first block with btultra, then throws away the result but keeps the statistics, and restarts compression of the first block using these baseline statistics.
The second modification changes the statistics transmission logic, rebuilding the statistics from the seqStore instead of simply inheriting the ones naturally stored in the optPtr as a consequence of running btultra.

This results in a small, yet consistently positive, compression ratio gain for btultra2, at essentially no cost.
Note that, in this scenario, the first round of btultra still uses the default "blind" initialization (same as dev), not the greedy first pass mentioned in earlier part.

The benefit, though, is pretty small and barely worth mentioning (I'm not detailing benchmarks here), averaging a 0.01% ratio improvement. What makes it desirable is that it comes for free, and it is a necessary ingredient for stage 3 to work properly.

3. btultra2: initialize statistics from btultra, itself initialized by greedy

So that's more or less the intended end game of this exercise. Let's see if, with a first round using greedy, the statistics produced by btultra get better, resulting in a better final compression ratio for btultra2.

2K blocks, btultra2, level 19
| file | newInit (bytes) | fromGreedy (bytes) | diff | CR gain (%) |
| --- | --- | --- | --- | --- |
| 19#webster | 18258328 | 18249997 | -8331 | 0.05% |
| 19#mozilla | 22656227 | 22664242 | 8015 | -0.04% |
| 19#mr | 4162214 | 4184597 | 22383 | -0.54% |
| 19#dickens | 5307928 | 5309449 | 1521 | -0.03% |
| 19#x-ray | 6418504 | 6419605 | 1101 | -0.02% |
| 19#ooffice | 3705876 | 3706233 | 357 | -0.01% |
| 19#xml | 1563494 | 1564207 | 713 | -0.05% |
| 19#sao.zs | 4047944 | 4047920 | -24 | 0.00% |
| 19#samba | 8156414 | 8154485 | -1929 | 0.02% |
| 19#nci | 5506582 | 5502844 | -3738 | 0.07% |
| 19#sao | 6066564 | 6065637 | -927 | 0.02% |
| 19#reymont | 2713590 | 2712485 | -1105 | 0.04% |
| 19#osdb | 7757177 | 7755582 | -1595 | 0.02% |
| total | 96320842 | 96337283 | 16441 | -0.02% |

16K blocks, btultra2, level 19
| file | newInit (bytes) | fromGreedy (bytes) | diff | CR gain (%) |
| --- | --- | --- | --- | --- |
| 19#webster | 13651795 | 13659077 | 7282 | -0.05% |
| 19#mozilla | 18754305 | 18722807 | -31498 | 0.17% |
| 19#mr | 3666308 | 3666219 | -89 | 0.00% |
| 19#dickens | 4282713 | 4293499 | 10786 | -0.25% |
| 19#x-ray | 5895473 | 5891200 | -4273 | 0.07% |
| 19#ooffice | 3120827 | 3118902 | -1925 | 0.06% |
| 19#xml | 843970 | 844599 | 629 | -0.07% |
| 19#sao.zs | 4028682 | 4028694 | 12 | 0.00% |
| 19#samba | 6002451 | 5994430 | -8021 | 0.13% |
| 19#nci | 3264463 | 3259468 | -4995 | 0.15% |
| 19#sao | 5387448 | 5388448 | 1000 | -0.02% |
| 19#reymont | 2062941 | 2063663 | 722 | -0.03% |
| 19#osdb | 5473816 | 5474898 | 1082 | -0.02% |
| total | 76435192 | 76405904 | -29288 | 0.04% |

128K blocks, btultra2, level 19
| file | newInit (bytes) | fromGreedy (bytes) | diff | CR gain (%) |
| --- | --- | --- | --- | --- |
| 19#webster | 11419219 | 11425166 | 5947 | -0.05% |
| 19#mozilla | 17149792 | 17088589 | -61203 | 0.36% |
| 19#mr | 3452739 | 3445413 | -7326 | 0.21% |
| 19#dickens | 3632427 | 3637405 | 4978 | -0.14% |
| 19#x-ray | 5737386 | 5522651 | -214735 | 3.74% |
| 19#ooffice | 2877489 | 2874593 | -2896 | 0.10% |
| 19#xml | 614969 | 614973 | 4 | 0.00% |
| 19#sao.zs | 4025964 | 4027040 | 1076 | -0.03% |
| 19#samba | 4916583 | 4890852 | -25731 | 0.52% |
| 19#nci | 2365886 | 2360953 | -4933 | 0.21% |
| 19#sao | 5132210 | 5138098 | 5888 | -0.11% |
| 19#reymont | 1698111 | 1700230 | 2119 | -0.12% |
| 19#osdb | 3715581 | 3716201 | 620 | -0.02% |
| total | 66738356 | 66442164 | -296192 | 0.44% |

So that's interesting. For some reason, this technique doesn't work well for small blocks (2 KB). It does however tend to improve compression ratio for larger blocks, by a measurable margin, reaching an almost 300 KB gain at 128 KB blocks. That's a 0.44% ratio advantage, which is quite a lot for high compression modes.

For this reason, the logic proposed in this PR has a cutoff point at 8 KB block size: below that threshold, start with default statistics; above it, initialize with greedy.

Let's be clear though that this is not a universal gain: it depends on the file. In this sample set, the largest gain is achieved by x-ray at 128 KB, a staggering +3.7% compression ratio gain, which is almost oversized for a simple alteration of initial statistics. It's followed by samba (+0.52%) and mozilla (+0.36%), which are pretty respectable.
On the flip side, dickens loses -0.14% and reymont -0.12%. So this is not universal.
Results at 16 KB are much more tame, with the x-ray gain reduced to +0.07%, hence barely relevant, while the dickens loss increases to -0.25%. The 16 KB totals still show a compression ratio benefit, but only +0.04%, which is limited.

The oversized x-ray gains deserve an explanation. Sadly, I'm reduced to guesses right now. The file likely receives a correct hint from the initial statistics built with greedy, enabling the parser to find a better global path. A classical example would be a path that favors repcodes over another path consisting of larger yet more costly matches; this cheaper global path would essentially be rejected due to wrong cost estimation under the more standard (dev) initialization. The difference adds up over the full block.
However, with such an explanation, I would have expected some good results at 16 KB block size too. Why it needs a larger block size to really show up is unclear.

Follow-up investigation: it seems x-ray is "a medical picture of child's hand stored in 12-bit gray scaled image"; maybe it needs a certain size for multiple lines to be present, hence for repcodes to start making an impact (?). This is highly speculative though: the image is "large" (8 MB) and compresses poorly, so it's difficult to imagine +3.7% compression ratio gains just from repcodes targeting the previous line.

(Sidenote: any speed cost due to the greedy initialization is minimal and irrelevant at this compression level.)

Cyan4973 and others added 6 commits September 12, 2021 00:54

- improves compression ratio on first block, at a speed cost of ~-5%
- btultra2 uses btultra as seed for statistics
- then btultra uses greedy as seed for statistics; it turns out this gives worse results for btultra2 on small blocks
- differentiate small blocks from larger ones (boundary 8 KB); initialize btultra with new default stats on small blocks
@terrelln (Contributor) commented:
Is this PR ready for review now?

@Cyan4973 (Contributor, Author) replied:
Yes, it can be reviewed.

It's just not ready for "merge".

@Cyan4973 Cyan4973 changed the title Additional compression ratio improvements for high compression modes on small data [WIP] Minor compression ratio improvements for high compression modes on small data Jan 21, 2022
@Cyan4973 Cyan4973 marked this pull request as draft January 21, 2022 00:11