Shorter blocks and consistent file #112

Melirius · 2024-11-14T15:20:56Z

The Lepton format allows blocks of the standard length (65536 bytes) to be written with a 1-byte header instead of a 3-byte one, which makes the file shorter. In addition, the current implementation writes encoded blocks to the file in the order in which they arrive from the worker threads, so the output file differs between runs.
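
As a rough illustration of the header saving, the sketch below writes either a 1-byte or a 3-byte block header. The flag value `STANDARD_LEN_FLAG`, the function name `write_block_header`, and the exact bit layout are assumptions made for this example; they are not taken from the Lepton specification or from this PR.

```rust
/// Standard block length named in the PR description.
const STANDARD_BLOCK_LEN: usize = 65536;
/// Hypothetical flag bits meaning "standard length, no explicit size follows".
const STANDARD_LEN_FLAG: u8 = 0x30;

/// Appends a block header to `out` and returns how many bytes it used.
fn write_block_header(out: &mut Vec<u8>, thread_id: u8, len: usize) -> usize {
    assert!(thread_id < 16, "thread id must fit in the low nibble");
    assert!((1..=STANDARD_BLOCK_LEN).contains(&len), "block length out of range");
    if len == STANDARD_BLOCK_LEN {
        // Short form: the length is implied by the flag bits.
        out.push(STANDARD_LEN_FLAG | thread_id);
        1
    } else {
        // Long form: thread id followed by (len - 1) as a little-endian u16.
        let stored = (len - 1) as u16;
        out.push(thread_id);
        out.extend_from_slice(&stored.to_le_bytes());
        3
    }
}
```

Under these assumptions every standard-length block costs 2 bytes less than it would with an explicit length.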

This PR makes the output file shorter and fixes the order of blocks in it by carouseling, i.e. cycling through the threads in a fixed order. Performance improves by ~3 % because the buffer size check after every bit pushed to the stream is no longer needed.
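
A minimal sketch of the carouseling idea, assuming the workers report their blocks over a channel: the writer waits for the thread whose turn it is, so the output order depends only on each thread's own block sequence and not on arrival order. `Msg`, `carousel` and the channel set-up are hypothetical names for this example and are not the code in `src/structs/multiplexer.rs`.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Receiver;

/// What a worker sends to the writer (hypothetical message type).
#[derive(Debug)]
pub enum Msg {
    Block(usize, Vec<u8>), // (thread id, encoded block)
    Done(usize),           // thread id has no more blocks
}

/// Collects blocks in round-robin order, starting from thread 0.
pub fn carousel(rx: Receiver<Msg>, num_threads: usize) -> Vec<(usize, Vec<u8>)> {
    let mut queues: Vec<VecDeque<Vec<u8>>> = vec![VecDeque::new(); num_threads];
    let mut done = vec![false; num_threads];
    let mut retired = vec![false; num_threads];
    let mut threads_left = num_threads;
    let mut curr = 0;
    let mut out = Vec::new();

    while threads_left > 0 {
        if retired[curr] {
            curr = (curr + 1) % num_threads;
        } else if let Some(block) = queues[curr].pop_front() {
            // It is this thread's turn and it has a block ready.
            out.push((curr, block));
            curr = (curr + 1) % num_threads;
        } else if done[curr] {
            // Queue drained and the thread has finished: drop it from the carousel.
            retired[curr] = true;
            threads_left -= 1;
            curr = (curr + 1) % num_threads;
        } else {
            // Wait for more input instead of skipping ahead, which keeps the order fixed.
            match rx.recv() {
                Ok(Msg::Block(t, b)) => queues[t].push_back(b),
                Ok(Msg::Done(t)) => done[t] = true,
                // All senders are gone: treat every thread as finished.
                Err(_) => done.iter_mut().for_each(|d| *d = true),
            }
        }
    }
    out
}
```

The trade-off in this sketch is that a fast thread's blocks may sit in memory while the writer waits for a slower thread's turn.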

Current best of 3 runs:

2024-11-14T17:00:03.681Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-14T17:00:04.152Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-14T17:00:05.668Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 10212ms of CPU time in 1514ms of wall time
2024-11-14T17:00:05.668Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-14T17:00:07.276Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324596 bytes (compression = 28.0%)
2024-11-14T17:00:07.276Z INFO  [lepton_jpeg_util] Main thread CPU: 3595ms, Worker thread CPU: 21306 ms, walltime: 3595 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       855 564 085      cache-references                                                        (42,07%)
        86 316 838      cache-misses                     #   10,09% of all cache refs           (42,21%)
    16 099 617 774      cycles                                                                  (42,31%)
       806 699 300      ic_fetch_stall.ic_stall_back_pressure                                        (42,10%)
     1 061 388 071      stalled-cycles-frontend          #    6,59% frontend cycles idle        (41,84%)
    40 125 811 710      instructions                     #    2,49  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,72%)
     4 707 746 288      branch-instructions                                                     (41,69%)
       165 544 725      branch-misses                    #    3,52% of all branches             (41,64%)
     5 309 260 356      ic_fetch_stall.ic_stall_any                                             (41,56%)
        43 294 231      ic_fetch_stall.ic_stall_dq_empty                                        (41,80%)
        73 167 746      l2_cache_misses_from_ic_miss                                            (42,11%)
     2 188 259 483      l2_latency.l2_cycles_waiting_on_fills                                        (42,11%)
           183 928      faults                                                                
                 1      migrations                                                            

       3,628985546 seconds time elapsed

       3,289304000 seconds user
       0,336724000 seconds sys

This PR c7e0f7c best of 3 runs:

2024-11-14T21:13:33.559Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-14T21:13:34.039Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-14T21:13:35.420Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 9299ms of CPU time in 1379ms of wall time
2024-11-14T21:13:35.420Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-14T21:13:37.018Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324076 bytes (compression = 28.0%)
2024-11-14T21:13:37.018Z INFO  [lepton_jpeg_util] Main thread CPU: 3458ms, Worker thread CPU: 20273 ms, walltime: 3458 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       834 601 112      cache-references                                                        (42,12%)
        64 932 273      cache-misses                     #    7,78% of all cache refs           (42,04%)
    15 586 303 408      cycles                                                                  (42,04%)
       838 269 598      ic_fetch_stall.ic_stall_back_pressure                                        (41,98%)
     1 061 874 437      stalled-cycles-frontend          #    6,81% frontend cycles idle        (41,79%)
    38 330 557 068      instructions                     #    2,46  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,68%)
     4 353 138 372      branch-instructions                                                     (41,68%)
       163 333 881      branch-misses                    #    3,75% of all branches             (41,72%)
     5 226 409 949      ic_fetch_stall.ic_stall_any                                             (41,86%)
        39 792 649      ic_fetch_stall.ic_stall_dq_empty                                        (41,92%)
        53 084 247      l2_cache_misses_from_ic_miss                                            (41,96%)
     2 066 007 058      l2_latency.l2_cycles_waiting_on_fills                                        (42,16%)
           183 967      faults                                                                
                 1      migrations                                                            

       3,492682793 seconds time elapsed

       3,175743000 seconds user
       0,313777000 seconds sys
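
For reference, the two runs above can be compared with a few lines of arithmetic. The figures are copied from the logs and `perf stat` outputs above; the helper names are made up for this sketch.

```rust
// Figures copied from the two runs above as (current main, this PR).
const CYCLES: (f64, f64) = (16_099_617_774.0, 15_586_303_408.0);
const INSTRUCTIONS: (f64, f64) = (40_125_811_710.0, 38_330_557_068.0);
const ELAPSED_S: (f64, f64) = (3.628985546, 3.492682793);
const OUTPUT_BYTES: (u64, u64) = (17_324_596, 17_324_076);

/// Relative reduction from `before` to `after`, in percent.
fn reduction_percent((before, after): (f64, f64)) -> f64 {
    (before - after) / before * 100.0
}

/// Absolute difference in output size, in bytes.
fn bytes_saved((before, after): (u64, u64)) -> u64 {
    before - after
}
```

With these inputs the PR uses roughly 3.2 % fewer cycles, 4.5 % fewer instructions and 3.8 % less elapsed time, and the output is 520 bytes smaller. If the whole size difference comes from 3-byte headers becoming 1-byte headers, that corresponds to 260 standard-length blocks.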

Makes the Lepton file consistent between runs and helps when decoding files transferred over slow channels.

Carouseling effectively mixes blocks, so the old mixing mechanism, which checked the length after each bit pushed to the stream, can be removed: ~3 % faster.
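
To make the difference concrete, here is a hedged sketch of moving the length check out of the hot path: bytes are appended without any check, and complete blocks are cut off only at a coarser step (for example once per row). `PendingOutput`, `flush_full_blocks` and `BLOCK_LEN` are illustrative names and do not reproduce the PR's writer.

```rust
const BLOCK_LEN: usize = 65536;

/// Bytes produced by one worker, waiting to be cut into blocks.
struct PendingOutput {
    buf: Vec<u8>,
}

impl PendingOutput {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Hot path: no length check here.
    fn push_byte(&mut self, b: u8) {
        self.buf.push(b);
    }

    /// Coarse step: hands every complete block to `emit` and keeps the remainder.
    fn flush_full_blocks(&mut self, mut emit: impl FnMut(&[u8])) {
        let full = self.buf.len() / BLOCK_LEN * BLOCK_LEN;
        for block in self.buf[..full].chunks(BLOCK_LEN) {
            emit(block);
        }
        self.buf.drain(..full);
    }

    /// End of stream: emits what is left as one final short block, if any.
    fn finish(mut self, mut emit: impl FnMut(&[u8])) {
        self.flush_full_blocks(&mut emit);
        if !self.buf.is_empty() {
            emit(&self.buf);
        }
    }
}
```

In this sketch every block except the last has the standard length, and an empty remainder produces no block at all.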

mcroomp

Better to separate out the multiplexer changes and work on the bitwriter changes separately.

mcroomp · 2024-11-14T15:38:28Z

src/structs/multiplexer.rs

+            }
+
+            let mut curr_write_thread: usize = 0;
+            while threads_left > 0 {


if let Some(a) = packets[curr_write_thread].pop_front()

mcroomp · 2024-11-14T19:44:47Z

src/structs/vpx_bool_writer.rs

        } else {
            *tmp_range = split;
-


I have a cleaner version of this in the cabac project
https://github.com/mcroomp/cabac/blob/77be08aadf92b8fc6d9cd37d17cec0fcef55605b/src/vp8.rs#L366

completely eliminates adjusting the carry in the buffer

Nice, I'll apply this thing.

Will be applied in the next PR on bit writer.

Ok if you could remove the bool writer changes here and I can approve. Thanks!

Removed - only the changes to buffer flushing remain.

The buffer flushing is unnecessary with the new bool_writer changes since it doesn't require a separate buffer anymore at all.

Yes, but this will be in the next PR :)

Melirius · 2024-11-14T21:53:50Z

Better to separate out the multiplexer changes and work on the bitwriter changes separately.

Separate into 2 PRs? Or leave like this?

mcroomp · 2024-11-15T05:50:57Z

Better to separate out the multiplexer changes and work on the bitwriter changes separately.

Separate into 2 PRs? Or leave like this?

I think in 2 PRs would be better since they aren't really related. Trying to be extra careful with the write codepath since corruptions on the write side are much more dangerous. Thanks!
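
One generic guard against write-side corruption is the round trip that the log lines above refer to as "decompressing to verify contents". The sketch below shows that pattern in isolation; `compress_verified` and its two closure parameters are placeholders and are not the `lepton_jpeg` API.

```rust
/// Encodes `input`, decodes the result again, and returns the encoded bytes
/// only if the round trip reproduces the original input.
fn compress_verified<C, D>(input: &[u8], compress: C, decompress: D) -> Result<Vec<u8>, String>
where
    C: Fn(&[u8]) -> Vec<u8>,
    D: Fn(&[u8]) -> Option<Vec<u8>>,
{
    let encoded = compress(input);
    match decompress(&encoded) {
        Some(decoded) if decoded == input => Ok(encoded),
        Some(_) => Err("round trip produced different bytes".to_string()),
        None => Err("encoded data could not be decoded".to_string()),
    }
}
```

A check like this catches a broken writer before the original data is discarded, at the cost of one extra decode.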

Melirius · 2024-11-15T11:37:32Z

This version (f655f27) is slower, but still faster than the main branch:

2024-11-15T11:35:16.838Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-15T11:35:17.308Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-15T11:35:18.756Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 9592ms of CPU time in 1447ms of wall time
2024-11-15T11:35:18.756Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-15T11:35:20.364Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324076 bytes (compression = 28.0%)
2024-11-15T11:35:20.364Z INFO  [lepton_jpeg_util] Main thread CPU: 3526ms, Worker thread CPU: 20654 ms, walltime: 3526 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       867 203 526      cache-references                                                        (42,48%)
        84 794 880      cache-misses                     #    9,78% of all cache refs           (42,32%)
    15 852 334 145      cycles                                                                  (42,23%)
     1 013 465 354      ic_fetch_stall.ic_stall_back_pressure                                        (42,01%)
     1 069 301 834      stalled-cycles-frontend          #    6,75% frontend cycles idle        (41,84%)
    39 172 164 846      instructions                     #    2,47  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,67%)
     4 396 742 683      branch-instructions                                                     (41,58%)
       165 106 663      branch-misses                    #    3,76% of all branches             (41,46%)
     5 327 122 036      ic_fetch_stall.ic_stall_any                                             (41,54%)
        38 628 152      ic_fetch_stall.ic_stall_dq_empty                                        (41,71%)
        71 604 981      l2_cache_misses_from_ic_miss                                            (41,92%)
     2 071 017 680      l2_latency.l2_cycles_waiting_on_fills                                        (42,27%)
           183 950      faults                                                                
                 1      migrations                                                            

       3,561160101 seconds time elapsed

       3,216183000 seconds user
       0,342806000 seconds sys

I'll prepare a separate bit writer PR after this one is merged.

Melirius added 2 commits November 14, 2024 15:21

Shorter blocks

9d9e746

Carousseling of different thread's blocks into Lepton file

4a228db

Makes Lepton file consistent between runs and helps for decoding of files transferred over slow channels

Melirius requested a review from mcroomp November 14, 2024 15:20

Melirius added 3 commits November 14, 2024 16:26

Carouseling from 0

2f8b978

Optimization of writer

81a8f23

Carouseling effectively mixes blocks, so we can exclude old mixing mechanism that checks length after each stream bit pushing - ~3 % faster

Assert and some comments

5d0671f

mcroomp reviewed Nov 14, 2024


Melirius added 2 commits November 14, 2024 22:24

Faster writing during decoding - flush_non_final_data each row

66b6964

Apply review round 1

c7e0f7c

Melirius added 2 commits November 15, 2024 12:33

Revew round 1 - revert bool writer changes

909d9b3

Removed flush_non_final_data chack after each bit

f655f27

Fix for 0-sized buffer

9bce3bc

mcroomp approved these changes Nov 15, 2024


mcroomp merged commit 82d6547 into main Nov 15, 2024
3 checks passed

mcroomp deleted the Shorter-blocks branch November 15, 2024 18:24
