Shorter blocks and consistent file #112

Merged
merged 10 commits into from
Nov 15, 2024

@Melirius Melirius commented Nov 14, 2024

The Lepton format allows blocks of the standard length (65536 bytes) to be written with a 1-byte header instead of a 3-byte one, which makes the file shorter. In addition, the current implementation writes encoded blocks to the file in the order in which they arrive from the worker threads, so the output file differs across runs.

This PR makes the output file shorter and fixes the order of blocks in it by carouseling, i.e. cycling through the threads in a fixed order. Performance improves by ~3% because the buffer size check on every bit pushed to the stream is omitted.
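
The carouseling idea can be sketched as follows. This is a minimal illustration, assuming each worker thread's encoded blocks have already been collected into a `VecDeque`; the names `carousel`, `header_len`, and `STANDARD_BLOCK_LEN` are hypothetical and not taken from the lepton_jpeg code. It models only what the description states (header sizes and a fixed visiting order), not the actual header bit layout or how the real writer waits for blocks that are still being produced.

```rust
use std::collections::VecDeque;

/// Standard block length named in the PR description.
const STANDARD_BLOCK_LEN: usize = 65536;

/// Header size in bytes as stated in the PR description:
/// 1 byte for a standard-length block, 3 bytes otherwise.
/// Assumption: the real bit layout of the header is not modelled here.
fn header_len(block_len: usize) -> usize {
    if block_len == STANDARD_BLOCK_LEN {
        1
    } else {
        3
    }
}

/// Drain the per-thread queues in a fixed round-robin order and return
/// `(thread_index, block)` pairs in the order they would be written.
fn carousel(mut packets: Vec<VecDeque<Vec<u8>>>) -> Vec<(usize, Vec<u8>)> {
    let mut out = Vec::new();
    // Only queues that still hold blocks keep the loop alive.
    let mut threads_left = packets.iter().filter(|q| !q.is_empty()).count();
    let mut curr_write_thread: usize = 0;

    while threads_left > 0 {
        if let Some(block) = packets[curr_write_thread].pop_front() {
            if packets[curr_write_thread].is_empty() {
                threads_left -= 1;
            }
            out.push((curr_write_thread, block));
        }
        // Move on to the next thread whether or not this one had a block.
        curr_write_thread = (curr_write_thread + 1) % packets.len();
    }
    out
}
```

Because the visiting order depends only on the thread index and the queue contents, the same input queues always yield the same output order, regardless of which thread finished first.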

Current main branch, best of 3 runs:

2024-11-14T17:00:03.681Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-14T17:00:04.152Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-14T17:00:05.668Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 10212ms of CPU time in 1514ms of wall time
2024-11-14T17:00:05.668Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-14T17:00:07.276Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324596 bytes (compression = 28.0%)
2024-11-14T17:00:07.276Z INFO  [lepton_jpeg_util] Main thread CPU: 3595ms, Worker thread CPU: 21306 ms, walltime: 3595 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       855 564 085      cache-references                                                        (42,07%)
        86 316 838      cache-misses                     #   10,09% of all cache refs           (42,21%)
    16 099 617 774      cycles                                                                  (42,31%)
       806 699 300      ic_fetch_stall.ic_stall_back_pressure                                        (42,10%)
     1 061 388 071      stalled-cycles-frontend          #    6,59% frontend cycles idle        (41,84%)
    40 125 811 710      instructions                     #    2,49  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,72%)
     4 707 746 288      branch-instructions                                                     (41,69%)
       165 544 725      branch-misses                    #    3,52% of all branches             (41,64%)
     5 309 260 356      ic_fetch_stall.ic_stall_any                                             (41,56%)
        43 294 231      ic_fetch_stall.ic_stall_dq_empty                                        (41,80%)
        73 167 746      l2_cache_misses_from_ic_miss                                            (42,11%)
     2 188 259 483      l2_latency.l2_cycles_waiting_on_fills                                        (42,11%)
           183 928      faults                                                                
                 1      migrations                                                            

       3,628985546 seconds time elapsed

       3,289304000 seconds user
       0,336724000 seconds sys

This PR (c7e0f7c), best of 3 runs:

2024-11-14T21:13:33.559Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-14T21:13:34.039Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-14T21:13:35.420Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 9299ms of CPU time in 1379ms of wall time
2024-11-14T21:13:35.420Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-14T21:13:37.018Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324076 bytes (compression = 28.0%)
2024-11-14T21:13:37.018Z INFO  [lepton_jpeg_util] Main thread CPU: 3458ms, Worker thread CPU: 20273 ms, walltime: 3458 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       834 601 112      cache-references                                                        (42,12%)
        64 932 273      cache-misses                     #    7,78% of all cache refs           (42,04%)
    15 586 303 408      cycles                                                                  (42,04%)
       838 269 598      ic_fetch_stall.ic_stall_back_pressure                                        (41,98%)
     1 061 874 437      stalled-cycles-frontend          #    6,81% frontend cycles idle        (41,79%)
    38 330 557 068      instructions                     #    2,46  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,68%)
     4 353 138 372      branch-instructions                                                     (41,68%)
       163 333 881      branch-misses                    #    3,75% of all branches             (41,72%)
     5 226 409 949      ic_fetch_stall.ic_stall_any                                             (41,86%)
        39 792 649      ic_fetch_stall.ic_stall_dq_empty                                        (41,92%)
        53 084 247      l2_cache_misses_from_ic_miss                                            (41,96%)
     2 066 007 058      l2_latency.l2_cycles_waiting_on_fills                                        (42,16%)
           183 967      faults                                                                
                 1      migrations                                                            

       3,492682793 seconds time elapsed

       3,175743000 seconds user
       0,313777000 seconds sys

Makes the Lepton file consistent between runs and helps when decoding files transferred over slow channels
Carouseling effectively mixes the blocks, so the old mixing mechanism, which checked the length after every bit pushed to the stream, can be removed (~3% faster)
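
As a rough check on the figures above, the relative changes between the two runs can be computed from the logged values. The sketch below uses only numbers copied from the two logs (main branch and this PR); the helper names `reduction_percent` and `summarize` are made up for illustration.

```rust
/// Percentage by which `after` is smaller than `before`.
fn reduction_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

// Figures copied from the two logs above (main branch vs. this PR).
const ELAPSED_MAIN_S: f64 = 3.628985546;
const ELAPSED_PR_S: f64 = 3.492682793;
const CYCLES_MAIN: f64 = 16_099_617_774.0;
const CYCLES_PR: f64 = 15_586_303_408.0;
const OUTPUT_MAIN_BYTES: u64 = 17_324_596;
const OUTPUT_PR_BYTES: u64 = 17_324_076;

/// Returns (elapsed-time reduction in %, cycle reduction in %, bytes saved).
fn summarize() -> (f64, f64, u64) {
    (
        reduction_percent(ELAPSED_MAIN_S, ELAPSED_PR_S),
        reduction_percent(CYCLES_MAIN, CYCLES_PR),
        OUTPUT_MAIN_BYTES - OUTPUT_PR_BYTES,
    )
}
```

With these inputs the elapsed time drops by roughly 3.8% and the cycle count by roughly 3.2%, in line with the ~3% quoted in the description, and the output shrinks by 520 bytes. If every saved byte came from a 3-byte header becoming a 1-byte header, that would correspond to 260 full-size blocks; this is an inference, not something the logs report.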

Collaborator

@mcroomp mcroomp left a comment

Better to separate out the multiplexer changes and work on the bitwriter changes separately.

}

let mut curr_write_thread: usize = 0;
while threads_left > 0 {
Collaborator

if let Some(a) = packets[curr_write_thread].pop_front()

Collaborator Author

Applied.

} else {
*tmp_range = split;

Collaborator

I have a cleaner version of this in the cabac project:
https://github.com/mcroomp/cabac/blob/77be08aadf92b8fc6d9cd37d17cec0fcef55605b/src/vp8.rs#L366

It completely eliminates adjusting the carry in the buffer.

Collaborator Author

Nice, I'll apply this thing.

Collaborator Author

Will be applied in the next PR on bit writer.

Collaborator

OK, if you could remove the bool writer changes here, I can approve. Thanks!

Collaborator Author

Removed; only the changes to buffer flushing remain.

Collaborator

The buffer flushing is unnecessary with the new bool_writer changes since it doesn't require a separate buffer anymore at all.

Collaborator Author

Yes, but this will be in the next PR :)

@Melirius
Collaborator Author

> Better to separate out the multiplexer changes and work on the bitwriter changes separately.

Separate into 2 PRs? Or leave like this?

@mcroomp
Collaborator

mcroomp commented Nov 15, 2024

> > Better to separate out the multiplexer changes and work on the bitwriter changes separately.
>
> Separate into 2 PRs? Or leave like this?

I think 2 PRs would be better, since the changes aren't really related. I'm trying to be extra careful with the write codepath, since corruption on the write side is much more dangerous. Thanks!

@Melirius
Collaborator Author

Melirius commented Nov 15, 2024

This version (f655f27) is slower than the previous one but still faster than the main branch:

2024-11-15T11:35:16.838Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-15T11:35:17.308Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-15T11:35:18.756Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 9592ms of CPU time in 1447ms of wall time
2024-11-15T11:35:18.756Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-15T11:35:20.364Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324076 bytes (compression = 28.0%)
2024-11-15T11:35:20.364Z INFO  [lepton_jpeg_util] Main thread CPU: 3526ms, Worker thread CPU: 20654 ms, walltime: 3526 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       867 203 526      cache-references                                                        (42,48%)
        84 794 880      cache-misses                     #    9,78% of all cache refs           (42,32%)
    15 852 334 145      cycles                                                                  (42,23%)
     1 013 465 354      ic_fetch_stall.ic_stall_back_pressure                                        (42,01%)
     1 069 301 834      stalled-cycles-frontend          #    6,75% frontend cycles idle        (41,84%)
    39 172 164 846      instructions                     #    2,47  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,67%)
     4 396 742 683      branch-instructions                                                     (41,58%)
       165 106 663      branch-misses                    #    3,76% of all branches             (41,46%)
     5 327 122 036      ic_fetch_stall.ic_stall_any                                             (41,54%)
        38 628 152      ic_fetch_stall.ic_stall_dq_empty                                        (41,71%)
        71 604 981      l2_cache_misses_from_ic_miss                                            (41,92%)
     2 071 017 680      l2_latency.l2_cycles_waiting_on_fills                                        (42,27%)
           183 950      faults                                                                
                 1      migrations                                                            

       3,561160101 seconds time elapsed

       3,216183000 seconds user
       0,342806000 seconds sys

I'll prepare a separate bit writer PR after this one is merged.

@mcroomp mcroomp merged commit 82d6547 into main Nov 15, 2024
3 checks passed
@mcroomp mcroomp deleted the Shorter-blocks branch November 15, 2024 18:24