No count in bit writer #116

Merged
merged 3 commits into from
Nov 21, 2024

Collaborator

@Melirius Melirius commented Nov 16, 2024

This PR extends the internal low_value buffer of the bit writer and sends as many bytes as possible once the number of stream bits in low_value exceeds MAX_STREAM_BITS. This gives a substantial performance gain (in the runs below, worker-thread CPU time for compression drops from 9667 ms to 8782 ms) thanks to much better branch prediction on the if count >= 0 check, which is now very rarely true.
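
To illustrate the idea only, here is a minimal sketch of a plain bit writer that accumulates bits in a wide low_value and flushes whole bytes in bulk on a rarely taken branch. It is not the code from this PR: SketchBitWriter, put, flush_whole_bytes, finish and the threshold value 32 are assumptions made for the example, and the carry handling that a real arithmetic coder needs is left out.

```rust
/// Hypothetical threshold for this sketch; not the value used in lepton_jpeg.
const MAX_STREAM_BITS: u32 = 32;

/// Simplified bit writer (no arithmetic coding, no carry propagation).
struct SketchBitWriter {
    /// Pending bits, right-aligned; the most recent bit is the LSB.
    low_value: u64,
    /// Number of valid pending bits in `low_value`.
    bits: u32,
    /// Bytes already flushed to the output stream.
    out: Vec<u8>,
}

impl SketchBitWriter {
    fn new() -> Self {
        SketchBitWriter { low_value: 0, bits: 0, out: Vec::new() }
    }

    /// Append the lowest `n` bits of `value` (n <= 16).
    fn put(&mut self, value: u16, n: u32) {
        debug_assert!(n <= 16);
        let mask = (1u64 << n) - 1;
        self.low_value = (self.low_value << n) | (u64::from(value) & mask);
        self.bits += n;
        // Rarely taken branch: at most 32 + 16 = 48 bits are pending here,
        // so the 64-bit accumulator cannot overflow.
        if self.bits > MAX_STREAM_BITS {
            self.flush_whole_bytes();
        }
    }

    /// Write out as many complete bytes as are currently pending.
    fn flush_whole_bytes(&mut self) {
        while self.bits >= 8 {
            self.bits -= 8;
            self.out.push((self.low_value >> self.bits) as u8);
        }
        // Keep only the bits that have not been written yet.
        self.low_value &= (1u64 << self.bits) - 1;
    }

    /// Flush everything, zero-padding the last partial byte.
    fn finish(mut self) -> Vec<u8> {
        self.flush_whole_bytes();
        if self.bits > 0 {
            let pad = 8 - self.bits;
            self.out.push((self.low_value << pad) as u8);
        }
        self.out
    }
}
```

In this sketch a stream that contains a single bit finishes as one byte, since only the bytes that actually hold stream bits are written out.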

Current main 7ee30b6

2024-11-16T21:36:50.846Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-16T21:36:51.329Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-16T21:36:52.735Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 9667ms of CPU time in 1405ms of wall time
2024-11-16T21:36:52.735Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-16T21:36:54.336Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324076 bytes (compression = 28.0%)
2024-11-16T21:36:54.336Z INFO  [lepton_jpeg_util] Main thread CPU: 3489ms, Worker thread CPU: 20666 ms, walltime: 3489 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       872 864 510      cache-references                                                        (42,03%)
        81 345 085      cache-misses                     #    9,32% of all cache refs           (42,20%)
    15 679 683 044      cycles                                                                  (41,93%)
       844 619 245      ic_fetch_stall.ic_stall_back_pressure                                        (41,82%)
     1 127 999 657      stalled-cycles-frontend          #    7,19% frontend cycles idle        (41,57%)
    38 437 122 047      instructions                     #    2,45  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,59%)
     4 360 833 730      branch-instructions                                                     (41,79%)
       164 653 082      branch-misses                    #    3,78% of all branches             (41,85%)
     5 188 234 309      ic_fetch_stall.ic_stall_any                                             (42,05%)
        36 892 099      ic_fetch_stall.ic_stall_dq_empty                                        (42,17%)
        68 333 545      l2_cache_misses_from_ic_miss                                            (42,13%)
     2 117 822 829      l2_latency.l2_cycles_waiting_on_fills                                        (42,01%)
           184 004      faults                                                                
                 1      migrations                                                            

       3,523626815 seconds time elapsed

       3,201068000 seconds user
       0,320906000 seconds sys

This PR 0426f91

2024-11-16T21:38:12.721Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-16T21:38:13.198Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-16T21:38:14.474Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 8782ms of CPU time in 1274ms of wall time
2024-11-16T21:38:14.474Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-16T21:38:16.077Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2024-11-16T21:38:16.077Z INFO  [lepton_jpeg_util] Main thread CPU: 3356ms, Worker thread CPU: 19835 ms, walltime: 3356 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       841 950 286      cache-references                                                        (42,16%)
        74 634 666      cache-misses                     #    8,86% of all cache refs           (41,91%)
    15 098 618 381      cycles                                                                  (41,72%)
       915 407 820      ic_fetch_stall.ic_stall_back_pressure                                        (41,58%)
       986 670 263      stalled-cycles-frontend          #    6,53% frontend cycles idle        (41,57%)
    38 910 234 032      instructions                     #    2,58  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,53%)
     4 265 509 081      branch-instructions                                                     (41,82%)
       147 881 884      branch-misses                    #    3,47% of all branches             (42,01%)
     5 396 121 804      ic_fetch_stall.ic_stall_any                                             (42,07%)
        40 630 197      ic_fetch_stall.ic_stall_dq_empty                                        (42,20%)
        61 276 546      l2_cache_misses_from_ic_miss                                            (42,38%)
     2 181 284 237      l2_latency.l2_cycles_waiting_on_fills                                        (42,33%)
           184 036      faults                                                                
                 1      migrations                                                            

       3,391659429 seconds time elapsed

       3,064401000 seconds user
       0,323831000 seconds sys

@Melirius
Collaborator Author

Melirius commented Nov 16, 2024

With this variant of the buffer it is difficult to reproduce the behavior of the initial Lepton implementation on very small buffers: previously the minimal stream length was 4 bytes, whereas here it can be shorter, because only the real stream bytes are written. As a result, the hashes of the streams in two tests have changed.

I have checked that Dropbox Lepton can decode the files produced by this implementation.

@Melirius
Collaborator Author

Given that decoding the resulting Lepton file takes $7.3\times10^9$ cycles, encoding here takes $7.8\times10^9$ cycles, which is only about 7% slower.
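
As a quick check of the arithmetic, using only the two cycle counts quoted above:

$$\frac{7.8\times10^9}{7.3\times10^9} \approx 1.07$$

The two figures add up to $15.1\times10^9$, which matches the total of 15 098 618 381 cycles reported for the run of this PR. That suggests the encoding figure is the total minus the decoding cost, though this reading is an assumption and is not stated in the thread.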

@Melirius
Collaborator Author

Now it is even faster:

2024-11-17T09:12:57.266Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-17T09:12:57.747Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-17T09:12:58.993Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 8504ms of CPU time in 1246ms of wall time
2024-11-17T09:12:58.993Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-17T09:13:00.611Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2024-11-17T09:13:00.611Z INFO  [lepton_jpeg_util] Main thread CPU: 3344ms, Worker thread CPU: 19654 ms, walltime: 3344 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       860 171 569      cache-references                                                        (41,57%)
        79 969 138      cache-misses                     #    9,30% of all cache refs           (41,63%)
    15 030 765 275      cycles                                                                  (41,74%)
       565 084 589      ic_fetch_stall.ic_stall_back_pressure                                        (42,03%)
     1 035 372 456      stalled-cycles-frontend          #    6,89% frontend cycles idle        (42,18%)
    37 563 953 194      instructions                     #    2,50  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (42,13%)
     4 192 550 558      branch-instructions                                                     (42,19%)
       148 847 381      branch-misses                    #    3,55% of all branches             (42,17%)
     5 334 304 010      ic_fetch_stall.ic_stall_any                                             (42,11%)
        40 808 153      ic_fetch_stall.ic_stall_dq_empty                                        (41,96%)
        65 332 752      l2_cache_misses_from_ic_miss                                            (41,88%)
     2 049 967 756      l2_latency.l2_cycles_waiting_on_fills                                        (41,67%)
           184 019      faults                                                                
                 1      migrations                                                            

       3,380048890 seconds time elapsed

       3,042535000 seconds user
       0,335838000 seconds sys

@mcroomp
Collaborator

mcroomp commented Nov 18, 2024

That's great! I'll need to test with the corpus to make sure that all the files are decodable; that will take a couple of days.

@Melirius
Collaborator Author

Using the same scheme as in the reader, which makes the checks rarer, I got it down even further:

2024-11-19T21:00:57.665Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2024-11-19T21:00:58.129Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2024-11-19T21:00:59.328Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 8281ms of CPU time in 1198ms of wall time
2024-11-19T21:00:59.328Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2024-11-19T21:01:00.907Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2024-11-19T21:01:00.907Z INFO  [lepton_jpeg_util] Main thread CPU: 3241ms, Worker thread CPU: 19155 ms, walltime: 3241 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       841 064 356      cache-references                                                        (42,25%)
        78 942 226      cache-misses                     #    9,39% of all cache refs           (41,95%)
    14 732 203 268      cycles                                                                  (41,82%)
       826 513 797      ic_fetch_stall.ic_stall_back_pressure                                        (41,74%)
     1 007 162 397      stalled-cycles-frontend          #    6,84% frontend cycles idle        (41,64%)
    36 880 688 419      instructions                     #    2,50  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,77%)
     4 299 054 828      branch-instructions                                                     (41,94%)
       158 781 346      branch-misses                    #    3,69% of all branches             (41,92%)
     5 091 345 293      ic_fetch_stall.ic_stall_any                                             (42,02%)
        36 590 376      ic_fetch_stall.ic_stall_dq_empty                                        (42,07%)
        67 544 871      l2_cache_misses_from_ic_miss                                            (42,14%)
     1 892 218 221      l2_latency.l2_cycles_waiting_on_fills                                        (42,28%)
           184 062      faults                                                                
                 1      migrations                                                            

       3,275811568 seconds time elapsed

       2,962062000 seconds user
       0,311901000 seconds sys

After this PR is merged, I will prepare another one with these improvements.
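
For reference, the total cycle counts from the four perf runs in this thread can be compared with a small helper. This is only a sketch: percent_fewer and the constant names are made up for the example, and the inputs are copied from the perf output above (each run covers compression plus the verification decode).

```rust
/// Total `cycles` values copied from the perf output in this thread.
const MAIN_7EE30B6: u64 = 15_679_683_044;
const PR_0426F91: u64 = 15_098_618_381;
const PR_FOLLOW_UP: u64 = 15_030_765_275;
const RARER_CHECKS: u64 = 14_732_203_268;

/// Percentage by which `current` is lower than `baseline`.
fn percent_fewer(baseline: u64, current: u64) -> f64 {
    (baseline as f64 - current as f64) / baseline as f64 * 100.0
}

/// Reduction in total cycles relative to main for each later run.
fn reductions_vs_main() -> [f64; 3] {
    [
        percent_fewer(MAIN_7EE30B6, PR_0426F91),
        percent_fewer(MAIN_7EE30B6, PR_FOLLOW_UP),
        percent_fewer(MAIN_7EE30B6, RARER_CHECKS),
    ]
}
```

With these inputs the reductions relative to main come out at roughly 3.7%, 4.1% and 6.0% of total cycles. Each run is a single measurement with multiplexed counters, so small differences should be read with care.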

@Melirius
Collaborator Author

Rebased.

@Melirius
Collaborator Author

@mcroomp, any news on the corpus test?

@mcroomp mcroomp closed this Nov 21, 2024
@mcroomp mcroomp reopened this Nov 21, 2024
@mcroomp
Collaborator

mcroomp commented Nov 21, 2024

Looks good... looks like it produces the same bits as before (except for one byte less sometimes at the end).

@mcroomp mcroomp merged commit 6c065e0 into main Nov 21, 2024
4 checks passed
@mcroomp mcroomp deleted the No-count-in-bit-writer branch November 21, 2024 16:36