-
Notifications
You must be signed in to change notification settings - Fork 330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup of ZopfliFindLongestMatch -- gives ~8% overall performance boost #110
base: master
Are you sure you want to change the base?
Conversation
Thanks for the feedback! would be really interesting to see GCC's generated code before and after for that function for ARM. I can't think why there'd be a regression. Do you see similar results on x86/x64? I assume this testing is with your modified fork rather than the verbatim zopfli codebase? |
I've reverted the order of comparisons change due to the regression that you've noticed. re: size/space trade off.. A lot of those variables end up in registers in clang, so there's no extra memory tradeoff at all. I guess that'll be platform specific too, but I expect to see that in most cases. And perhaps gcc is just better at removing the unnecessary masking when appropriate than clang is. |
enwik8 --i15
|
Thanks @lorents17 for the info! What compiler/platform were you testing with? I see that you've measured a ~5% speedup.. but the difference in file size is interesting to me. Did your original zopfli results include the changes from 37f6da6 -- specifically katajainen.c? Otherwise I need to re-examing the changes to see why there could be a change. Thanks! |
OS - Windows 10 I used original zopfli with the latest changes from Apr 25, 2016 Interests me more as the ECT project could squeeze so quickly? |
I've tried before and after my changes on OSX/Apple LLVM v 7.3 and I'm getting a consistent 34967390 bytes in each case. I'll try and setup a system with gcc later to test further. as to the performance: |
ECT (which I am the author of) is much faster than zopfli at similar compression, even without multithreading. Multithreading is off by default as it causes compression losses and needs more processing power per byte. |
@fhanau Thanks for the info and clarifications! Are there any downsides to importing your binary tree based code into the mainline? |
Unfortunately, yes:
But you're right, the changes from ECT should be backported. |
Most of changes don't make any difference on ARMv7 when GCC is used, there is no way for me to efficiently run x86 tests as I run Fedora in VirtualBox (though it now supports AVX instructions set and stuff but the CPU is not "as free" as on Odroid U3). Of course ARM has a lot of registers so that may be the reason. Only the loop part change did a very little speed up, any variable type or order changing (here as well as with previous changes) did not make any difference or were actually slower. However, the differences are so small they could as well fall into a margin of error difference. As far as I'm concerned, GCC 4.8 is faster than GCC 5.3 which is faster than Clang 3.6. Or so it did score on ARMv7, would need to recompare it for original Zopfli again to confirm that Clang is still slower. Even with my MrKrzYch00@7aa51e1 I don't see any difference in speed. :) ATM the only thing I'm concerned about in case of Zopfli is why do I get:
which only occurs on my Odroid U3 (both of them) when freeing |
Check jthlim@b5a1ea2 for a detailed discussion of each change next to the associated code.
Please help verify that all of the comments/statements are valid from your perspective too!
Note: the 8% improvement is compounded on top of the optimizations discussed before (having asserts off, integer cost comparison, -ffast-math, link time optimization, etc.)
Note: Testing compressing hundreds of files has verified so far that these changes to not affect the compression.