-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed is half of normal jpeg decoders (on ARM) #202
Comments
What 'normal decoder' are your referring to? |
opencv's builtin encoder. |
OpenCV does not have a builtin encoder, it uses either |
@HeroicKatora Sounds like it is similar. But I have a little bit different idea: Maybe another cause is SIMD? Because libjpeg-turbo says it uses SIMD a lot to speed up. It even uses hand-crafted assembly code. |
@HeroicKatora I have noticed that NEON intrinsics are currently only available on nightly. I can think of 3 options for adding SIMD support on NEON, and I'm not sure which is the best.
What are your thoughts? |
How likely is it that the NOEN intrinsics will be stabilized in their current form? If it is just a matter of waiting for them to reach stable, then I think that's probably the best option. We'd just leave it behind a feature flag until they were available on stable at our MSRV. What worries more is the testing/maintenance strategy given that the CI and most developers' machines are currently x86. (Not a deal-breaker mind you, just something to figure out so we don't find ourselves scrambling to track down an ARM machine because we need to test an urgent patch or something). |
As for being stabilized in their current form, considering that they look pretty much identical to their C counterpart (as do the x86 ones), I'd say we're probably fine. For testing/CI, luckily user-mode QEMU exists :) for libjxl we configured the GitHub CI like this - https://github.com/libjxl/libjxl/blob/main/.github/workflows/build_test.yml#L225 - I assume something similar could be set up for rust, although of course this wouldn't allow benchmarks. |
Regarding NEON, I wonder if it were possible to convince LLVM to auto-vectorize enough and in particular utilizing instructions. We can have specialized methods with explicit #[target_feature(enable = "avx2")]
pub unsafe fn sum_avx2(vec: &[u32]) -> u32 {
let mut n = 0u32;
for &i in vec {
n = n.wrapping_add(i);
}
n
}
pub fn sum_me(vec: &[u32]) -> u32 {
let mut n = 0u32;
for &i in vec {
n = n.wrapping_add(i);
}
n
} |
Ah well, looks like the feature names and annotation for ARM aren't stable either. https://rustc.godbolt.org/z/bcr59rTaa |
In my experience, for this kind of code, autovectorization doesn't really
bring you far enough...
I'll give it a try in the next few days and we can see where to go from
there.
…On Wed, 22 Dec 2021, 20:50 HeroicKatora, ***@***.***> wrote:
Ah well, looks like the feature names and annotation for ARM aren't stable
either. https://rustc.godbolt.org/z/bcr59rTaa
Kind of odd that the pure codegen appears to be tied so closely to the
stabilization of the arch module intrinsics, isn't it?
—
Reply to this email directly, view it on GitHub
<#202 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAOPAI5NMRV7KWJBAV72UJLUSITZ5ANCNFSM5FZ2T4XA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you commented.Message ID:
***@***.***>
|
I'm not a fan of proc macros but it might be worth to at least give it a try via the multiversion crate |
The code is a direct translation of the sse3 code for x86, and provides approximately a 0.67x decrease of image decoding time on a Samsung A51. Fixes image-rs#202, or at least improves the situation. decode a 2268x1512 JPEG time: [83.619 ms 85.692 ms 87.848 ms] decode a 2268x1512 JPEG time: [56.209 ms 57.019 ms 57.876 ms] change: [-35.318% -33.460% -31.535%] (p = 0.00 < 0.05)
The code is a direct translation of the sse3 code for x86, and provides approximately a 0.67x decrease of image decoding time on a Samsung A51. Fixes image-rs#202, or at least improves the situation. decode a 2268x1512 JPEG time: [83.619 ms 85.692 ms 87.848 ms] decode a 2268x1512 JPEG time: [56.209 ms 57.019 ms 57.876 ms] change: [-35.318% -33.460% -31.535%] (p = 0.00 < 0.05)
The code is a direct translation of the sse3 code for x86, and provides approximately a 0.67x decrease of image decoding time on a Samsung A51. Fixes image-rs#202, or at least improves the situation. decode a 2268x1512 JPEG time: [83.619 ms 85.692 ms 87.848 ms] decode a 2268x1512 JPEG time: [56.209 ms 57.019 ms 57.876 ms] change: [-35.318% -33.460% -31.535%] (p = 0.00 < 0.05)
The code is a direct translation of the sse3 code for x86, and provides approximately a 0.67x decrease of image decoding time on a Samsung A51. Fixes image-rs#202, or at least improves the situation. decode a 2268x1512 JPEG time: [83.619 ms 85.692 ms 87.848 ms] decode a 2268x1512 JPEG time: [56.209 ms 57.019 ms 57.876 ms] change: [-35.318% -33.460% -31.535%] (p = 0.00 < 0.05)
Hi thanks for the lib! However, the speed of this library seems to be half of normal jpeg decoders. When running on ARM, normal decoder uses 600ms to decode a 4000x3000 color JPEG, while this lib uses 1100ms.
The text was updated successfully, but these errors were encountered: