
Speed is half of normal jpeg decoders (on ARM) #202

Closed
fzyzcjy opened this issue Oct 12, 2021 · 11 comments · Fixed by #221

Comments

@fzyzcjy

fzyzcjy commented Oct 12, 2021

Hi, thanks for the lib! However, this library seems to run at about half the speed of normal JPEG decoders. On ARM, a normal decoder takes 600ms to decode a 4000x3000 color JPEG, while this lib takes 1100ms.

@HeroicKatora
Member

What 'normal decoder' are you referring to?

@fzyzcjy
Author

fzyzcjy commented Oct 22, 2021

opencv's builtin encoder.

@HeroicKatora
Member

OpenCV does not have a builtin encoder; it uses either libjpeg or libjpeg-turbo. Should this be closed as a duplicate of #155?

@fzyzcjy
Author

fzyzcjy commented Oct 24, 2021

@HeroicKatora That sounds similar, but I have a slightly different idea: maybe another cause is SIMD? libjpeg-turbo says it uses SIMD heavily to speed things up; it even uses hand-crafted assembly code.

@veluca93
Contributor

@HeroicKatora I have noticed that NEON intrinsics are currently only available on nightly. I can think of 3 options for adding SIMD support on NEON, and I'm not sure which is the best.

  • Use an external asm file that contains the NEON-specific code. This would work, but it's a huge pain (among other things, one needs to care about calling conventions and the like...)
  • Use an external c file containing functions implemented with intrinsics. I believe this is entirely superior to the asm approach.
  • Only support NEON on nightly. By far the easiest option, and saves us the pain of writing the CPU detection code if we want to support arm and not just aarch64...

What are your thoughts?
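For concreteness, here is a minimal sketch of the arch-gated approach, with a hypothetical saturating byte-add standing in for a real decode kernel (an illustrative assumption, not jpeg-decoder code). The aarch64 path uses `std::arch::aarch64` intrinsics (nightly-only at the time of this thread), while every other target compiles the scalar loop, so the crate still builds everywhere:

```rust
// Hypothetical kernel: saturating element-wise add of two byte slices.
// On aarch64 we use NEON intrinsics; elsewhere, a scalar fallback.

#[cfg(target_arch = "aarch64")]
pub fn saturating_add_bytes(a: &[u8], b: &[u8]) -> Vec<u8> {
    use std::arch::aarch64::{vld1q_u8, vqaddq_u8, vst1q_u8};
    assert_eq!(a.len(), b.len());
    let mut out = vec![0u8; a.len()];
    let mut i = 0;
    // Process 16 lanes per iteration with the NEON saturating add.
    while i + 16 <= a.len() {
        unsafe {
            let va = vld1q_u8(a.as_ptr().add(i));
            let vb = vld1q_u8(b.as_ptr().add(i));
            vst1q_u8(out.as_mut_ptr().add(i), vqaddq_u8(va, vb));
        }
        i += 16;
    }
    // Scalar tail for the leftover elements.
    for j in i..a.len() {
        out[j] = a[j].saturating_add(b[j]);
    }
    out
}

#[cfg(not(target_arch = "aarch64"))]
pub fn saturating_add_bytes(a: &[u8], b: &[u8]) -> Vec<u8> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&x, &y)| x.saturating_add(y)).collect()
}

fn main() {
    let out = saturating_add_bytes(&[200, 10, 255], &[100, 20, 1]);
    assert_eq!(out, vec![255, 30, 255]);
    println!("{:?}", out);
}
```

The downside this sketch illustrates is exactly the one raised above: the NEON branch needs either nightly or stabilized intrinsics, and supporting 32-bit arm (not just aarch64) would additionally require runtime CPU feature detection.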

@fintelia
Contributor

How likely is it that the NEON intrinsics will be stabilized in their current form? If it is just a matter of waiting for them to reach stable, then I think that's probably the best option. We'd just leave the code behind a feature flag until the intrinsics were available on stable at our MSRV.

What worries me more is the testing/maintenance strategy, given that the CI and most developers' machines are currently x86. (Not a deal-breaker, mind you; just something to figure out so we don't find ourselves scrambling to track down an ARM machine because we need to test an urgent patch or something.)

@veluca93
Contributor

As for their being stabilized in their current form: considering that they look pretty much identical to their C counterparts (as do the x86 ones), I'd say we're probably fine.

For testing/CI, luckily user-mode QEMU exists :) For libjxl we configured the GitHub CI like this: https://github.com/libjxl/libjxl/blob/main/.github/workflows/build_test.yml#L225. I assume something similar could be set up for Rust, although of course this wouldn't allow benchmarks.
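For Rust specifically, a similar setup could be sketched with a Cargo runner entry (assuming the aarch64-unknown-linux-gnu target, a cross toolchain, and user-mode QEMU are installed; the sysroot path varies by distro):

```toml
# .cargo/config.toml — run cross-compiled tests under user-mode QEMU.
# Assumes `rustup target add aarch64-unknown-linux-gnu` plus a
# gcc-aarch64-linux-gnu cross toolchain and qemu-user.
[target.aarch64-unknown-linux-gnu]
linker = "aarch64-linux-gnu-gcc"
runner = "qemu-aarch64 -L /usr/aarch64-linux-gnu"
```

With this in place, `cargo test --target aarch64-unknown-linux-gnu` would exercise the NEON code paths under emulation, which covers correctness but, as noted, is useless for benchmarks.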

@HeroicKatora
Member

HeroicKatora commented Dec 22, 2021

Regarding NEON, I wonder if it is possible to convince LLVM to auto-vectorize well enough, and in particular to emit the right instructions. We can have specialized methods with explicit target_feature(enable = "…") annotations. For example, compare the instruction output here:

// Explicitly enabling AVX2 lets LLVM auto-vectorize the loop with 256-bit
// instructions; `unsafe` because the caller must guarantee AVX2 is
// actually available at runtime.
#[target_feature(enable = "avx2")]
pub unsafe fn sum_avx2(vec: &[u32]) -> u32 {
    let mut n = 0u32;
    for &i in vec {
        n = n.wrapping_add(i);
    }
    n
}

// Identical loop compiled with the default target features, for comparison.
pub fn sum_me(vec: &[u32]) -> u32 {
    let mut n = 0u32;
    for &i in vec {
        n = n.wrapping_add(i);
    }
    n
}

playground

@HeroicKatora
Member

Ah well, looks like the feature names and annotation for ARM aren't stable either. https://rustc.godbolt.org/z/bcr59rTaa
Kind of odd that the pure codegen appears to be tied so closely to the stabilization of the arch module intrinsics, isn't it?

@veluca93
Contributor

veluca93 commented Dec 22, 2021 via email

@paolobarbolini
Contributor

Regarding NEON, I wonder if it is possible to convince LLVM to auto-vectorize well enough, and in particular to emit the right instructions.

I'm not a fan of proc macros, but it might be worth at least giving it a try via the multiversion crate.

veluca93 added a commit to veluca93/jpeg-decoder that referenced this issue Feb 13, 2022
The code is a direct translation of the sse3 code for x86, and reduces image decoding time to approximately 0.67x of the original on a Samsung A51.

Fixes image-rs#202, or at least improves the situation.

decode a 2268x1512 JPEG time:   [83.619 ms 85.692 ms 87.848 ms]

decode a 2268x1512 JPEG time:   [56.209 ms 57.019 ms 57.876 ms]
                        change: [-35.318% -33.460% -31.535%] (p = 0.00 < 0.05)
veluca93 added a commit to veluca93/jpeg-decoder that referenced this issue Feb 14, 2022
veluca93 added a commit to veluca93/jpeg-decoder that referenced this issue Feb 14, 2022
wartmanm pushed a commit to wartmanm/jpeg-decoder that referenced this issue Sep 15, 2022