Create SIMD-accelerated version of compute_gru function #191
base: master
Conversation
* Add `-O3 -march=native` to the compiler flags in the autoconf/automake/autoetc. stuff
* Optimize biquad filter implementation
Good work! I have worked on an AVX implementation, so there are a few pitfalls I can help you avoid. I'm not a maintainer on this repo though, so I can't guarantee this will ever be accepted.
@a-rose thank you very much for the review. I'm glad you were aware of the fact that FMA support doesn't always exist when AVX2 support does.
* We already have the value in a register, so avoid spilling to stack and reading back
Update FMA check
Thanks for taking the time to look into it :) I have created a PR on your fork to improve SIMD detection: Ameobea#1
Fix SIMD flags detection
I have also opened a pull request on the Xiph-run GitLab instance referenced in the README: https://gitlab.xiph.org/xiph/rnnoise/-/merge_requests/2
Original merge request: xiph/rnnoise#191
I did some CPU profiling of pulseeffects / easyeffects, which includes this library as a dependency for its noise reduction functionality. This showed that the `compute_gru` function was the hottest one in the whole application.

The changes in this pull request create a new function, `compute_gru_avx`, which has the same functionality as `compute_gru` but uses SIMD intrinsics to accelerate the function dramatically. The main changes focus on computing the sums in the GRU function 8 at a time and using FMAs to combine multiplications and adds into a single operation, increasing accuracy as a result. This also reduces the overhead of loop-counter checking, which my profiling led me to believe was the most expensive part of the whole function before these changes. Additionally, converting the weights (which are stored as 8-bit signed integers) is done 8 at a time using SIMD for an additional speedup.

I also made some changes to the build configuration: the compiler flags now use `-O3` instead of `-O2`, which yielded some benefits on my machine, and `-march=native` is passed, which enables the SIMD instructions used here. If these CPU features aren't available, they will be disabled at build time.

After the optimizations, the `compute_gru_avx` function uses only ~4.4% of the total CPU time, compared to ~19.63% before: a 4.45x speedup.

Here is a Compiler Explorer link that shows the full assembly produced by the optimized `compute_gru_avx` function: https://c.godbolt.org/z/xzEGxj8ne

Testing done on my own machine using pulseeffects + `librnnoise.so` built with the optimized code shows it works identically to before, with reduced CPU usage for the application.

Let me know if you think this is something that you'd like to get merged into the project. I'm happy to make any changes necessary. There may be a better/different way you'd like to handle the CPU feature detection, and I'd love suggestions on how to handle that.