Speed improvements to resize convolution (no vpermps w/ FMA) #1518

Sergio0694 · 2021-01-21T18:50:59Z

Prerequisites

- I have written a descriptive pull-request title
- I have verified that there are no overlapping pull-requests open
- I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
- I have provided test coverage for my change (where applicable)

Description

Follow up to #1513. This PR does a couple things:

- Switch the resize kernel processing to float
- Add an AVX2 vectorized method to normalize the kernel
- Vectorize the kernel copy when not using FMA, using Span<T>.CopyTo instead
- Remove the permute8x32 when using FMA, by creating a convolution kernel of 4x the size

Resize convolution codegen diff

Before:

```asm
vmovsd xmm2, [rax]
vpermps ymm2, ymm1, ymm2
vfmadd231ps ymm0, ymm2, [r8]
```

After:

```asm
vmovupd ymm2, [r8]
vfmadd231ps ymm0, ymm2, [rax]
```
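One way to read this diff: once the kernel is stored at 4x its length (each weight repeated once per channel of a 4-channel pixel), the loop no longer has to spread a weight across lanes with `vpermps`, so the loaded values can feed the FMA directly. The following is a minimal Python sketch of that equivalence, not the ImageSharp implementation. The names `expand_kernel`, `convolve_broadcast` and `convolve_expanded` are hypothetical, and plain lists stand in for SIMD registers.

```python
def expand_kernel(weights, channels=4):
    """Repeat every weight once per channel: [w0, w1] -> [w0]*4 + [w1]*4."""
    return [w for w in weights for _ in range(channels)]


def convolve_broadcast(weights, pixels, channels=4):
    """Reference form: each weight is spread over the channels at run time."""
    acc = [0.0] * channels
    for i, w in enumerate(weights):
        for c in range(channels):
            acc[c] += w * pixels[i * channels + c]
    return acc


def convolve_expanded(expanded, pixels, channels=4):
    """Expanded form: weights and pixel channels line up element by element,
    so the loop is a plain multiply-add with no per-weight shuffle."""
    acc = [0.0] * channels
    for j, (w, p) in enumerate(zip(expanded, pixels)):
        acc[j % channels] += w * p
    return acc


weights = [0.25, 0.5, 0.25]
pixels = [float(i) for i in range(12)]  # three 4-channel pixels, flattened
same = convolve_broadcast(weights, pixels) == convolve_expanded(
    expand_kernel(weights), pixels
)
```

The trade-off in this sketch is memory for shuffles: the expanded kernel is four times larger, but the inner loop only loads and multiplies.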


Assembly for loading in the loop went from:

```asm
vmovss xmm2, [rax]
vbroadcastss xmm2, xmm2
vmovss xmm3, [rax+4]
vbroadcastss xmm3, xmm3
vinsertf128 ymm2, ymm2, xmm3, 1
```

To:

```asm
vmovsd xmm3, [rax]
vbroadcastsd ymm3, xmm3
vpermps ymm3, ymm1, ymm3
```
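To make the lane layout of the shorter sequence concrete, here is a small Python emulation of the broadcast-then-permute load. It is only a sketch: the contents of the index register `ymm1` are not shown in the thread, so the mask `[0, 0, 0, 0, 1, 1, 1, 1]` below is an assumption, and the helper names simply mirror the instruction mnemonics.

```python
def vbroadcastsd(pair):
    """Emulate broadcasting one 64-bit value (two packed floats) to all
    four 64-bit lanes of a 256-bit register: [w0, w1] repeated four times."""
    assert len(pair) == 2
    return list(pair) * 4


def vpermps(indices, source):
    """Emulate a cross-lane float permute: result[i] = source[indices[i] & 7]."""
    return [source[i & 7] for i in indices]


# Assumed index register contents (not given in the source thread).
ASSUMED_MASK = [0, 0, 0, 0, 1, 1, 1, 1]

lanes = vpermps(ASSUMED_MASK, vbroadcastsd([0.25, 0.75]))
```

Under that assumption, two adjacent kernel weights end up as four copies of the first followed by four copies of the second, which is the layout needed to multiply two 4-channel pixels at once.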


antonfirsov · 2021-01-23T14:04:11Z

@Sergio0694 looks like there are several tests failing in KernelMapTests, some of them are concerning, for example:
https://github.com/SixLabors/ImageSharp/pull/1518/checks?check_run_id=1745872105#step:7:272

[xUnit.net 00:01:20.95]     KernelMapContentIsCorrect<WelchResampler>(resampler: SixLabors.ImageSharp.Processing.Processors.Transforms.WelchResampler, srcSize: 300, destSize: 2008) [FAIL]
   0 != 0.33333334 @ (Row:50, Col:0)

It would be nice if we could dig through the git history, check how @JimBobSquarePants's original code used double-s before I introduced ResizeKernelMap, and figure out whether the high-accuracy calculations are serving real user needs or not.

Sergio0694 · 2021-01-23T14:27:43Z

@antonfirsov yeah I noticed that too, just hadn't had the time to work on that just yet 😅
That test result seems too off to just be a rounding issue, I'm thinking I might've just messed something up somewhere. Which is weird because all the actual resize tests are passing just fine, so that's... A bit concerning.
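A quick sketch of why `0 != 0.33333334` looks larger than a rounding problem: rounding 1/3 from double to single precision moves the value by roughly 1e-8, while the reported mismatch is about 0.33. The helper `to_float32` is a hypothetical name; it round-trips a Python float through a 32-bit encoding using the standard `struct` module.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


third_double = 1.0 / 3.0
third_single = to_float32(third_double)

# Error introduced purely by the double -> float changeover.
rounding_error = abs(third_single - third_double)

# Size of the mismatch reported by the failing test (expected ~1/3, got 0).
observed_error = abs(0.0 - third_single)
```

Here `rounding_error` is many orders of magnitude smaller than `observed_error`, which fits the suspicion that a weight is missing or misplaced rather than merely rounded.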

I'm thinking maybe we should split up this PR and only merge the vperms improvement here, and then tackle the switch to float and the other vectorizations I've added here in a separate PR, to make it easier to review and debug? 🤔

antonfirsov · 2021-01-28T14:18:32Z

> all the actual resize tests are passing just fine

The resize tests do not cover all the cases. Kernel map creation is tested against all kinds of weird image dimensions + different resampler dimensions. It's not worth running expensive end-to-end resize tests for all those combinations; instead we have unit tests in ResizeKernelMapTests. There is an extended set of those tests, and I suggest doing a local run with those when you are done with the rest of the PR.
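The idea of sweeping kernel creation over many size combinations, instead of running full resizes, can be sketched as follows. This is an illustrative stand-in, not the ResizeKernelMapTests code: it uses a simple triangle filter, and `triangle`, `build_kernel_row` and `worst_row_error` are hypothetical names. It checks one property only, that every normalized row sums to 1 and stays inside the source bounds.

```python
import math


def triangle(x):
    """Triangle (linear) filter with radius 1."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def build_kernel_row(dest_index, src_size, dest_size):
    """Return (left, weights) for one destination pixel, normalized to sum 1."""
    ratio = src_size / dest_size
    scale = max(ratio, 1.0)  # widen the filter when downscaling
    center = (dest_index + 0.5) * ratio - 0.5
    left = max(0, math.ceil(center - scale))
    right = min(src_size - 1, math.floor(center + scale))
    weights = [triangle((i - center) / scale) for i in range(left, right + 1)]
    total = sum(weights)
    return left, [w / total for w in weights]


def worst_row_error(src_size, dest_size):
    """Largest deviation of any row sum from 1.0 for this size pair."""
    worst = 0.0
    for d in range(dest_size):
        left, weights = build_kernel_row(d, src_size, dest_size)
        assert 0 <= left and left + len(weights) <= src_size
        worst = max(worst, abs(sum(weights) - 1.0))
    return worst


# Cheap sweep over awkward size pairs, including the failing 300 -> 2008 case.
size_pairs = [(300, 2008), (2008, 300), (1, 5), (5, 1), (7, 7), (3, 1000)]
errors = {pair: worst_row_error(*pair) for pair in size_pairs}
```

A sweep like this runs in a fraction of the time of an end-to-end resize per combination, which is the trade-off described above.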

> I'm thinking maybe we should split up this PR and only merge the vperms improvement here.

Splitting out would be great yeah! By "vperms improvement" you mean the parts implementing #1515?

Sergio0694 · 2021-01-28T22:36:15Z

> The resize tests do not cover all the cases. Kernel map creation is tested against all kinds of weird image dimensions + different resampler dimensions. It's not worth running expensive end-to-end resize tests for all those combinations; instead we have unit tests in ResizeKernelMapTests. There is an extended set of those tests, and I suggest doing a local run with those when you are done with the rest of the PR.

Ooh I see, makes sense, thanks! Will take a look at those extra tests then 😄

> Splitting out would be great yeah! By "vperms improvement" you mean the parts implementing #1515?

Yup, exactly - the bit that expands the factors buffer to length 4x and then removes the shuffle.
Will revert the other changes in this PR then, and eventually apply them in a different PR later on 👍

JimBobSquarePants · 2021-02-17T07:25:46Z

Sorry @Sergio0694, the introduction of Git LFS (and subsequent history rewrite) has broken this. I spent a couple of hours trying to do a merge with unmatched history but Git simply won't bend to my will. It'd be simpler to create a new branch from master, copy your changes into it and open a new PR.

JimBobSquarePants · 2024-08-14T13:25:50Z

So... I decided to revisit this after far too many years and reimplemented each commit against the v4 codebase.

Differences in kernel map generation have something to do with the PeriodicKernelMap; disabling that leads to all ResizeKernelMap tests passing. It must be something to do with the double to float changeover.

Benchmarks don't really yield any meaningful difference between this and main.

BenchmarkDotNet v0.13.10, Windows 11 (10.0.22631.3958/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-VTPPCJ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

Runtime=.NET 8.0  Arguments=/p:DebugType=portable

Main

| Method | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| SystemDrawing | 18.562 ms | 0.2120 ms | 0.1983 ms | 1.00 | 98 B | 1.00 |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | 4.652 ms | 0.0433 ms | 0.0362 ms | 0.25 | 51117 B | 521.60 |

PR

| Method | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| SystemDrawing | 18.122 ms | 0.2950 ms | 0.2759 ms | 1.00 | 137 B | 1.00 |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | 4.533 ms | 0.0583 ms | 0.0517 ms | 0.25 | 52091 B | 380.23 |
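For reading the two tables side by side, the arithmetic can be sketched in a few lines of Python. The numbers are copied from the tables above; the dictionary layout and the helper names `alloc_ratio` and `relative_change` are only illustrative.

```python
# ImageSharp rows from each table, plus the SystemDrawing baseline allocation.
main = {"mean_ms": 4.652, "allocated": 51117, "baseline_allocated": 98}
pr = {"mean_ms": 4.533, "allocated": 52091, "baseline_allocated": 137}


def alloc_ratio(row):
    """Allocated bytes relative to the baseline row of the same run."""
    return row["allocated"] / row["baseline_allocated"]


def relative_change(before, after):
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


mean_change = relative_change(main["mean_ms"], pr["mean_ms"])
alloc_change = relative_change(main["allocated"], pr["allocated"])
ratios = (round(alloc_ratio(main), 2), round(alloc_ratio(pr), 2))
```

The mean time is about 2.6% lower on the PR branch and ImageSharp's own allocation is about 1.9% higher. The drop in Alloc Ratio from 521.60 to 380.23 comes mainly from the SystemDrawing baseline moving from 98 B to 137 B between runs, not from a change in ImageSharp's allocations.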

You can see my changes in this branch here.

https://github.com/SixLabors/ImageSharp/tree/js/resize-map-optimizations

JimBobSquarePants and others added 30 commits October 30, 2020 23:03

Add Shuffle4Slice3

1d21dc9

Cleanup

9f38d40

Merge branch 'master' into js/Shuffle3Channel

2421a56

fix spans directly

1b85483

Faster Pad3Shuffle4

21611e1

Faster Shuffle4Slice3

f462bfe

Update benchmark

2d1f2cc

Fast fallbacks

d5b2577

Don't cast full spans

893bfdd

Shuffle3 + Tests

76d5277

Cleanup and fix tests

49062c4

Fix Shuffle4Slice3, wire up shuffles.

8c32469

Add Rgb24 <==> Vector4 benchmarks

1f73b21

Add initial vectorized color converter implementation

210d8f7

Unroll XYZWShuffle4Slice3

a08f906

Fix shuffle +m slice fallback

4416d3d

Refactor JpegColorConverters

f421be2

Inline controls as constants

11cc6af

Refactor and add tests

7cc953e

Add benchmarks

89bb9fe

Drop FromGrayscaleVector8

d76dbaa

Polish benchmarks (fix new warnings)

82a2359

Fix #1414

5748d3d

Merge branch 'master' into af/fix-Decode_IsCancellable

73454f5

Handle Bmp encoder padding.

1ad9fcd

Update src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

e1168ad

Co-authored-by: Clinton Ingram <[email protected]>

Merge branch 'js/Shuffle3Channel' of https://github.com/SixLabors/Ima…

a46fb9b

…geSharp into js/Shuffle3Channel

Use ROS trick all round and optimize Shuffle3

74dd8cd

Merge pull request #1415 from SixLabors/af/fix-Decode_IsCancellable

cf9cc6b

Fix JpegDecoderTests.Decode_IsCancellable

Merge branch 'master' into js/Shuffle3Channel

56cfd96

Sergio0694 and others added 14 commits January 19, 2021 18:19

Add initial FMA resize kernel convolve implementation

42632c7

Switch from FMA to AVX2 instructions

874e951

Revert to FMA, codegen improvements

941e173

Add unrolled FMA loop

493d04a

Add missing indexing update

407c2d9

Workaround for incorrect codegen on .NET 5

a7ca1b0

See Vector256.Create issue: dotnet/runtime#47236

Update image threshold for resize tests

e2211c3

Merge pull request #1513 from SixLabors/sp/simd-resize-convolve

7eb5cc0

Speed improvements to resize kernel (w/ SIMD)

Switch temporary buffers to use float-s

cde8677

Add AVX path for resize kernel normalization

e780959

Minor codegen improvements

22ab161

Remove unnecessary memory zeroing

14a7423

Remo permute in resize kernel

14acad8

Sergio0694 added the area:performance label Jan 21, 2021

Sergio0694 added this to the 1.1.0 milestone Jan 21, 2021

Sergio0694 added 2 commits January 21, 2021 20:34

Fix length comparison for resize kernels in tests

2e1d612

Improve codegen in resize kernel FMA implementation

05e2c91

JimBobSquarePants closed this Feb 17, 2021

JimBobSquarePants force-pushed the master branch from db51f69 to 172c48e Compare February 17, 2021 01:43

JimBobSquarePants mentioned this pull request Aug 15, 2024

WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793

Draft

4 tasks

lizard-boy mentioned this pull request Nov 1, 2024

WIP - Speed improvements to resize convolution (no vpermps w/ FMA) grepdemos/ImageSharp#3

Open

4 tasks
