Is there some internal locking going on inside of Skia or SkiaSharp? #2857

bforl · 2024-05-12T17:53:08Z

I have been experimenting with performing SkiaSharp rendering across multiple threads.

The idea is to render 8 different canvases as fast as I can. Since Skia is thread safe, I figured I could just render each canvas in its own thread. So this should be one of those 'embarrassingly parallel' problems, right?

Well, it works ... but the timings are not what I would expect.

First, here are the results of processing all 8 sequentially (no parallelism):

You can see I have 25 iterations to "warm" things up, and you can see that each canvas takes about 22ms to complete.

Second, here is the same thing, but run using Parallel.ForEach:

The total is faster, but not as fast as I would expect. Oddly, each canvas now takes much longer to complete: they have gone from 22ms to 60ms. That sounds to me like there is some resource contention (locking?) going on.

Any thoughts on what might cause this?

For reference, you can uncomment the SpinWait and comment out the render to get an idea of what kind of parallelism you should get. On my 4-core (8 logical) machine, I expect to see a speedup of 4x to 8x (which is what I see with the spin wait).

Here is the full code

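// Requires the SkiaSharp package and <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
// in the project file (Test.Render uses unsafe code).
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using SkiaSharp;

internal class Program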
{
    public const int ImageSize = 800;

    public class Item
    {
        public SKPoint[] Points { get; }
        public double RenderTime { get; set; }

        public Item(int count)
        {
            var points = new SKPoint[count];
            var r = new Random();

            for (int i = 0; i < points.Length; ++i)
            {
                points[i].X = r.NextSingle() * ImageSize;
                points[i].Y = r.NextSingle() * ImageSize;
            }

            Points = points;
        }

        public void Render(SKCanvas canvas)
        {
            canvas.Clear();

            var paint = new SKPaint()
            {
                StrokeCap = SKStrokeCap.Square,
                StrokeWidth = 4,
                Color = new SKColor(0, 50, 128, 5)
            };

            canvas.DrawPoints(SKPointMode.Points, Points, paint);
        }
    }

    public class Test
    {
        public void Run(int index)
        {
            var items = Enumerable.Range(0, 8).Select(e => new Item(500_000)).ToArray();

            var sw = Stopwatch.StartNew();
            Parallel.ForEach(items, new ParallelOptions() { MaxDegreeOfParallelism = -1 /* 1 for single thread */ }, (item) =>
            {
                Render(item, ImageSize, ImageSize);
            });

            sw.Stop();

            Console.WriteLine($"Test iteration {index:00} : " + string.Join(", ", 
                items.Select(e => $"{e.RenderTime:0.0}")) + $"    # Total {sw.Elapsed.TotalMilliseconds:0.0}(ms)");
        }

        public unsafe void Render(Item item, int width, int height)
        {
            var buffer = new int[width * height];
            int stride = width * 4;

            fixed (int* bufferPointer = buffer)
            {
                var sKImageInfo = new SKImageInfo(width, height);
                using (SKSurface sKSurface = SKSurface.Create(sKImageInfo, (nint)bufferPointer, stride))
                {
                    var sw = Stopwatch.StartNew();

                    //Thread.SpinWait(500000);
                    item.Render(sKSurface.Canvas);
                    
                    sw.Stop();

                    item.RenderTime = sw.Elapsed.TotalMilliseconds;
                }
            }
        }
    }

    static void Main(string[] args)
    {
        for (int i = 0; i < 25; ++i)
        {
            Test test = new Test();

            test.Run(i);
        }

        Console.ReadLine();
    }
}

TwinkyDaniel · 2024-05-13T06:43:37Z

I didn't try it, but looking at the code, you should make sure to dispose all disposable Skia objects. Otherwise they will get disposed by the finalizer, which is basically THE global lock you are looking for.

1 reply

bforl May 13, 2024
Author

The only disposable I could see that I was missing was SKPaint, so I changed it to:

public void Render(SKCanvas canvas)
 {
     canvas.Clear();

     using var paint = new SKPaint()
     {
         StrokeCap = SKStrokeCap.Square,
         StrokeWidth = 4,
         Color = new SKColor(0, 50, 128, 5)
     };

     canvas.DrawPoints(SKPointMode.Points, Points, paint);
 }

But it didn't make a difference :(

nateglasser · 2024-05-15T02:52:44Z

I am trying to do something similar, and your question made me nervous about my own project, so I did some profiling. There are some gotchas in your test program:

1. Parallel.ForEach is not guaranteed to spawn the number of threads you request [source]:

Conversely, by default, the Parallel.ForEach and Parallel.For methods can use a variable number of tasks. That's why, for example, the ParallelOptions class has a MaxDegreeOfParallelism property instead of a "MinDegreeOfParallelism" property. The idea is that the system can use fewer threads than requested to process a loop.

The .NET thread pool adapts dynamically to changing workloads by allowing the number of worker threads for parallel tasks to change over time. At run time, the system observes whether increasing the number of threads improves or degrades overall throughput and adjusts the number of worker threads accordingly.

You might need to increase your iteration count before the runtime starts the number of threads you expect. You should double check what it decides to do (e.g. with Visual Studio's Diagnostic Tools).

2. The garbage collector is interfering here. You are doing many large allocations of SKPoint[], which immediately become eligible for collection, and that can stall your worker threads. You should either hold on to your Item references so they don't get collected, or pause garbage collection with GC.TryStartNoGCRegion.

3. Profiling reveals one of the bottlenecks is how you generate your Item array. Your calls to Random.NextSingle() are not parallelized, and they're taking up almost as much time as rendering to the canvas. Exaggerating your test app to an array of 2000 Item instances maxes out my CPU cores during the Parallel.ForEach segment, but stalls here because it's not part of your parallelization: var items = Enumerable.Range(0, 2000).Select(e => new Item(500_000)).ToArray();
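To check the first two points without a profiler, something like the following could be wrapped around the parallel loop. This is only a minimal sketch: `ParallelProbe`, `Measure` and the 16 MB no-GC budget are hypothetical names and values chosen for illustration, not part of the original benchmark. It counts how many distinct threads the runtime actually used and how many gen-0 collections happened during the timed region.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Runtime;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelProbe
{
    // Runs `work` once per index in parallel and reports how many distinct
    // threads were used, how many gen-0 collections occurred while the loop
    // was running, and the wall-clock time in milliseconds.
    public static (int ThreadsUsed, int Gen0Collections, double ElapsedMs) Measure(
        int itemCount, Action<int> work)
    {
        var threadIds = new ConcurrentDictionary<int, bool>();

        // Ask the runtime to avoid collections during the timed region.
        // The 16 MB budget is an arbitrary illustrative value (assumption).
        bool noGc;
        try { noGc = GC.TryStartNoGCRegion(16L * 1024 * 1024); }
        catch (ArgumentOutOfRangeException) { noGc = false; } // budget too large for this GC configuration

        int gen0Before = GC.CollectionCount(0);
        var sw = Stopwatch.StartNew();
        try
        {
            Parallel.For(0, itemCount, i =>
            {
                threadIds.TryAdd(Environment.CurrentManagedThreadId, true);
                work(i);
            });
        }
        finally
        {
            sw.Stop();
            // EndNoGCRegion throws if the region has already ended (for
            // example because the budget was exceeded), so check first.
            if (noGc && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }

        int gen0 = GC.CollectionCount(0) - gen0Before;
        return (threadIds.Count, gen0, sw.Elapsed.TotalMilliseconds);
    }
}
```

With the benchmark above, the call would look roughly like `ParallelProbe.Measure(items.Length, i => Render(items[i], ImageSize, ImageSize))`. If the thread count is well below the number of logical cores, or the gen-0 count is non-zero, the per-canvas timings are probably not measuring rendering alone.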

4 replies

bforl May 15, 2024
Author

1. I wasn't requesting a specific number of threads, I was just using -1 (and yes, I appreciate it will only use as many workers as are available at that specific time).
2. The points are constructed before the timings, not during.
3. Same as above: the points are generated outside of the timings/render.

The point still stands though: if you replace item.Render(sKSurface.Canvas) with Thread.SpinWait(500000);, the timings scale so much better. And that still includes the allocation of the buffer. So the problem has to be in Skia.

bforl May 15, 2024
Author

To further prove the point, here is the code modified so that all user allocations (obviously Skia will still do its own) are done before the tests run.

Same results.

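// Requires the SkiaSharp package and <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
// in the project file (Test.Render uses unsafe code).
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using SkiaSharp;

internal class Program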
{
    public const int ImageSize = 800;

    public class Item
    {
        public SKPoint[] Points { get; }
        public double RenderTime { get; set; }
        public int[] Buffer { get; }
        public int Stride { get; }

        public Item(int count)
        {
            Buffer = new int[ImageSize * ImageSize];
            Stride = ImageSize * 4;

            var points = new SKPoint[count];
            var r = new Random();

            for (int i = 0; i < points.Length; ++i)
            {
                points[i].X = r.NextSingle() * ImageSize;
                points[i].Y = r.NextSingle() * ImageSize;
            }

            Points = points;
        }

        public void Render(SKCanvas canvas)
        {
            canvas.Clear();

            // Dispose the paint deterministically instead of leaving it to the finalizer.
            using var paint = new SKPaint()
            {
                StrokeCap = SKStrokeCap.Square,
                StrokeWidth = 4,
                Color = new SKColor(0, 50, 128, 5)
            };

            canvas.DrawPoints(SKPointMode.Points, Points, paint);
        }
    }

    public class Test
    {
        public void Run(int index, Item[] items)
        {
            var sw = Stopwatch.StartNew();
            Parallel.ForEach(items, new ParallelOptions() { MaxDegreeOfParallelism = -1 /* 1 for single thread */ }, (item) =>
            {
                Render(item, ImageSize, ImageSize);
            });

            sw.Stop();

            Console.WriteLine($"Test iteration {index:00} : " + string.Join(", ",
                items.Select(e => $"{e.RenderTime:0.0}")) + $"    # Total {sw.Elapsed.TotalMilliseconds:0.0}(ms)");
        }

        public unsafe void Render(Item item, int width, int height)
        {
            fixed (int* bufferPointer = item.Buffer)
            {
                var sKImageInfo = new SKImageInfo(width, height);
                using (SKSurface sKSurface = SKSurface.Create(sKImageInfo, (nint)bufferPointer, item.Stride))
                {
                    var sw = Stopwatch.StartNew();

                    //Thread.SpinWait(500000);
                    item.Render(sKSurface.Canvas);

                    sw.Stop();

                    item.RenderTime = sw.Elapsed.TotalMilliseconds;
                }
            }
        }
    }

    static void Main(string[] args)
    {
        var items = Enumerable.Range(0, 8).Select(e => new Item(500_000)).ToArray();

        for (int i = 0; i < 25; ++i)
        {
            Test test = new Test();

            test.Run(i, items);
        }

        Console.ReadLine();
    }
}

nateglasser May 15, 2024

Hi sorry, yes I misunderstood (and confused myself in my reply as I was tweaking your benchmark).

I think you are seeing cache misses on the processor. When each thread uses its own separate points array, then I see rapidly diminishing returns. If I shrink your image dimensions and use a shared static array of points (so that all cores are using the same memory), then the timings scale with the number of threads as expected. Do you see the same on your system?

Your benchmark is using at least 6.25 MB per core (a 4-byte-per-pixel 800x800 RGBA image plus a 500,000-element array of 8-byte points). I don't know your processor, but for reference the latest Intel i7 has only 24 MB of L3 cache to share across all cores. AMD chips may have more but probably still not enough. When you exceed that, additional threads won't necessarily make things faster if they are fighting each other for the cache, and so I think you are measuring a memory/cache bottleneck (which is also why the spin wait scales as expected: it doesn't need the cache).
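The arithmetic behind that estimate can be written out as a small helper. This is just a sketch for illustration; `WorkingSet`, `BytesPerItem` and `FitsInCache` are hypothetical names, and the 24 MB cache size is the figure quoted above rather than a measured value.

```csharp
public static class WorkingSet
{
    // Bytes touched by one render: the pixel buffer plus the point array.
    public static long BytesPerItem(int imageSize, int bytesPerPixel, int pointCount, int bytesPerPoint)
        => (long)imageSize * imageSize * bytesPerPixel + (long)pointCount * bytesPerPoint;

    // True if `threads` concurrent renders fit in a cache of `cacheBytes`.
    public static bool FitsInCache(long bytesPerItem, int threads, long cacheBytes)
        => bytesPerItem * threads <= cacheBytes;
}

// For the benchmark in this thread:
//   800 * 800 * 4 bytes          = 2,560,000 bytes (pixel buffer)
//   500,000 points * 8 bytes     = 4,000,000 bytes (SKPoint is two floats)
//   per item                     = 6,560,000 bytes (about 6.26 MiB)
//   8 items rendered at once     = 52,480,000 bytes (about 50 MiB)
// A 24 MiB L3 cache holds 25,165,824 bytes, so only 3 such items fit at once.
```

Shrinking ImageSize or the point count until `FitsInCache` returns true for 8 threads is one way to test whether the cache explanation holds on a given machine.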

bforl May 15, 2024
Author

I see, that makes sense and matches what I see when I play with the amount of memory required.

I can remove Skia from the entire equation and still see poor scaling when the memory required doesn't fit in the L3 cache. Something I hadn't even considered.

Thanks for explaining
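For anyone who wants to reproduce the "no Skia" experiment, a memory-bound loop along these lines should be enough. It is a rough sketch, not the code that was actually run in this thread; `MemoryScaling` and `MeanWorkerMs` are hypothetical names. Each worker walks its own private buffer, so the combined working set grows with the number of workers.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class MemoryScaling
{
    // Each worker repeatedly walks its own private buffer, so the combined
    // working set is workers * bytesPerWorker. Returns the mean time a
    // worker spent in its loop, in milliseconds.
    public static double MeanWorkerMs(int workers, int bytesPerWorker, int passes)
    {
        // Allocate everything up front so allocation is not part of the timing.
        var buffers = new int[workers][];
        for (int w = 0; w < workers; ++w)
            buffers[w] = new int[bytesPerWorker / sizeof(int)];

        var times = new double[workers];
        var sinks = new long[workers];

        Parallel.For(0, workers, new ParallelOptions { MaxDegreeOfParallelism = workers }, w =>
        {
            var buf = buffers[w];
            long sum = 0;
            var sw = Stopwatch.StartNew();
            for (int p = 0; p < passes; ++p)
            {
                for (int i = 0; i < buf.Length; ++i)
                {
                    buf[i] += p;   // read-modify-write touches every cache line
                    sum += buf[i];
                }
            }
            sw.Stop();
            times[w] = sw.Elapsed.TotalMilliseconds;
            sinks[w] = sum;        // keep the result live so the loop is not optimised away
        });

        double total = 0;
        foreach (var t in times) total += t;
        return total / workers;
    }
}
```

Comparing `MeanWorkerMs(1, size, 20)` with `MeanWorkerMs(8, size, 20)` for a small size (say 256 KB) and a large one (say 8 MB) should show whether per-worker time stays flat or climbs once the combined buffers no longer fit in cache. The exact numbers will depend on the machine.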

artemiusgreat · 2024-10-26T04:20:19Z

I remember trying to use Skia from multiple threads and it was failing, so it is weird that you're now able to safely call it inside Parallel.
Meanwhile, a couple of suggestions:

Parallel may create tasks or pick threads from the pool. Both task creation and keeping all threads busy may create additional delays, so I would try to repeat the same test where each canvas is rendered inside its own thread that was pre-created before the work started.
I was working on real-time stock charts, and to make them flexible and composable, each part was rendered in its own thread, e.g. 4 threads for the scales (top, bottom, right, left) and one for the main area with bars and indicators, so I created a thread-safe task runner to do rendering in background threads. It worked fast and without locks or any kind of synchronization, but was exhausting all CPU resources. I abandoned the idea of having too many threads, but having one background thread was an acceptable compromise. Example of call.

Conclusion: You can optimize some parts of Skia, but it was built to draw shapes one by one, not in parallel like OpenGL on the GPU, which means that you will lose most of the processing time on array iterations or memory allocation.
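The first suggestion (one pre-created thread per canvas) could be tried with something like the sketch below. `DedicatedThreads` and `RunAll` are hypothetical names used for illustration only. All threads are started and parked before the timed work begins, so thread creation and pool scheduling are excluded from the per-item timings.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class DedicatedThreads
{
    // Starts one thread per work item ahead of time, releases them all at
    // once, and returns the time each thread spent inside `work` (in ms).
    public static double[] RunAll(int count, Action<int> work)
    {
        var times = new double[count];
        var threads = new Thread[count];
        using var ready = new CountdownEvent(count);
        using var go = new ManualResetEventSlim(false);

        for (int i = 0; i < count; ++i)
        {
            int index = i; // per-thread copy of the loop variable
            threads[i] = new Thread(() =>
            {
                ready.Signal(); // this thread exists and is running
                go.Wait();      // block until every thread is ready
                var sw = Stopwatch.StartNew();
                work(index);
                sw.Stop();
                times[index] = sw.Elapsed.TotalMilliseconds;
            })
            { IsBackground = true };
            threads[i].Start();
        }

        ready.Wait(); // thread creation cost is paid before the work starts
        go.Set();     // release all workers together
        foreach (var t in threads) t.Join();
        return times;
    }
}
```

In the benchmark from this thread, the Parallel.ForEach call could be swapped for roughly `DedicatedThreads.RunAll(items.Length, i => Render(items[i], ImageSize, ImageSize))`. If the per-canvas times look the same as with Parallel.ForEach, thread pool behaviour is unlikely to be the cause.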

1 reply

bforl Oct 26, 2024
Author

Some good points.

I do have some threading going on now, however it's tricky because the charts are interactive (the user can zoom, pan, scroll etc.), and we have a dashboard of them.

My current solution is to do all rendering on the main thread when the user is actively interacting with a chart (this keeps things as responsive as possible), and otherwise do it on background tasks (e.g. resizing, chart updates etc.). Seems to be a reasonable compromise.

However, we do everything on the CPU. It just seems to be much faster than using Skia on the GPU.

E.g. a dashboard with 16 charts (16 SKCanvas instances) seems to be much more performant when in CPU mode.

I guess ideally you would have one SKCanvas and manually split it into 16 sections, so that you can have one GPU context instead of 16. But that makes things much more complicated in terms of interactivity and maintainability.

For reference, this is all ScottPlot in WPF https://scottplot.net/

ToolmakerSteve · 2024-11-10T00:01:51Z

@bforl - are multiple canvases dynamically changing at the same time?

If not, record everything needed for a temporarily-static canvas into an SKPicture? That should run well in an independent background thread. I think that step would just run on the CPU. It should then render quickly on the GPU, when needed.

Not parallelism, but it minimizes the amount of work needed each frame.

tl;dr Have you tested with SkiaSharp 3 Preview?

A thought that popped into my mind when you mention the increased time of each task: perhaps something in SkiaSharp is causing extra task switch per canvas per frame. Every time a task is suspended, it forgoes the remainder of the current task slice. About 20 ms per slice. Multiple tasks can run in a slice, but no single task will run twice in a single slice, AFAIK. Every task slice that a given task doesn't get a chance to run, or is unable to finish, is another 20 ms time passed. I've seen similar behavior when starving the cpu. I see times like "77ms". That means 4 time slices passed before that task finished its work.

Definitely sounds like there is some code that SkiaSharp runs single-threaded. Or some contended resource.

I encounter similar slowdowns using DrawnUI, which runs on SkiaSharp on Maui. Not trying to multi-thread; just going for maximum frame rate, with many Bitmaps moving/resizing on each frame.

I have an all-SkiaSharp game. Encountered Maui bugs when added a Maui UI overlay. I switched to DrawnUI, and use its equivalents to Maui controls, which draw on to the SkiaSharp canvas. Much happier - no more dependency on native platform GUIs. This is a GUI breakthrough I have been waiting for, for years.

Discussing with taublast [DrawnUI creator] in one of my closed DrawnUI issues: situation will be greatly improved after migration of DrawnUI to SkiaSharp 3: then DrawnUI rendering logic can move to background threads. Currently, it needs to be done on UI MainThread.

Have you tested with SkiaSharp 3 Preview?
