Core #6
base: master
Conversation
- Excluded unused files.
- Removed configuration required only on Windows.
I apologize for the delay; I had a very busy week. I spent time today analyzing the performance of your app and the call to np.dot. I see that we are allocating huge amounts of memory, but I think it is all legitimate. Your code is in a loop calling np.dot on 2000 × 2000 × 8-byte buffers (32 MB). Each call to np.dot allocates one or more buffers of that size to get its work done. It adds up fast.

I also ran performance analysis software on the code to see where we are spending the most time. It turns out that my old friend NpyArray_ITER_NEXT, plus the other code I shared with you earlier, is taking most of the time. As a "strided" system, numpy creates data structures that map "views" into the allocated arrays. When operations are performed on these arrays, each element needs to run through NpyArray_ITER_NEXT to make sure the correct offset into the buffer is calculated based on the mapped views.

In the C code, this is a macro, which means the compiler inserts the code directly into the calling code; this allows for optimal performance. As you may know, C# does not support macros, so I had to port those macros to C# functions, which makes them much slower than the C versions. I did mark these functions for aggressive inlining, but I am not sure the compiler is agreeing to do that because the functions are quite large.

I did find a way to perform these iteration loops in parallel for some of the operations, which allows me to be faster than the C code in some situations. If I am not able to use parallel processing, then this code path will be slower, which is what we are seeing with your np.dot calls.

At some point, I/we/someone should look at calling into a C DLL to perform some of these heavy processing functions. This may be necessary in order to make this tool really competitive. If we can parallel process AND use C to perform the calculations, it may end up being way faster than the original python code.
Next up, I will try to convert the sample apps and unit tests to .NET Core so they can be run on Linux.
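For readers following along, the iterator cost Kevin describes can be illustrated with a rough sketch (the type and field names here are hypothetical, and this handles only a simplified 2-D case; the real NpyArray_ITER_NEXT port covers arbitrary dimensions and more state). Each element access walks the view's strides to compute an offset into the flat buffer, and because C# has no macros, the method can only be marked for JIT inlining:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Simplified, hypothetical sketch of a strided-view iterator.
struct StridedIter2D
{
    public int Offset;          // current element offset into the flat buffer
    int i0, i1;                 // current coordinates within the view
    readonly int n0, n1;        // shape of the view
    readonly int s0, s1;        // element strides of the view

    public StridedIter2D(int n0, int n1, int s0, int s1)
    {
        this.n0 = n0; this.n1 = n1; this.s0 = s0; this.s1 = s1;
        i0 = i1 = Offset = 0;
    }

    // In C this is a macro, so the branches are expanded inline at every
    // call site; in C# we can only ask the JIT to inline the method.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool Next()
    {
        if (++i1 < n1) { Offset += s1; return true; }
        i1 = 0; Offset -= s1 * (n1 - 1);     // rewind the inner axis
        if (++i0 < n0) { Offset += s0; return true; }
        return false;                        // iteration finished
    }
}

class Demo
{
    // Walk a transposed "view" of a flat 2x3 buffer by swapping strides.
    public static string Walk()
    {
        var buf = new double[] { 0, 1, 2, 3, 4, 5 };   // logical 2x3
        var it = new StridedIter2D(n0: 3, n1: 2, s0: 1, s1: 3);
        var parts = new List<string>();
        do parts.Add(((int)buf[it.Offset]).ToString());
        while (it.Next());
        return string.Join(" ", parts);
    }

    static void Main() => Console.WriteLine(Walk()); // prints 0 3 1 4 2 5
}
```

Per-element calls like `Next()` are exactly the hot path the profiler flags: the branches run once for every element of a 2000 × 2000 operand.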
Hi Kevin,
No problem.
The profiler is telling me that Gen 0 is getting full and being collected more than 100 times per second. I think there is some overhead in the heap allocations.
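One common way to relieve that kind of Gen 0 pressure (my illustration, not something proposed in the thread) is to pool the temporary buffers instead of allocating a fresh array on every call, using the standard `System.Buffers.ArrayPool<T>`. The helper name and the operation are made up for the sketch:

```csharp
using System;
using System.Buffers;

class PooledDotDemo
{
    // Hypothetical helper: rent the scratch buffer a dot-product-style
    // routine needs instead of allocating a fresh array per call, which
    // is what drives the Gen 0 collection rate up.
    public static double SumProducts(double[] a, double[] b)
    {
        var pool = ArrayPool<double>.Shared;
        double[] tmp = pool.Rent(a.Length);   // may return a larger array
        try
        {
            for (int i = 0; i < a.Length; i++)
                tmp[i] = a[i] * b[i];
            double sum = 0;
            for (int i = 0; i < a.Length; i++)
                sum += tmp[i];
            return sum;
        }
        finally
        {
            pool.Return(tmp);                 // buffer is reused next call
        }
    }

    static void Main() =>
        Console.WriteLine(SumProducts(new[] { 1.0, 2, 3 }, new[] { 4.0, 5, 6 })); // 32
}
```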
NpyArray_ITER_NEXT has many branches indeed, but I copied an older serial version of your MatrixProduct and got interesting results: the serial version is twice as fast.
My curiosity was to see whether it would be possible to beat Numpy's numbers with a pure C# implementation. I think the JIT can get close enough to be competitive. For the past few days, I have been digging through your code trying to understand how it is architected. What are your plans for the library architecture? Will you stay close to the Numpy architecture or move to a more object-oriented one? What were your criteria when you ported the Numpy code? Is all the C code in the numpyinternal class? What is the role of the numpyAPI class?
Only the WPF examples do not run on Linux, and only a few tests that involve dynamic assembly emitting had to be disabled because .NET Core 3.1 does not support some of the APIs involved.
Did you consider a version of ndarray using generics?
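For context on what this question is driving at (a purely illustrative skeleton, not the library's actual design), a generics-based ndarray would store a flat `T[]` plus shape and stride metadata, trading numpy's runtime dtype dispatch for compile-time typing:

```csharp
using System;

// Illustrative only: one possible shape of a generics-based ndarray.
class NdArray<T>
{
    readonly T[] data;           // flat backing buffer
    public int[] Shape { get; }
    readonly int[] strides;      // element strides per dimension

    public NdArray(int[] shape)
    {
        Shape = shape;
        strides = new int[shape.Length];
        int stride = 1;
        for (int d = shape.Length - 1; d >= 0; d--)
        {
            strides[d] = stride;  // C-contiguous (row-major) layout
            stride *= shape[d];
        }
        data = new T[stride];
    }

    public T this[params int[] idx]
    {
        get => data[FlatIndex(idx)];
        set => data[FlatIndex(idx)] = value;
    }

    int FlatIndex(int[] idx)
    {
        int offset = 0;
        for (int d = 0; d < idx.Length; d++)
            offset += idx[d] * strides[d];
        return offset;
    }
}

class NdArrayDemo
{
    static void Main()
    {
        var a = new NdArray<double>(new[] { 2, 3 });
        a[1, 2] = 7;
        Console.WriteLine(a[1, 2]); // 7
    }
}
```

The trade-off is that generic arithmetic is awkward in C# without runtime dispatch or specialization per numeric type, which may be one reason a direct port keeps numpy's dtype machinery.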
Can we communicate directly by email rather than through GitHub? I think we can do a better job sharing information than this tool allows. [email protected]
I think it is possible that when I converted MatrixProduct to parallel processing, I slowed it down for cases where there is only one thread. I do add some overhead to manage keeping it parallel. It is definitely faster when more than one thread can be employed; your test case only uses one thread, however.
I can probably change the code to check for multiple threads and branch to the old serial code if not. That would be a nice optimization.
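The fallback Kevin proposes could be sketched like this (the method, the elementwise operation, and the threshold value are all illustrative, not the library's actual MatrixProduct): only take the parallel path when there are multiple cores and enough work to amortize the scheduling overhead.

```csharp
using System;
using System.Threading.Tasks;

class MatrixOps
{
    // Tuning value, made up for the sketch; real code would benchmark this.
    const int ParallelThreshold = 10_000;

    // Elementwise multiply stands in for the heavier MatrixProduct loop;
    // the point is the dispatch, not the arithmetic.
    public static void ElementwiseMultiply(double[] a, double[] b, double[] result)
    {
        if (Environment.ProcessorCount == 1 || a.Length < ParallelThreshold)
        {
            // Serial path: no task scheduling or partitioning overhead.
            for (int i = 0; i < a.Length; i++)
                result[i] = a[i] * b[i];
        }
        else
        {
            // Parallel path: worth it only for large inputs on multiple cores.
            Parallel.For(0, a.Length, i => result[i] = a[i] * b[i]);
        }
    }

    static void Main()
    {
        var r = new double[3];
        ElementwiseMultiply(new[] { 1.0, 2, 3 }, new[] { 4.0, 5, 6 }, r);
        Console.WriteLine(string.Join(" ", r)); // 4 10 18
    }
}
```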
From: Tony Alexander Hild <[email protected]>
Sent: Tuesday, September 29, 2020 1:37 PM
To: Quansight-Labs/numpy.net <[email protected]>
Cc: KevinBaselinesw <[email protected]>; Comment <[email protected]>
Subject: Re: [Quansight-Labs/numpy.net] Core (#6)
NpyArray_ITER_NEXT has many branches indeed, but I copied an older serial version of your MatrixProduct and got interesting results: the serial version is twice as fast.
```
#Parallel#
Running Mackey...
Loading...
Elapsed: 91ms
Constructing ESN...
Elapsed: 2588ms
Fit...
Elapsed: 32129ms
Predict...
Elapsed: 20501ms
Error...
STRING { test error: 0,13960390995923377 }
Elapsed: 43ms
Total time: 55364ms

#Serial#
Running Mackey...
Loading...
Elapsed: 96ms
Constructing ESN...
Elapsed: 2584ms
Fit...
Elapsed: 15478ms
Predict...
Elapsed: 3931ms
Error...
STRING { test error: 0,13960390995923377 }
Elapsed: 31ms
Total time: 22128ms
```
<https://user-images.githubusercontent.com/338795/94589292-7bf70d80-025b-11eb-9e05-3bedab2cd815.jpeg>
As you can see in the previous image, _update takes up 71% of the time and TaskReplication takes up 67% of the time.
On the other hand, in the serial version, _update takes up only 31% of the time. The bottleneck there is MathNet.Numerics.
<https://user-images.githubusercontent.com/338795/94589293-7d283a80-025b-11eb-90eb-72234585838c.jpeg>
The code was running with 8 threads on average.
Hi Kevin,
To test on Linux, I had to update to .NET Core 3.1. I had to disable some tests, and others are breaking. If this update does not break your dev environment, you may want to consider applying it.
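In practice, the kind of update Tony describes mostly amounts to retargeting the project files. A hypothetical example of what a test project's csproj might look like after moving to .NET Core 3.1 (package names and versions are illustrative, not taken from the repository):

```xml
<!-- Illustrative project file after retargeting to .NET Core 3.1 -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.NET.Test.Sdk" Version="16.7.1" />
  </ItemGroup>
</Project>
```

WPF projects cannot be retargeted this way for Linux, since WPF itself is Windows-only, which matches the earlier note that only the WPF examples do not run there.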