Improve launch latency #2456
maleadt added the performance label and removed the bug label on Aug 22, 2024.
maleadt changed the title from "Delays between HtoD memcopy and kernel launch" to "Improve launch latency" on Aug 22, 2024.
For the fairly simple minimal working example (MWE) shown below, I see a considerable delay between the HtoD memcopy and the kernel launch.
The times reported by Nsight Systems are:
On further inspection of `cufunction()` in `src/compiler/execution.jl` using NVTX annotations, it seems like the calls to `methodinstance` and `cached_compilation` take more time than expected. Reference discussion in the Julia Discourse here.
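One way to check whether the host-side kernel lookup dominates the delay is to separate it from the launch itself. The sketch below is not the MWE from this report: `scale_kernel!` and `compare_launch_paths` are hypothetical names, and the problem size and launch configuration are assumptions. It relies on the documented `@cuda launch=false` form, which returns a callable kernel object without launching it, and on `@elapsed` from Base to measure host-side wall time only (the launch is asynchronous, so GPU execution time is not included).

```julia
using CUDA

# Hypothetical stand-in kernel; the MWE from the report is not reproduced here.
function scale_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i]
    end
    return nothing
end

function compare_launch_paths(n = 1 << 20)
    x_host = rand(Float32, n)
    x = CuArray(x_host)            # HtoD copy
    y = similar(x)
    threads = 256                  # assumed launch configuration
    blocks = cld(n, threads)

    # Warm-up so that one-time compilation is excluded from the timings.
    @cuda threads=threads blocks=blocks scale_kernel!(y, x, 2f0)
    CUDA.synchronize()

    # Path A: @cuda performs the kernel lookup and the launch on each call.
    t_lookup_and_launch = @elapsed begin
        @cuda threads=threads blocks=blocks scale_kernel!(y, x, 2f0)
    end
    CUDA.synchronize()

    # Path B: look the kernel up once, then time only the launch.
    kernel = @cuda launch=false scale_kernel!(y, x, 2f0)
    t_launch_only = @elapsed begin
        kernel(y, x, 2f0; threads=threads, blocks=blocks)
    end
    CUDA.synchronize()

    correct = Array(y) ≈ 2f0 .* x_host
    return (; t_lookup_and_launch, t_launch_only, correct)
end

result = compare_launch_paths()
println(result)
```

If `t_launch_only` comes out much smaller than `t_lookup_and_launch`, that would be consistent with the lookup path being the expensive part. Single-shot timings are noisy, so repeating each measurement many times would give a more reliable comparison.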
MWE
Expected behavior
I expect the delay between memcopy and kernel launch to be around two orders of magnitude lower.
Project.toml
Manifest.toml
Version info
Details on Julia:
Details on CUDA:
Additional context
The Nsight Systems profiling output is shown below:
The same MWE was also run on a P100 GPU, and similar trends were seen, although each block took less time.
For other, larger cases, a large delay between kernel completion and the DtoH memcopy was also noticed. However, this may be unrelated.