[FEA] Decrease Pool Size on the fly #724
Comments
cc @harrism @jrhemstad re: an offline discussion from this week
One very expensive way you can accomplish this is to spill all of your GPU data to host, destroy the pool, allocate a new pool only large enough to hold your data, and then move your data back from host to device. Otherwise this is a very difficult problem that may or may not be possible with the virtual memory APIs.
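A minimal sketch of that expensive workaround, assuming the data lives in CuPy arrays served by the RMM pool and that every other device reference has already been dropped (the `device_arrays` list and pool size are illustrative; the exact allocator hook may differ across RMM versions):

```python
import cupy as cp
import rmm

def rebuild_pool_smaller(device_arrays, new_pool_size):
    """Spill to host, recreate a smaller RMM pool, and copy the data back."""
    # 1. Spill every device array to host memory.
    host_copies = [cp.asnumpy(a) for a in device_arrays]
    device_arrays.clear()  # all device references must be gone before the pool is torn down

    # 2. Destroy the current pool and build a smaller one.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=new_pool_size)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

    # 3. Move the data back onto the device, now inside the smaller pool.
    return [cp.asarray(h) for h in host_copies]
```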
Does this just go back to PyTorch using their own memory manager? If so, I recall we raised an issue (pytorch/pytorch#43144) to discuss getting them to accept external memory managers (like RMM), and there was some discussion with them. Though I don't know what the status was there. Mark & Jake, do either of you know?
It's still in the works. I think the idea is that it's unlikely we'll ever be able to get every single library we want to interoperate with to expose hooks for external allocators. So there will likely always be a need to defrag the pool.
Ok thanks for the info. That's fair. Though do we need to worry about every library? PyTorch comes up frequently and there are many other libraries building off of it. Maybe solving this in PyTorch is enough for many use cases? Randy & Vibhu, are there other places you see this issue, or is it mainly with PyTorch?
Even if everyone uses RMM, fragmentation is a big problem, so having an explicit (possibly expensive) button to push to defragment may be valuable.
When using spaCy, it under the hood needs memory for cuBLAS initialization. With previous queries we have already grown our pool, so sometimes we don't have memory left for that initialization.
Can spaCy reuse CuPy's cuBLAS context?
Don't know enough about how they interact to answer this. Maybe @beckernick can answer this once he is back.
Wondering if we face similar problems with TensorRT too. Maybe @benfred from the NVTabular team can shed some light, in case they face this competing-pool problem between the TensorRT engine and RMM?
FWIW, I made a suggestion in the other thread on how to initialize the cuBLAS handle. Edit: Looks like spaCy just uses CuPy AFAICT; I'm not seeing any direct C calls to cuBLAS. So maybe initializing cuBLAS with CuPy is sufficient.
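As a rough sketch of that suggestion (assuming spaCy's GPU work does go through CuPy, as noted above): create CuPy's cuBLAS handle eagerly at startup, before the RMM pool has grown, so cuBLAS's own initialization gets its device memory while it is still available.

```python
import cupy as cp

# Force cuBLAS initialization for the current device up front. cublasCreate
# allocates its workspace on the device, so doing this before the RMM pool
# expands avoids competing for that memory later.
cp.cuda.device.get_cublas_handle()
```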
This issue has been labeled |
Is your feature request related to a problem? Please describe.
We often use libraries that use the RMM pool (RAPIDS, CuPy, Numba) together with libraries that have their own pool (like PyTorch). This leads to intense competition between the libraries for device memory.
The workflows often look like the following (a short illustrative sketch follows the description):

Step 1: Do cuDF/CuPy based pre-processing. This leads to expansion of the memory pool. Once pre-processing is complete, we often end up with:

final memory in use << current RMM pool size (which is the same as peak memory use)

Step 2: Do PyTorch based inference/training. This requires PyTorch to use its own pool to run inference, which competes with the RMM pool; at this point the RMM pool can be really large, causing memory problems.

The above pattern of final memory in use being much smaller than peak memory use is very common for NLP workflows, because we often go from a string representation to a numerical representation, which leads to a decrease in memory.
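A toy, runnable illustration of that pattern; the dataframe contents and the model are stand-ins, and only the library calls are real:

```python
import cudf
import cupy as cp
import rmm
import torch

# Route CuPy/cuDF allocations through an RMM pool.
rmm.reinitialize(pool_allocator=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

# Step 1: string-heavy pre-processing grows the RMM pool to its peak size.
df = cudf.DataFrame({"text": ["an example review"] * 1_000_000})  # stand-in for real data
tokens = df["text"].str.lower()                  # large string intermediates
features = tokens.str.len().astype("float32")    # string -> numerical representation
del df, tokens                                   # memory in use drops, but the pool stays at its peak

# Step 2: PyTorch's caching allocator now has to fit alongside the fully
# reserved RMM pool; with a real model this is where out-of-memory shows up.
model = torch.nn.Linear(1, 1).cuda()             # stand-in for a real model
```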
Describe the solution you'd like
I wish I could decrease the size of the RMM pool currently in use, on the fly, via the Python API (see the hypothetical sketch below).
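A purely hypothetical sketch of what such an API could look like; `release_unused()` is an invented name and does not exist in RMM today:

```python
import rmm

rmm.reinitialize(pool_allocator=True, initial_pool_size=2**31)  # 2 GiB starting pool

# ... cuDF / CuPy pre-processing grows the pool to its peak ...

# Desired: return the pool's unused reserved memory to the driver without
# touching live allocations, so PyTorch or cuBLAS can claim it afterwards.
mr = rmm.mr.get_current_device_resource()
mr.release_unused()  # hypothetical method; does not exist in RMM
```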
Additional context
This can currently help side-step some issues like:
Issue 159: rapidsai/gpu-bdb#159 (we sometimes fail because we don't have enough memory for cuBLAS initialization due to RMM pool expansion)
Add the Hugging Face/PyTorch implementation to our benchmarking script.
CC: @BartleyR / @brhodes10. This might help with #501 for cyber workflows.
CC: @EvenOldridge for inputs on the NVTabular side.