[FEA] Decrease Pool Size on the fly #724

Open
VibhuJawa opened this issue Mar 10, 2021 · 11 comments
Labels
0 - Backlog (In queue waiting for assignment), feature request (New feature or request), improvement (Improvement / enhancement to an existing function)

Comments

@VibhuJawa
Member

VibhuJawa commented Mar 10, 2021

Is your feature request related to a problem? Please describe.
A lot of the time we use libraries that allocate from the RMM pool (RAPIDS, CuPy, Numba) together with libraries that have their own pool (like PyTorch). This leads to intense competition between the libraries for device memory.

The workflows often look like the following:

  1. Step 1: Do cuDF/CuPy-based pre-processing. This expands the memory pool. Once pre-processing is complete, the memory actually in use is often much smaller than the current RMM pool size, which matches the peak memory use.

  2. Step 2: Do PyTorch-based inference/training. PyTorch uses its own pool for this, which competes with the RMM pool; at this point the RMM pool can be very large, causing memory problems.

The above pattern, where the final memory in use is far less than the peak, is very common for NLP workflows because we often go from a string representation to a numerical representation, which greatly reduces the memory needed.
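
For illustration, a rough sketch of this pattern (the data and model below are made up, not taken from a real workload; the RMM/cuDF/PyTorch calls themselves are standard):

```python
import cudf
import rmm
import torch

# Let RMM manage a growable pool; it expands on demand during Step 1.
rmm.reinitialize(pool_allocator=True)

# Step 1: string-heavy cuDF pre-processing grows the pool to its peak size.
df = cudf.DataFrame({"text": ["the quick brown fox jumps over the lazy dog"] * 1_000_000})
lengths = df["text"].str.len()  # small numeric result
del df                          # freed memory goes back to the pool,
                                # but the pool itself stays at peak size

# Step 2: PyTorch allocates from its own caching allocator, which must fit in
# whatever device memory the (still peak-sized) RMM pool has not claimed.
model = torch.nn.Linear(16, 2).cuda()
x = torch.rand(1024, 16, device="cuda")
y = model(x)
```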

Describe the solution you'd like
I wish I could decrease the RMM pool currently in use on the fly using the Python API.
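
For concreteness, a sketch of what that could look like; `release_unused_memory()` below is purely hypothetical and is not part of RMM's API, it just illustrates the requested feature:

```python
import rmm

# Start with a growable pool (size here is illustrative).
rmm.reinitialize(pool_allocator=True, initial_pool_size=2 * 2**30)

# ... cuDF/CuPy pre-processing grows the pool to its peak ...

# Hypothetical call: return unused blocks held by the pool to the CUDA driver,
# shrinking the pool to roughly the memory actually in use.
mr = rmm.mr.get_current_device_resource()
mr.release_unused_memory()  # NOT a real RMM method; this is the feature request
```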

Additional context
This could currently help sidestep some issues such as:

Issue 159: rapidsai/gpu-bdb#159 (We sometimes fail because we don't have enough memory for cuBLAS initialization due to RMM pool expansion.)

Add the Hugging Face/PyTorch implementation to our benchmarking script.

CC: @BartleyR / @brhodes10 . This might help with #501 for cyber workflows.
CC: @EvenOldridge For inputs on the NvTabular side.

@VibhuJawa VibhuJawa added ? - Needs Triage Need team to review and classify feature request New feature or request labels Mar 10, 2021
@randerzander

cc @harrism @jrhemstad re: an offline discussion from this week

@jrhemstad
Contributor

I wish I could decrease the RMM pool currently in use on the fly using the Python API.

One very expensive way you can accomplish this is to spill all of your GPU data to host, destroy the pool, allocate a new pool only large enough to hold your data, and move your data back from host to device.

Otherwise this is a very difficult problem that may or may not be possible with the virtual memory APIs.
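
A rough sketch of that spill-and-rebuild workaround, assuming CuPy arrays backed by the RMM pool; which arrays are live and what pool size to rebuild with are up to the caller, and the allocator hookup (`rmm.rmm_cupy_allocator`) should be adjusted to the RMM version in use:

```python
import cupy as cp
import rmm

def shrink_pool_by_rebuilding(gpu_arrays, new_pool_size):
    """Spill to host, rebuild a smaller pool, and copy the data back."""
    # 1. Spill all GPU data to host memory and drop the device copies.
    host_copies = [cp.asnumpy(a) for a in gpu_arrays]
    gpu_arrays.clear()

    # 2. Destroy the old pool by replacing it with a smaller one.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=new_pool_size)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

    # 3. Move the data back from host to device.
    return [cp.asarray(h) for h in host_copies]
```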

@jakirkham
Member

Does this just go back to PyTorch using their own memory manager? If so, I recall we raised an issue ( pytorch/pytorch#43144 ) to discuss getting them to accept external memory managers (like RMM), and there was some discussion with them. Though I don't know what the status was there. Mark & Jake, do either of you know?

@jrhemstad
Contributor

Does this just go back to PyTorch using their own memory manager? If so, I recall we raised an issue ( pytorch/pytorch#43144 ) to discuss getting them to accept external memory managers (like RMM), and there was some discussion with them. Though I don't know what the status was there. Mark & Jake, do either of you know?

It's still in the works. I think the idea is that it's unlikely we'll ever be able to get every single library we want to interoperate with to expose hooks for external allocators. So there will likely always be a need to defrag the pool.

@jrhemstad jrhemstad added improvement Improvement / enhancement to an existing function and removed ? - Needs Triage Need team to review and classify labels Mar 10, 2021
@jakirkham
Member

Ok thanks for the info.

That's fair. Though do we need to worry about every library? PyTorch comes up frequently and there are many other libraries building off of it. Maybe solving this in PyTorch is enough for many use cases?

Randy & Vibhu, are there other places you see this issue, or is it mainly with PyTorch?

@harrism
Member

harrism commented Mar 10, 2021

Even if everyone uses RMM, fragmentation is a big problem so having an explicit (possibly expensive) button to push to defragment may be valuable.

@VibhuJawa
Member Author

VibhuJawa commented Mar 10, 2021

Randy & Vibhu, are there other places you see this issue, or is it mainly with PyTorch?

When using spaCy, it needs memory under the hood for a cuBLAS context (rapidsai/gpu-bdb#159 (comment)).

With previous queries we have already grown our pool, so sometimes there is no memory left for cuBLAS context creation, which leads to intermittent failures. If we had this feature, we could decrease the pool in use on the fly and sidestep this issue.
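
Not the requested feature, but one partial mitigation would be capping the pool so some headroom always remains for the cuBLAS context; a sketch using `rmm.reinitialize`'s pool-size parameters (the sizes below are illustrative, not from a real configuration):

```python
import rmm

# Cap the RMM pool below total device memory so later cuBLAS context creation
# still has headroom; numbers assume a 32 GiB GPU and are purely illustrative.
rmm.reinitialize(
    pool_allocator=True,
    initial_pool_size=8 * 2**30,    # start at 8 GiB
    maximum_pool_size=28 * 2**30,   # never claim the last ~4 GiB
)
```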

@jakirkham
Member

jakirkham commented Mar 10, 2021

Can spaCy reuse CuPy's cuBLAS context?

@VibhuJawa
Member Author

VibhuJawa commented Mar 10, 2021

Can spaCy reuse CuPy's cuBLAS context?

Don't know enough about how they interact to answer this. Maybe @beckernick can answer this once he is back.

Randy & Vibhu, are there other places you see this issue, or is it mainly with PyTorch?

Wondering if we face similar problems with TensorRT too; maybe @benfred from the NVTabular team can shed some light on whether they face this competing-pool problem between the TensorRT engine and RMM?

@jakirkham
Member

jakirkham commented Mar 10, 2021

FWIW, I made a suggestion in the other thread on how to initialize the cuBLAS handle.

Edit: Looks like spaCy just uses CuPy AFAICT. I'm not seeing any direct C calls to cuBLAS. So maybe initializing cuBLAS with CuPy is sufficient.
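
A sketch of what that pre-initialization could look like; `cupy.cuda.device.get_cublas_handle()` is CuPy's handle accessor, though whether spaCy/Thinc actually reuses it depends on their internals:

```python
import cupy as cp

# Create the cuBLAS context up front, before the RMM pool has grown to claim
# most of device memory. Either line below forces handle creation.
cp.cuda.device.get_cublas_handle()
_ = cp.ones((2, 2)) @ cp.ones((2, 2))  # a tiny matmul also initializes cuBLAS
```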

@github-actions

github-actions bot commented Apr 9, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Apr 9, 2021