
[feature request] accelerate launcher: add numa affinities control #2241

Closed
stas00 opened this issue Dec 13, 2023 · 20 comments · Fixed by #2535

stas00 (Contributor) commented Dec 13, 2023

Feature request

As explained in pytorch/pytorch#115305, on nodes with two CPU sockets it's important to get the NUMA affinities right to avoid cross-NUMA-node traffic.

Since torchrun doesn't currently support this, a workaround was posted in pytorch/pytorch#115305 (comment); it relies on the torchrun flag --no-python, which the accelerate launcher doesn't have.

So, any suggestions on how I could use this script with accelerate?

For simplicity here is the solution for torchrun:

trampoline.sh

#!/usr/bin/bash

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
# Lowercase it to match the sysfs device-name format
BUS_ID=${BUS_ID,,}

# Find the NUMA node for device LOCAL_RANK; ${BUS_ID:4} drops the first four
# characters, since nvidia-smi prints an 8-digit PCI domain but sysfs uses 4
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on numa node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"
And the command:

torchrun --nproc_per_node=8 --monitor-interval=1 --no-python ./trampoline.sh python3 -c "print('hello')"

Update: I shared @yifuwang's workaround at https://twitter.com/StasBekman/status/1734724979496018215

If a pynvml dependency is OK, someone posted a Python solution:
https://github.com/NVIDIA/DeepLearningExamples/blob/9dd9fcb98f56187e49c5ee280cf8dbd530dde57b/TensorFlow2/LanguageModeling/BERT/gpu_affinity.py
so that would probably be easier to integrate into the launcher.
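
For reference, a rough sketch of that pynvml approach (illustrative only, not accelerate's actual implementation; assumes Linux and pip install pynvml; the function name here is made up):

import math
import os

import pynvml

def set_numa_affinity(gpu_index: int) -> None:
    """Bind the current process to the CPUs local to the given GPU."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # NVML returns the ideal CPU affinity as a bitmask packed into 64-bit words
    num_words = math.ceil(os.cpu_count() / 64)
    affinity_words = pynvml.nvmlDeviceGetCpuAffinity(handle, num_words)
    # Unpack the bitmask into a set of CPU ids
    cpus = {
        64 * i + bit
        for i, word in enumerate(affinity_words)
        for bit in range(64)
        if (word >> bit) & 1
    }
    os.sched_setaffinity(0, cpus)  # 0 == the current process

set_numa_affinity(int(os.environ.get("LOCAL_RANK", "0")))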

Thanks.

muellerzr self-assigned this Dec 13, 2023
github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stas00 (Contributor) commented Jan 12, 2024

keepalive

huggingface deleted a comment from github-actions bot Feb 7, 2024
stas00 (Contributor) commented Feb 7, 2024

keepalive

huggingface deleted a comment from github-actions bot Mar 3, 2024
muellerzr (Collaborator) commented
Will be looking into adding this this week, finally :)

stas00 (Contributor) commented Mar 4, 2024

that's exciting, Zach - thank you!

muellerzr (Collaborator) commented Mar 4, 2024

@stas00 actually wait, can't we just do this?

accelerate launch --no_python --multi_gpu --num_processes 8 --monitor_interval=1 ./trampoline.sh python3 -c "print('hello')"

(I just pulled all this from accelerate launch -h)

Or am I missing something?...

Or, with a config file:

accelerate launch --no_python --monitor_interval=1 ./trampoline.sh python3 -c "print('hello')"

muellerzr (Collaborator) commented Mar 4, 2024

Note: I'll also be making a PR to let you use both - and _ for params, since that seems to be the root cause of a lot of confusion with our CLI 😓

Edit: #2525 will make it possible to do this OOTB with no arg fixes, as the root cause was - vs _

stas00 (Contributor) commented Mar 4, 2024

If it works, then fantastic! Let's document this, please, and ideally move --no_python to the end, just before the non-Python command, as that makes the connection easier to comprehend (IMHO). That is:

accelerate launch --multi_gpu --num_processes 8 --monitor_interval=1 --no_python ./trampoline.sh python3 -c "print('hello')"
accelerate launch --monitor_interval=1 --no_python ./trampoline.sh python3 -c "print('hello')"

stas00 (Contributor) commented Mar 4, 2024

Zach, I think the trouble with the workaround is that users won't have trampoline.sh and would have to fetch it from somewhere for each new setup.

A much better solution would be for the framework to provide this independently, in its core.

muellerzr (Collaborator) commented Mar 5, 2024

Sure @stas00, I can agree with that, and we can probably extend launch to help. I'm unfamiliar with numactl, so any assistance explaining what's happening here so I can wrap my head around it would help. Let's stick with torchrun.

So we have the bash script:

#!/usr/bin/bash

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
BUS_ID=${BUS_ID,,}

# Find the numa node for device LOCAL_RANK
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on numa node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"

And the command:

torchrun --nproc_per_node=8 --monitor-interval=1 --no-python ./trampoline.sh python3 -c "print('hello')"

So torchrun will execute trampoline.sh, and the python3 part is passed through to numactl via the "$@" there.

Or, assuming the environment is set up properly, we are effectively running:

torchrun \
  --nproc_per_node=8 \
  --monitor-interval=1 \
  --no-python \
  numactl \
  --cpunodebind=$NODE \
  --membind=$NODE \
  python3 -c "print('hello')"

Is this a valid understanding of what we have going on?

muellerzr (Collaborator) commented
Tbh though, the pynvml solution makes more sense; we can add it as a CLI option and just raise an error if it's not installed. Let me work on that real quick.

stas00 (Contributor) commented Mar 5, 2024

Is this a valid understanding of what we have going on?

That looks correct. The NODE is the NUMA node, of course.

And you need to check whether numactl exists, as it's not normally installed on Linux; it has to be installed separately.
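
For illustration, a minimal existence check before shelling out (a sketch; shutil.which is the standard-library way to look up a binary on PATH):

import shutil

if shutil.which("numactl") is None:
    raise RuntimeError(
        "numactl not found on PATH; install it first (e.g. apt install numactl)"
    )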

re: pynvml

muellerzr (Collaborator) commented Mar 5, 2024

It is not; looks like we'll need to do it the hard way without pynvml (and just run a series of bash things), given that.

stas00 (Contributor) commented Mar 5, 2024

But surely what the bash script does can be done in Python, and Python has the NUMA API functionality; that would essentially be the pynvml script without the pynvml code in it.

So I think it'd be much cleaner and more user-friendly not to do it at the CLI level.
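
For illustration, a minimal pure-Python sketch of what the bash script does (CPU binding only; Linux-only, and still NVIDIA-only because of the nvidia-smi call; matching numactl's --membind would additionally need libnuma, e.g. via ctypes):

import os
import subprocess

def bind_to_gpu_numa_node(local_rank: int) -> None:
    # PCI bus id of this rank's GPU, e.g. "00000000:1A:00.0"
    bus_id = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pci.bus_id",
         "-i", str(local_rank), "--format=csv,noheader"],
        text=True,
    ).strip().lower()
    # sysfs uses a 4-digit PCI domain while nvidia-smi prints 8,
    # so drop the first four characters
    with open(f"/sys/bus/pci/devices/{bus_id[4:]}/numa_node") as f:
        node = int(f.read())
    if node < 0:
        return  # -1 means the kernel has no NUMA info for this device
    # CPUs belonging to that NUMA node, e.g. "0-23,48-71"
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpu_ranges = f.read().strip().split(",")
    cpus = set()
    for r in cpu_ranges:
        lo, _, hi = r.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)  # 0 == the current process

bind_to_gpu_numa_node(int(os.environ.get("LOCAL_RANK", "0")))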

stas00 (Contributor) commented Mar 5, 2024

Re-checked, and it's NVIDIA-only:

pynvml - Python bindings to the NVIDIA Management Library.

muellerzr (Collaborator) commented
No worries; while un-fun, I'm getting it working with some subprocess calls ;)

muellerzr (Collaborator) commented Mar 5, 2024

@stas00 if you want to try some bleeding-edge stuff, I just pushed some commits. I haven't fully tested it on a multi-GPU system yet, but at least the dry run of the commands looks like everything should have been set up properly:

pip install git+https://github.com/huggingface/accelerate@affinity

To use:

accelerate launch --multi_gpu --num_processes 8 --enable_numa_affinity myscript.py --arg1=1 ...

I'll be able to fully test in the AM :) (and enable it via config file, etc, etc)

stas00 (Contributor) commented Mar 5, 2024

Hmm, I wasn't paying attention to the bash callouts, but looking closely now, it's still NVIDIA-dependent because of BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader). For AMD it'd be rocm-smi (need to check the args), and for Gaudi2 I don't know; we should ask the optimum folks.

So the subprocess callout is probably simpler anyway than making accelerate depend on pynvml and various other libraries for each architecture. That might be inevitable in the long run as the number of vendors grows, but for now it's probably easier to first figure out from pytorch which vendor it is and then "switch" into the corresponding bus-id code.
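
For illustration, such a vendor "switch" could key off torch's build metadata (a sketch; torch.version.cuda and torch.version.hip exist in current PyTorch builds, but treat the mapping itself as an assumption):

import torch

def gpu_vendor() -> str:
    # ROCm builds of PyTorch set torch.version.hip to a version string
    if getattr(torch.version, "hip", None) is not None:
        return "amd"     # bus id would come from rocm-smi
    if torch.version.cuda is not None:
        return "nvidia"  # bus id comes from nvidia-smi
    return "unknown"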

muellerzr (Collaborator) commented Mar 7, 2024

Let's start small with the NVIDIA version; then we can add AMD and Gaudi2 as follow-ups (since we can only test the nvidia-smi version rn).

muellerzr (Collaborator) commented
@stas00 please see #2535 :)
