
[feature request] accelerate launcher: add numa affinities control #2241

Closed
stas00 opened this issue Dec 13, 2023 · 20 comments · Fixed by #2535

stas00 (Contributor) commented Dec 13, 2023

Feature request

As explained in pytorch/pytorch#115305, on nodes with two CPU sockets it's important to get the NUMA affinities right to avoid cross-NUMA-node traffic.

Since torchrun doesn't currently support this, a workaround was posted in pytorch/pytorch#115305 (comment); it relies on the torchrun flag --no-python, which the accelerate launcher doesn't have.

So, any suggestions on how I could use this script with accelerate?

For simplicity here is the solution for torchrun:

trampoline.sh

#!/usr/bin/bash

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
# Lowercase it to match the sysfs device-name format
BUS_ID=${BUS_ID,,}

# Find the NUMA node for device LOCAL_RANK; ${BUS_ID:4} drops the first four
# characters, since nvidia-smi prints an 8-digit PCI domain but sysfs uses 4
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on numa node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"
And the command:

torchrun --nproc_per_node=8 --monitor-interval=1 --no-python ./trampoline.sh python3 -c "print('hello')"

Update: I shared @yifuwang's workaround at https://twitter.com/StasBekman/status/1734724979496018215

If a pynvml dependency is OK, someone posted a Python solution:
https://github.com/NVIDIA/DeepLearningExamples/blob/9dd9fcb98f56187e49c5ee280cf8dbd530dde57b/TensorFlow2/LanguageModeling/BERT/gpu_affinity.py
so that would probably be easier to integrate into the launcher.
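
For reference, a rough sketch of that pynvml approach (illustrative only, not accelerate's actual implementation; assumes Linux and pip install pynvml; the function name here is made up):

import math
import os

import pynvml

def set_numa_affinity(gpu_index: int) -> None:
    """Bind the current process to the CPUs local to the given GPU."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # NVML returns the ideal CPU affinity as a bitmask packed into 64-bit words
    num_words = math.ceil(os.cpu_count() / 64)
    affinity_words = pynvml.nvmlDeviceGetCpuAffinity(handle, num_words)
    # Unpack the bitmask into a set of CPU ids
    cpus = {
        64 * i + bit
        for i, word in enumerate(affinity_words)
        for bit in range(64)
        if (word >> bit) & 1
    }
    os.sched_setaffinity(0, cpus)  # 0 == the current process

set_numa_affinity(int(os.environ.get("LOCAL_RANK", "0")))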

Thanks.

muellerzr self-assigned this Dec 13, 2023
github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stas00 (Contributor) commented Jan 12, 2024

keepalive

huggingface deleted a comment from github-actions bot Feb 7, 2024
stas00 (Contributor) commented Feb 7, 2024

keepalive

huggingface deleted a comment from github-actions bot Mar 3, 2024
muellerzr (Collaborator) commented
Will be looking into adding this this week, finally :)

stas00 (Contributor) commented Mar 4, 2024

that's exciting, Zach - thank you!

muellerzr (Collaborator) commented Mar 4, 2024

@stas00 actually wait, can't we just do this?

accelerate launch --no_python --multi_gpu --num_processes 8 --monitor_interval=1 ./trampoline.sh python3 -c "print('hello')"

(I just pulled all this from accelerate launch -h)

Or am I missing something?...

Or, with a config file:

accelerate launch --no_python --monitor_interval=1 ./trampoline.sh python3 -c "print('hello')"

muellerzr (Collaborator) commented Mar 4, 2024

Note: I'll also be making a PR to let you use both - and _ for params, since that seems to be the root cause of a lot of confusion with our CLI 😓

Edit: #2525 will make it possible to do this OOTB with no arg fixes, as the root cause was - vs _

stas00 (Contributor) commented Mar 4, 2024

If it works, then fantastic! Let's document this, please, and ideally move --no_python to the end, just before the non-Python command, as that makes the connection easier to comprehend (IMHO). That is:

accelerate launch --multi_gpu --num_processes 8 --monitor_interval=1 --no_python ./trampoline.sh python3 -c "print('hello')"
accelerate launch --monitor_interval=1 --no_python ./trampoline.sh python3 -c "print('hello')"

stas00 (Contributor) commented Mar 4, 2024

Zach, I think the trouble with the workaround is that users won't have trampoline.sh and would have to fetch it from somewhere for each new setup.

A much better solution would be for the framework to provide this independently, in its core.

muellerzr (Collaborator) commented Mar 5, 2024

Sure @stas00, I can agree with that, and we can probably extend launch to help. I'm unfamiliar with numactl, so any assistance explaining what's happening here so I can wrap my head around it would help. Let's stick with torchrun.

So we have the bash script:

#!/usr/bin/bash

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
BUS_ID=${BUS_ID,,}

# Find the numa node for device LOCAL_RANK
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on numa node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"

And the command:

torchrun --nproc_per_node=8 --monitor-interval=1 --no-python ./trampoline.sh python3 -c "print('hello')"

So torchrun will execute trampoline.sh, and the python3 part is passed through to numactl via the "$@" there.

Or, assuming the environment is set up properly, we are effectively running:

torchrun \
  --nproc_per_node=8 \
  --monitor-interval=1 \
  --no-python \
  numactl \
  --cpunodebind=$NODE \
  --membind=$NODE \
  python3 -c "print('hello')"

Is this a valid understanding of what we have going on?

muellerzr (Collaborator) commented
Tbh though, the pynvml solution makes more sense; we can add it as a CLI option and just raise an error if it's not installed. Let me work on that real quick.

stas00 (Contributor) commented Mar 5, 2024

Is this a valid understanding of what we have going on?

That looks correct. The NODE is the NUMA node, of course.

And you need to check whether numactl exists, as it's not normally installed on Linux; it has to be installed separately.
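
For illustration, a minimal existence check before shelling out (a sketch; shutil.which is the standard-library way to look up a binary on PATH):

import shutil

if shutil.which("numactl") is None:
    raise RuntimeError(
        "numactl not found on PATH; install it first (e.g. apt install numactl)"
    )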

re: pynvml

muellerzr (Collaborator) commented Mar 5, 2024

It is not; looks like we'll need to do it the hard way without pynvml (and just run a series of bash things), given that.

stas00 (Contributor) commented Mar 5, 2024

But surely what the bash script does can be done in Python, and Python has the NUMA API functionality; that would essentially be the pynvml script without the pynvml code in it.

So I think it'd be much cleaner and more user-friendly not to do it at the CLI level.
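
For illustration, a minimal pure-Python sketch of what the bash script does (CPU binding only; Linux-only, and still NVIDIA-only because of the nvidia-smi call; matching numactl's --membind would additionally need libnuma, e.g. via ctypes):

import os
import subprocess

def bind_to_gpu_numa_node(local_rank: int) -> None:
    # PCI bus id of this rank's GPU, e.g. "00000000:1A:00.0"
    bus_id = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pci.bus_id",
         "-i", str(local_rank), "--format=csv,noheader"],
        text=True,
    ).strip().lower()
    # sysfs uses a 4-digit PCI domain while nvidia-smi prints 8,
    # so drop the first four characters
    with open(f"/sys/bus/pci/devices/{bus_id[4:]}/numa_node") as f:
        node = int(f.read())
    if node < 0:
        return  # -1 means the kernel has no NUMA info for this device
    # CPUs belonging to that NUMA node, e.g. "0-23,48-71"
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpu_ranges = f.read().strip().split(",")
    cpus = set()
    for r in cpu_ranges:
        lo, _, hi = r.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)  # 0 == the current process

bind_to_gpu_numa_node(int(os.environ.get("LOCAL_RANK", "0")))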

stas00 (Contributor) commented Mar 5, 2024

Re-checked, and it's NVIDIA-only:

pynvml - Python bindings to the NVIDIA Management Library.

muellerzr (Collaborator) commented
No worries; while un-fun, I'm getting it working with some subprocess calls ;)

muellerzr (Collaborator) commented Mar 5, 2024

@stas00 if you want to try some bleeding-edge stuff, I just pushed some commits. I haven't fully tested it on a multi-GPU system yet, but at least the dry run of the commands looks like everything should have been set up properly:

pip install git+https://github.com/huggingface/accelerate@affinity

To use:

accelerate launch --multi_gpu --num_processes 8 --enable_numa_affinity myscript.py --arg1=1 ...

I'll be able to fully test in the AM :) (and enable it via config file, etc, etc)

stas00 (Contributor) commented Mar 5, 2024

Hmm, I wasn't paying attention to the bash callouts, but looking closely now, it's still NVIDIA-dependent because of BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader). For AMD it'd be rocm-smi (need to check the args), and for Gaudi2 I don't know; we should ask the optimum folks.

So the subprocess callout is probably simpler anyway than making accelerate depend on pynvml and various other libraries for each architecture. That might be inevitable in the long run as the number of vendors grows, but for now it's probably easier to first figure out from pytorch which vendor it is and then "switch" into the corresponding bus-id code.
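
For illustration, such a vendor "switch" could key off torch's build metadata (a sketch; torch.version.cuda and torch.version.hip exist in current PyTorch builds, but treat the mapping itself as an assumption):

import torch

def gpu_vendor() -> str:
    # ROCm builds of PyTorch set torch.version.hip to a version string
    if getattr(torch.version, "hip", None) is not None:
        return "amd"     # bus id would come from rocm-smi
    if torch.version.cuda is not None:
        return "nvidia"  # bus id comes from nvidia-smi
    return "unknown"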

muellerzr (Collaborator) commented Mar 7, 2024

Let's start small with the NVIDIA version; then we can add AMD and Gaudi2 as follow-ups (since we can only test the nvidia-smi version rn).

muellerzr (Collaborator) commented
@stas00 please see #2535 :)
