I am trying to run a multi-node, multi-GPU job on SLURM with 2 nodes of 4 GPUs each, using DeepSpeed ZeRO stage 3 to shard a 72B-parameter model across the GPUs so that it fits in the available VRAM. Here are the main errors I get:
gpu315: Traceback (most recent call last):
gpu315: File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 180, in<module>
gpu315: main()
gpu315: File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 59, in main
gpu315: accelerator = Accelerator()
gpu315: ^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 425, in __init__
gpu315: self.state = AcceleratorState(
gpu315: ^^^^^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/state.py", line 861, in __init__
gpu315: PartialState(cpu, **kwargs)
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/state.py", line 204, in __init__
gpu315: dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 685, in init_distributed
gpu315: cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
gpu315: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 112, in __init__
gpu315: self.init_process_group(backend, timeout, init_method, rank, world_size)
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 142, in init_process_group
gpu315: torch.distributed.init_process_group(backend,
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
gpu315: return func(*args, **kwargs)
gpu315: ^^^^^^^^^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
gpu315: func_return = func(*args, **kwargs)
gpu315: ^^^^^^^^^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1520, in init_process_group
gpu315: store, rank, world_size = next(rendezvous_iterator)
gpu315: ^^^^^^^^^^^^^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 269, in _env_rendezvous_handler
gpu315: store = _create_c10d_store(
gpu315: ^^^^^^^^^^^^^^^^^^^
gpu315: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 189, in _create_c10d_store
gpu315: return TCPStore(
gpu315: ^^^^^^^^^
gpu315: RuntimeError: The server socket has failed to listen on any local network address. port: 29608, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[...]
gpu316: [rank6]: Traceback (most recent call last):
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 180, in <module>
gpu316: [rank6]: main()
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 70, in main
gpu316: [rank6]: accelerator.wait_for_everyone()
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 2659, in wait_for_everyone
gpu316: [rank6]: wait_for_everyone()
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/utils/other.py", line 144, in wait_for_everyone
gpu316: [rank6]: PartialState().wait_for_everyone()
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/state.py", line 375, in wait_for_everyone
gpu316: [rank6]: torch.distributed.barrier()
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
gpu316: [rank6]: return func(*args, **kwargs)
gpu316: [rank6]: ^^^^^^^^^^^^^^^^^^^^^
gpu316: [rank6]: File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
gpu316: [rank6]: work = group.barrier(opts=opts)
gpu316: [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^
gpu316: [rank6]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
gpu316: [rank6]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
gpu316: [rank6]: Last error:
gpu316: [rank6]: socketStartConnect: Connect to 10.96.4.15<60289> failed : Software caused connection abort
[...]
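For the EADDRINUSE part, I am not sure whether port 29608 is genuinely occupied on gpu315 (for example by a stale process from an earlier failed run) or whether two launchers end up trying to bind the same port. One thing I am considering is deriving the rendezvous port from the job ID instead of hardcoding it, and stepping past anything already listening on it; this is only a sketch, assuming `ss` is available on the compute nodes:

# Sketch only: pick a per-job port instead of hardcoding 29608, and skip past
# anything that is already listening on it (assumes `ss` exists on the nodes).
PORT=$(( 29000 + SLURM_JOB_ID % 1000 ))
while ss -tln | grep -q ":${PORT} "; do
    PORT=$(( PORT + 1 ))
done
echo "Using main_process_port=${PORT}"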
Here is my code:
# FILE: src/gen/qwen_test_acc.py
import os
from PIL import Image
import requests
from io import BytesIO
import torch
import torch.distributed as dist
from transformers import AutoConfig, Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from accelerate import Accelerator


def main():
    print(f"torch version: {torch.__version__}")
    print(f"cuda version: {torch.version.cuda}")
    print(f"cuda available: {torch.cuda.is_available()}")

    accelerator = Accelerator()
    device = accelerator.device  # Typically 'cuda' if GPUs are available
    rank = accelerator.process_index  # Rank assigned by `accelerate`
    print(f"Process {os.getpid()} assigned to device: {device} (Rank: {rank})")
    accelerator.wait_for_everyone()

    # Path to the downloaded model
    model_path = "/projects/surgicalaimodels/models/qwen/v2/72b/vl-instruct"
    datatype = torch.bfloat16
    print_memory_usage(datatype=datatype, params=72_000_000_000)

    # Load model with `accelerate` environment automatically distributing shards.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=datatype,
        local_files_only=True,
    )

    # Synchronize model loading
    accelerator.wait_for_everyone()

    # Verify model consistency
    if accelerator.is_local_main_process:
        print(f"Rank {accelerator.process_index} parameter count: {sum(p.numel() for p in model.parameters())}")

    model = accelerator.prepare(model)  # Prepare the model with accelerate

    # Add a barrier before proceeding to generation
    accelerator.wait_for_everyone()

    # Load the default processor
    processor = AutoProcessor.from_pretrained(model_path)

    # The prompt
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    # Apply the chat template to the messages
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Download the image and prepare inputs
    image_url = messages[0]["content"][0]["image"]
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert("RGB")

    # Process the input (text + image)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)

    # Synchronize model loading
    torch.distributed.barrier()

    # # Verify model consistency
    # if accelerator.is_local_main_process:
    #     print(f"Rank {accelerator.process_index} parameter count: {sum(p.numel() for p in model.parameters())}")
    # # Move inputs to the appropriate device using accelerator
    # inputs = accelerator.prepare(inputs)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # # Add a barrier before proceeding to generation
    # accelerator.wait_for_everyone()
    print(f"inputs device: {inputs['input_ids'].device}, model device: {next(model.parameters()).device}")

    # Call nvidia-smi to check memory usage
    if rank == 0:
        os.system("nvidia-smi")

    # Generate output
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128)

    # Remove input prefix from generated_ids to get only the newly generated tokens
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    # Print the final output
    if rank == 0:
        print("==== Output: ====")
        print(output_text[0])

    if dist.is_initialized():
        print("Destroying process group...")
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
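Since the failure on gpu316 is a plain socket connect error back to gpu315 (10.96.4.15), and in the slurm script below I pin NCCL_SOCKET_IFNAME=bond0, one sanity check I can run inside the allocation is whether bond0 actually exists on both nodes, carries the expected 10.96.x.x addresses, and whether the nodes can reach each other. A small sketch (assuming `ip` is installed and ICMP is allowed between compute nodes):

# Sketch only: confirm the interface NCCL is pinned to exists on every node,
# and that each node in the allocation is reachable from the batch node.
srun --ntasks-per-node=1 bash -c 'echo "== $(hostname) =="; ip -o -4 addr show bond0 || echo "bond0 not found on $(hostname)"'
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    ping -c 1 -W 2 "$host" > /dev/null && echo "reachable: $host" || echo "UNREACHABLE: $host"
done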
Here is my slurm script:
#!/bin/bash -l
# FILE: sh/gen/mult_node_qwen_test.sh

#### Choose Partition
#SBATCH --partition=gpuh100

#### cluster specific settings
#SBATCH --qos=normal
#SBATCH --mem=256G
#SBATCH --time=24:00:00

#### number of nodes and tasks
# nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --nodelist=gpu315,gpu316
# GPUs
#SBATCH --gpus-per-node=4
# CPUs
#SBATCH --cpus-per-task=40

#### job specific info
#SBATCH --job-name="test-qwen"
#SBATCH --output="./out/gen/qwen-test-%j.out"   # Path to store logs
#SBATCH --mail-type=ALL
#SBATCH [email protected]

######################### Set environment #########################
# Record setup start time
setup_time=$(date +%s)
source ~/.bashrc

# Load modules
module purge
module load slurm/rithpc/23.02.8
module load cuda12.1/toolkit/12.1.1
source .venv/bin/activate

########################### Automatic Variables ###########################
# Default to 1 if the environment variable is not set
NUM_MACHINES=${SLURM_NNODES:-1}
NUM_PROCESSES=$(( ${SLURM_GPUS_PER_NODE:-1} * ${SLURM_NNODES:-1} ))
MACHINE_RANK=${SLURM_NODEID:-0}
CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}
# Generate comma-separated GPU IDs: "0,1,2,3" if SLURM_GPUS_PER_NODE=4
GPU_IDS=$(seq 0 $(( SLURM_GPUS_PER_NODE - 1 )) | paste -sd,)
# Set OMP_NUM_THREADS = number of CPUs per task
export OMP_NUM_THREADS=${CPUS_PER_TASK}
# Export GPUS_PER_NODE
export GPUS_PER_NODE=${SLURM_GPUS_PER_NODE}
# Main process IP address
MAIN_IP=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

########################### Print Debug Info ###########################
echo "==== Job Debug Info ===="
echo "NUM_MACHINES=${NUM_MACHINES}"
echo "NUM_PROCESSES=${NUM_PROCESSES}"
echo "MACHINE_RANK=${MACHINE_RANK}"
echo "GPU_IDS=${GPU_IDS}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
echo "MAIN_IP=${MAIN_IP}"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "SLURM_GPUS_PER_NODE=$SLURM_GPUS_PER_NODE"
srun -l bash -c 'echo "Node ID: $SLURM_NODEID"'
srun -l bash -c 'echo "Node ID: $SLURM_PROCID"'
echo "========================"
echo $LD_LIBRARY_PATH
ldd $(which python)

######################### Set Network #########################
# Display network interfaces for verification
echo "#### Network Interfaces ####"
ip link show
echo "####"
ifconfig -a
echo "####"

######################### Environment Variables #########################
# Set environment variables for PyTorch
export TORCH_CPP_LOG_LEVEL=INFO
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=false
export NCCL_IB_DISABLE=1
# Set NCCL to use the bonded Ethernet interface
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=bond0
export NCCL_P2P_LEVEL=TRACE
# Configure OpenMPI to use TCP, Shared Memory, and Self
export OMPI_MCA_btl=tcp,sm,self
# Verify environment variable settings
echo "NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
echo "OMPI_MCA_btl=$OMPI_MCA_btl"
# Verify that srun is in PATH
echo "#### PATH ####"
echo $PATH
echo "####"
# Optionally, check the availability of srun
which srun || { echo "ERROR: srun not found in PATH."; exit 1; }

######################### Network and GPUs #########################
# Port for the main process (hardcoded)
PORT=29608
# Check for existing processes using the port
echo "#### Checking for processes using port $PORT ####"
lsof -i :$PORT
echo "###############################################"
# Alternatively, use netstat for a different perspective
echo "#### Checking port $PORT via netstat ####"
netstat -tuln | grep -q $PORT
echo "###############################################"
echo "#### GPUs Available ####"
nvidia-smi
nvcc --version
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "###############################################"

######################### Run Accelerate Job #########################
# Record start time
echo "Job started on $(date)"
start_time=$(date +%s)
# Print setup time elapsed
elapsed=$((start_time - setup_time))
# Calculate days, hours, minutes, and seconds
days=$((elapsed / 86400))
hours=$(( (elapsed % 86400) / 3600 ))
minutes=$(( (elapsed % 3600) / 60 ))
seconds=$((elapsed % 60))
echo "Setup Time elapsed: ${days}d ${hours}h ${minutes}m ${seconds}s"

# Define your inference script and its arguments
INFER_SCRIPT="src/gen/qwen_test_acc.py"   # Replace with your actual script
python --version

# Launch the Accelerate job using srun
# Use "bf16" if supported; otherwise, use "fp16"
srun accelerate launch --debug \
--use_deepspeed \
--deepspeed_hostfile src/gen/configs/ds_hostfile.txt \
--num_processes=8 \
--num_machines=2 \
--dynamo_backend "no" \
--mixed_precision "bf16" \
--machine_rank=$SLURM_NODEID \
--main_process_ip=$MAIN_IP \
--main_process_port=$PORT \
--deepspeed_config_file src/gen/configs/qwen_ds_config_min.json \
    $INFER_SCRIPT

######################### Post-Job Actions #########################
echo "Job completed on $(date)"
end_time=$(date +%s)
# Print process time elapsed
elapsed=$((end_time - start_time))
# Calculate days, hours, minutes, and seconds
days=$((elapsed / 86400))
hours=$(( (elapsed % 86400) / 3600 ))
minutes=$(( (elapsed % 3600) / 60 ))
seconds=$((elapsed % 60))
# print time elapsed
echo "Process Time Elapsed: ${days}d ${hours}h ${minutes}m ${seconds}s"
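One thing I am unsure about in the launch step above: the batch script only runs on the first node, so I think --machine_rank=$SLURM_NODEID (and the other variables) are expanded by the batch shell before srun distributes the command, which would give both nodes machine_rank=0 and could explain both of them trying to act as the main process. Below is a sketch of the variant I am considering, where expansion is deferred to each srun task by single-quoting the command (with MAIN_IP, PORT and INFER_SCRIPT exported so the tasks inherit them):

# Sketch only: defer variable expansion so SLURM_NODEID takes its per-node
# value inside each srun task instead of the batch shell's value (0).
export MAIN_IP PORT INFER_SCRIPT
srun bash -c 'accelerate launch --debug \
    --use_deepspeed \
    --deepspeed_hostfile src/gen/configs/ds_hostfile.txt \
    --num_processes=8 \
    --num_machines=2 \
    --dynamo_backend "no" \
    --mixed_precision "bf16" \
    --machine_rank=$SLURM_NODEID \
    --main_process_ip=$MAIN_IP \
    --main_process_port=$PORT \
    --deepspeed_config_file src/gen/configs/qwen_ds_config_min.json \
    $INFER_SCRIPT'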
Here is my deepspeed config:
Here is my deepspeed hostfile:
Here is my full output file:
link: https://cl1p.net/output

Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)

Reproduction
Steps to reproduce:
model: Qwen2-VL-72B-Instruct
CUDA version: 12.3
Torch version: 2.5.1+cu121
deepspeed version: 0.16.2
accelerate version: 1.2.1

Expected behavior
The processes fail to recognize both nodes and cannot communicate with each other. I expected the model to be parallelized across all 8 GPUs.