
Accelerate on SLURM: server socket has failed to listen on any local network address #3312

angadsinghsandhu opened this issue Dec 24, 2024 · 0 comments
angadsinghsandhu commented Dec 24, 2024

System Info

I am trying to run a multi-node, multi-GPU job on SLURM with 2 nodes of 4 GPUs each, using DeepSpeed ZeRO Stage 3 to shard a 72B-parameter model across the GPUs so that it fits in the available VRAM. Here are the main errors I get:

gpu315: Traceback (most recent call last):
gpu315:   File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 180, in <module>
gpu315:     main()
gpu315:   File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 59, in main
gpu315:     accelerator = Accelerator()
gpu315:                   ^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 425, in __init__
gpu315:     self.state = AcceleratorState(
gpu315:                  ^^^^^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/state.py", line 861, in __init__
gpu315:     PartialState(cpu, **kwargs)
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/state.py", line 204, in __init__
gpu315:     dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 685, in init_distributed
gpu315:     cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
gpu315:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 112, in __init__
gpu315:     self.init_process_group(backend, timeout, init_method, rank, world_size)
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 142, in init_process_group
gpu315:     torch.distributed.init_process_group(backend,
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
gpu315:     return func(*args, **kwargs)
gpu315:            ^^^^^^^^^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
gpu315:     func_return = func(*args, **kwargs)
gpu315:                   ^^^^^^^^^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1520, in init_process_group
gpu315:     store, rank, world_size = next(rendezvous_iterator)
gpu315:                               ^^^^^^^^^^^^^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 269, in _env_rendezvous_handler
gpu315:     store = _create_c10d_store(
gpu315:             ^^^^^^^^^^^^^^^^^^^
gpu315:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 189, in _create_c10d_store
gpu315:     return TCPStore(
gpu315:            ^^^^^^^^^
gpu315: RuntimeError: The server socket has failed to listen on any local network address. port: 29608, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[...]
gpu316: [rank6]: Traceback (most recent call last):
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 180, in <module>
gpu316: [rank6]:     main()
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/src/gen/qwen_test_acc.py", line 70, in main
gpu316: [rank6]:     accelerator.wait_for_everyone()
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 2659, in wait_for_everyone
gpu316: [rank6]:     wait_for_everyone()
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/utils/other.py", line 144, in wait_for_everyone
gpu316: [rank6]:     PartialState().wait_for_everyone()
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/accelerate/state.py", line 375, in wait_for_everyone
gpu316: [rank6]:     torch.distributed.barrier()
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
gpu316: [rank6]:     return func(*args, **kwargs)
gpu316: [rank6]:            ^^^^^^^^^^^^^^^^^^^^^
gpu316: [rank6]:   File "/home/asandhu9/surgery-copilot/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
gpu316: [rank6]:     work = group.barrier(opts=opts)
gpu316: [rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^
gpu316: [rank6]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
gpu316: [rank6]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
gpu316: [rank6]: Last error:
gpu316: [rank6]: socketStartConnect: Connect to 10.96.4.15<60289> failed : Software caused connection abort
[...]
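
For reference, here is a minimal standalone sketch (separate from my actual script below) of the env:// rendezvous that the first traceback goes through; the address and single-process world size are placeholders, and the port is the one from the error:

# Minimal sketch of the env:// rendezvous shown in the traceback above.
# Rank 0 binds a TCPStore listener on MASTER_ADDR:MASTER_PORT, so anything already
# listening on that port on the main node raises EADDRINUSE exactly as above.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder main-node address
os.environ.setdefault("MASTER_PORT", "29608")      # the port from the error above
os.environ.setdefault("RANK", "0")                 # single-process world for the sketch
os.environ.setdefault("WORLD_SIZE", "1")

# gloo keeps the sketch runnable on one machine without GPUs
dist.init_process_group(backend="gloo", init_method="env://")
print("rendezvous ok on port", os.environ["MASTER_PORT"])
dist.destroy_process_group()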

Here is my code:

# FILE: src/gen/qwen_test_acc.py

import os
from PIL import Image
import requests
from io import BytesIO
import torch
import torch.distributed as dist
from transformers import AutoConfig, Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

from accelerate import Accelerator

def main():

    print(f"torch version: {torch.__version__}")
    print(f"cuda version: {torch.version.cuda}")
    print(f"cuda available: {torch.cuda.is_available()}")

    accelerator = Accelerator()
    device = accelerator.device  # Typically 'cuda' if GPUs are available
    rank = accelerator.process_index  # Rank assigned by `accelerate`

    print(f"Process {os.getpid()} assigned to device: {device} (Rank: {rank})")

    accelerator.wait_for_everyone()

    # Path to the downloaded model
    model_path = "/projects/surgicalaimodels/models/qwen/v2/72b/vl-instruct"

    datatype = torch.bfloat16
    # print_memory_usage is a small logging helper not shown in this snippet
    print_memory_usage(datatype=datatype, params=72_000_000_000)

    # Load model with `accelerate` environment automatically distributing shards.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=datatype,
        local_files_only=True,
    )

    # Synchronize model loading
    accelerator.wait_for_everyone()

    # Verify model consistency
    if accelerator.is_local_main_process:
        print(f"Rank {accelerator.process_index} parameter count: {sum(p.numel() for p in model.parameters())}")

    model = accelerator.prepare(model)  # Prepare the model with accelerate

    # Add a barrier before proceeding to generation
    accelerator.wait_for_everyone()

    # Load the default processor
    processor = AutoProcessor.from_pretrained(model_path)

    # The prompt
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    # Apply the chat template to the messages
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Download the image and prepare inputs
    image_url = messages[0]["content"][0]["image"]
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert("RGB")

    # Process the input (text + image)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)

    # Synchronize all ranks before generation
    torch.distributed.barrier()

    # # Verify model consistency
    # if accelerator.is_local_main_process:
    #     print(f"Rank {accelerator.process_index} parameter count: {sum(p.numel() for p in model.parameters())}")

    # # Move inputs to the appropriate device using accelerator
    # inputs = accelerator.prepare(inputs)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # # Add a barrier before proceeding to generation
    # accelerator.wait_for_everyone()

    print(f"inputs device: {inputs['input_ids'].device}, model device: {next(model.parameters()).device}")

    # call nvidia-smi to check memory usage
    if rank == 0:
        os.system("nvidia-smi")

    # Generate output
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128)

    # Remove input prefix from generated_ids to get only the newly generated tokens
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
    ]
    
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    # Print the final output
    if rank == 0:
        print("==== Output: ====")
        print(output_text[0])

    if dist.is_initialized():
        print("Destroying process group...")
        dist.destroy_process_group()

if __name__ == "__main__":
    main()

Here is my SLURM script:

#!/bin/bash -l
# FILE: sh/gen/mult_node_qwen_test.sh

#### Choose Partition
#SBATCH --partition=gpuh100

#### cluster specific settings
#SBATCH --qos=normal
#SBATCH --mem=256G
#SBATCH --time=24:00:00

#### number of nodes and tasks
# nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --nodelist=gpu315,gpu316
# GPUs
#SBATCH --gpus-per-node=4
# CPUs
#SBATCH --cpus-per-task=40

#### job specific info
#SBATCH --job-name="test-qwen"
#SBATCH --output="./out/gen/qwen-test-%j.out" # Path to store logs
#SBATCH --mail-type=ALL
#SBATCH [email protected]

######################
### Set environment ###
######################

# Record setup start time
setup_time=$(date +%s)

source ~/.bashrc

# Load modules
module purge
module load slurm/rithpc/23.02.8
module load cuda12.1/toolkit/12.1.1
source .venv/bin/activate

##########################
### Automatic Variables
##########################

# Default to 1 if the environment variable is not set
NUM_MACHINES=${SLURM_NNODES:-1}
NUM_PROCESSES=$(( ${SLURM_GPUS_PER_NODE:-1} * ${SLURM_NNODES:-1} ))
MACHINE_RANK=${SLURM_NODEID:-0}
CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}

# Generate comma-separated GPU IDs: "0,1,2,3" if SLURM_GPUS_PER_NODE=4
GPU_IDS=$(seq 0 $(( SLURM_GPUS_PER_NODE - 1 )) | paste -sd,)

# Set OMP_NUM_THREADS = number of CPUs per task
export OMP_NUM_THREADS=${CPUS_PER_TASK}

# export GPUS_PER_NODE
export GPUS_PER_NODE=${SLURM_GPUS_PER_NODE}

# Main process IP address
MAIN_IP=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

###########################
### Print Debug Info
###########################
echo "==== Job Debug Info ===="
echo "NUM_MACHINES=${NUM_MACHINES}"
echo "NUM_PROCESSES=${NUM_PROCESSES}"
echo "MACHINE_RANK=${MACHINE_RANK}"
echo "GPU_IDS=${GPU_IDS}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
echo "MAIN_IP=${MAIN_IP}"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "SLURM_GPUS_PER_NODE=$SLURM_GPUS_PER_NODE"
srun -l bash -c 'echo "Node ID: $SLURM_NODEID"'
srun -l bash -c 'echo "Node ID: $SLURM_PROCID"'
echo "========================"

echo $LD_LIBRARY_PATH
ldd $(which python)

######################
#### Set Network #####
######################

# Display network interfaces for verification
echo "#### Network Interfaces ####"
ip link show
echo "####"
ifconfig -a
echo "####"

######################
### Environment Variables ###
######################

# Set environment variables for PyTorch
export TORCH_CPP_LOG_LEVEL=INFO
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=false
export NCCL_IB_DISABLE=1

# Set NCCL to use the bonded Ethernet interface
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=bond0
export NCCL_P2P_LEVEL=TRACE

# Configure OpenMPI to use TCP, Shared Memory, and Self
export OMPI_MCA_btl=tcp,sm,self

# Verify environment variable settings
echo "NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
echo "OMPI_MCA_btl=$OMPI_MCA_btl"

# Verify that srun is in PATH
echo "#### PATH ####"
echo $PATH
echo "####"

# Optionally, check the availability of srun
which srun || { echo "ERROR: srun not found in PATH."; exit 1; }

######################
### Network and GPUs ###
######################

# Rendezvous port for the main process (hard-coded for this run)
PORT=29608
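# Hypothetical alternative (not what this run used): derive a job-unique port from
# SLURM_JOB_ID so concurrent or re-queued jobs on the same node cannot collide on a
# hard-coded port, e.g.:
#   PORT=$(( 29500 + SLURM_JOB_ID % 1000 ))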

# Check for existing processes using the chosen port
echo "#### Checking for processes using port $PORT ####"
lsof -i :$PORT
echo "###############################################"

# Alternatively, use netstat for a different perspective
echo "#### Checking port $PORT via netstat ####"
netstat -tuln | grep $PORT
echo "###############################################"

echo "#### GPUs Available ####"
nvidia-smi
nvcc --version
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "###############################################"

######################
### Run Accelerate Job ###
######################

# Record start time
echo "Job started on $(date)"
start_time=$(date +%s)

# Print setup time elapsed
elapsed=$((start_time - setup_time))

# Calculate days, hours, minutes, and seconds
days=$((elapsed / 86400))
hours=$(( (elapsed % 86400) / 3600 ))
minutes=$(( (elapsed % 3600) / 60 ))
seconds=$((elapsed % 60))

echo "Setup Time elapsed: ${days}d ${hours}h ${minutes}m ${seconds}s"

# Define the inference script to launch
INFER_SCRIPT="src/gen/qwen_test_acc.py"

python --version

# Launch the Accelerate job using srun
# Use "bf16" if supported; otherwise, use "fp16"
srun accelerate launch --debug \
    --use_deepspeed \
    --deepspeed_hostfile src/gen/configs/ds_hostfile.txt \
    --num_processes=8 \
    --num_machines=2 \
    --dynamo_backend "no" \
    --mixed_precision "bf16" \
    --machine_rank=$SLURM_NODEID \
    --main_process_ip=$MAIN_IP \
    --main_process_port=$PORT \
    --deepspeed_config_file src/gen/configs/qwen_ds_config_min.json \
    $INFER_SCRIPT
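# Untested variant: the values computed above could replace the hard-coded counts,
# e.g. --num_processes=$NUM_PROCESSES --num_machines=$NUM_MACHINES --gpu_ids=$GPU_IDS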

######################
### Post-Job Actions ###
######################

echo "Job completed on $(date)"
end_time=$(date +%s)

# Print job time elapsed
elapsed=$((end_time - start_time))

# Calculate days, hours, minutes, and seconds
days=$((elapsed / 86400))
hours=$(( (elapsed % 86400) / 3600 ))
minutes=$(( (elapsed % 3600) / 60 ))
seconds=$((elapsed % 60))

# print time elapsed
echo "Process Time Elapsed: ${days}d ${hours}h ${minutes}m ${seconds}s"

Here is my DeepSpeed config:

{
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "reduce_scatter": true,
        "contiguous_gradients": true
    },
    "bf16": {
        "enabled": true
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0
}
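
For reference, the same settings can also be handed to Accelerate in code instead of via --deepspeed_config_file; a minimal sketch (assuming the JSON above is the file at the path used in the launch command) would be:

# Sketch: build the Accelerator from the same DeepSpeed JSON config in code
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(hf_ds_config="src/gen/configs/qwen_ds_config_min.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)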

Here is my DeepSpeed hostfile:

gpu315 slots=4
gpu316 slots=4
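
For reference, a sketch of deriving the same hostfile from the SLURM allocation (assuming the 4-slots-per-node layout above) instead of keeping it static:

# Sketch: regenerate the DeepSpeed hostfile from the current allocation
scontrol show hostnames "$SLURM_JOB_NODELIST" \
    | awk '{print $1 " slots=4"}' \
    > src/gen/configs/ds_hostfile.txt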

Here is my full output file:

link: https://cl1p.net/output

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Environment used to reproduce:

  1. model: Qwen-VL-72b-Instruct
  2. CUDA version: 12.3
  3. Torch version: 2.5.1+cu121
  4. deepspeed version: 0.16.2
  5. accelerate version: 1.2.1

Expected behavior

The processes fail to recognize both nodes and cannot communicate with each other. I expected the model to be sharded across all 8 GPUs.
