Replies: 2 comments
-
I fixed the error. I should have passed the '--shm-size' option so tensor parallelism has enough shared memory. Thanks! One more question: how do I choose an appropriate shm size? Does each model need a different shm-size?
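For reference, a minimal sketch of the invocation with shared memory enlarged. The 16g value is an illustrative assumption, not a measured requirement; Docker's default /dev/shm is only 64 MB, and the vLLM docs also suggest --ipc=host as an alternative so the container shares the host's /dev/shm.

# --shm-size raises /dev/shm above Docker's 64 MB default, which the
# tensor-parallel worker processes use for inter-process communication.
$ sudo docker run --runtime nvidia --gpus all --shm-size=16g \
    -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=" \
    vllm/vllm-openai:latest --model facebook/opt-125m --tensor-parallel-size 4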
-
@jooe0824 This report included your Hugging Face token. I have removed it, but you should assume it has been compromised. I would suggest revoking the token if it is still in use.
-
I want to run vLLM with the 'tensor-parallel' option, but I get the error 'RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)'; the full log is attached below.
The instance is an AWS g4dn.12xlarge with 4 GPUs, and I also tried another (bigger) model but got the same error.
How can I use the tensor-parallel option (https://docs.vllm.ai/en/stable/serving/distributed_serving.html#multi-node-inference-and-serving)?
Is there anything I should do with NCCL?
$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=" vllm/vllm-openai:latest --model facebook/opt-125m --tensor-parallel-size 4
INFO 08-13 05:59:12 api_server.py:339] vLLM API server version 0.5.4
INFO 08-13 05:59:12 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-13 05:59:12 config.py:729] Defaulting to use mp for distributed inference
INFO 08-13 05:59:12 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-13 05:59:13 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-13 05:59:13 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=110) INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=110) INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=111) INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=111) INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
(VllmWorkerProcess pid=112) INFO 08-13 05:59:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=112) INFO 08-13 05:59:13 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=111) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=112) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=110) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=112) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=111) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=110) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=112) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=111) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=110) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=112) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=111) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=110) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=110) INFO 08-13 05:59:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=111) INFO 08-13 05:59:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=112) INFO 08-13 05:59:15 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=110) INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=111) INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=112) INFO 08-13 05:59:16 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=110) INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=111) INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=112) INFO 08-13 05:59:16 pynccl.py:63] vLLM is using nccl==2.20.5
ERROR 08-13 05:59:16 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 110 died, exit code: -15
INFO 08-13 05:59:16 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in init
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in init
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in init
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in init
super().init(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
super().init(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in init
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
self._run_workers("init_device")
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 348, in init_worker_distributed_environment
ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
initialize_model_parallel(tensor_model_parallel_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
_TP = init_model_parallel_group(group_ranks,
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
return GroupCoordinator(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in init
self.pynccl_comm = PyNcclCommunicator(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in init
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
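A sketch of how to get more detail when this happens, following the hint in the error message itself. NCCL_DEBUG=INFO is the standard NCCL debug variable; the df check assumes the image has coreutils, which the Ubuntu-based vLLM images do. The rest simply mirrors the command above.

# Re-run with NCCL debug logging so the "unhandled system error" prints its cause:
$ sudo docker run --runtime nvidia --gpus all --env NCCL_DEBUG=INFO \
    -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=" \
    vllm/vllm-openai:latest --model facebook/opt-125m --tensor-parallel-size 4

# Check how much shared memory the container actually gets (Docker defaults to 64 MB):
$ sudo docker run --rm --entrypoint df vllm/vllm-openai:latest -h /dev/shm

In this case the debug output points at shared memory, which matches the --shm-size fix described in the reply above.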