Ray[LLM] Assertion "mma -> mma layout conversion is only supported on Ampere" failed #52377


Closed
nitingoyal0996 opened this issue Apr 16, 2025 · 6 comments

nitingoyal0996 commented Apr 16, 2025

I was following the documentation quick start example to refactor our Ray[Serve] + vLLM implementation to Ray[Serve,LLM], and so far it has been challenging.

After working my way through #51242, I was able to complete the deployment without LLMRouter. To use the OpenAI-compatible server, I then added LLMRouter, which failed the deployment with the following error:

(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:52,176 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 99ad0f04-9cd0-46ef-ac0a-2c0e59465dee -- CALL llm_config OK 429.5ms
INFO 2025-04-15 10:19:53,924 serve 27 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,618 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Received streaming request 6f652b14-4b89-4d4b-9ad4-cef817e8b260
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,672 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Request 6f652b14-4b89-4d4b-9ad4-cef817e8b260 started. Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Cutting Knowledge Date: December 2023
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Today Date: 26 Jul 2024
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) <|eot_id|><|start_header_id|>user<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:20:35 engine.py:275] Added request 6f652b14-4b89-4d4b-9ad4-cef817e8b260.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:51 model_runner.py:1562] Graph capturing finished in 8 secs, took 0.17 GiB
(_EngineBackgroundProcess pid=19245) /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
(_EngineBackgroundProcess pid=19245) *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245)     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497:     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) Fatal Python error: Aborted
I have a combination of V100s and A100s, and this exact cluster works fine with our older vLLM implementation. I was wondering whether additional configuration is required to handle heterogeneous GPUs.

Here are my complete configs and terminal output:

```python
# LLMConfig
import os

# Imports assumed for this snippet; in Ray 2.44 these live in ray.serve.llm:
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config_dict = {
    "model_loading_config": {
        "model_id": args.get("model", "meta-llama/Llama-3.1-8B-Instruct"),
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",
    },
    "engine_kwargs": {
        # .. vLLM engine arguments
    },
    "deployment_config": {
        "ray_actor_options": {},
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1
        }
    },
    "runtime_env": {
        "pip": ["httpx", "ray[llm,serve]==2.44.1", "vllm==0.7.2"],
        "env_vars": {
            "USE_VLLM_V1": "0",
            "HF_TOKEN": os.getenv("HF_TOKEN")
        }
    }
}
configs = LLMConfig(**llm_config_dict)

bundles=[
    {"CPU": 1, "GPU": 1} 
    for _ in range(int(args["tensor_parallel_size"]))
]

deployment = LLMServer.as_deployment(
    configs.get_serve_options(name_prefix="vLLM:"),
).options(
    placement_group_bundles=bundles,
    placement_group_strategy="PACK"
).bind(configs)

app = LLMRouter.as_deployment().bind(llm_deployments=[deployment])

return app
```
```
# Terminal Logs
(base) nitingoyal:~$ docker run --network host -v ~/storage/tmp/ray:/tmp/ray -e RAY_ADDRESS=████████████:3002 serve:latest

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2025-04-15 10:18:35,401 INFO scripts.py:494 -- Running import path: 'serve:build_app'.
INFO 04-15 10:18:38 __init__.py:190] Automatically detected platform cuda.
2025-04-15 10:18:39,518 INFO worker.py:1520 -- Using address ████████████:3002 set in the environment variable RAY_ADDRESS
2025-04-15 10:18:39,518 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: ████████████:3002...
2025-04-15 10:18:39,529 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at http://████████████:5678 
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,157 proxy ████████████ -- Proxy starting on node 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81 (HTTP port: 8000).
INFO 2025-04-15 10:18:42,264 serve 27 -- Started Serve in namespace "serve".
INFO 2025-04-15 10:18:42,281 serve 27 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,243 proxy ████████████ -- Got updated endpoints: {}.
(ServeController pid=18919) INFO 2025-04-15 10:18:42,388 controller 18919 -- Deploying new version of Deployment(name='vLLM:meta-llama--Llama-3_1-8B-Instruct', app='default') (initial target replicas: 1).
(ServeController pid=18919) INFO 2025-04-15 10:18:42,391 controller 18919 -- Deploying new version of Deployment(name='LLMRouter', app='default') (initial target replicas: 2).
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,400 proxy ████████████ -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=18988) INFO 2025-04-15 10:18:42,421 proxy ████████████ -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7f5394515820>.
(ServeController pid=18919) INFO 2025-04-15 10:18:42,502 controller 18919 -- Adding 1 replica to Deployment(name='vLLM:meta-llama--Llama-3_1-8B-Instruct', app='default').
(ServeController pid=18919) INFO 2025-04-15 10:18:42,507 controller 18919 -- Adding 2 replicas to Deployment(name='LLMRouter', app='default').
(ServeReplica:default:LLMRouter pid=2550, ip=████████████) INFO 04-15 10:18:46 __init__.py:190] Automatically detected platform cuda.
(ProxyActor pid=2630, ip=████████████) INFO 2025-04-15 10:18:48,219 proxy ████████████ -- Proxy starting on node 06553f83498eebaad508fa44d5b1912cc9ec51c786558c089bbe92c7 (HTTP port: 8000).
(ProxyActor pid=2630, ip=████████████) INFO 2025-04-15 10:18:48,274 proxy ████████████ -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=2630, ip=████████████) INFO 2025-04-15 10:18:48,287 proxy ████████████ -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7fdd1e476810>.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:18:51,153 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- No cloud storage mirror configured
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:18:51,153 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- Downloading the tokenizer for meta-llama/Llama-3.1-8B-Instruct
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) WARNING 04-15 10:18:58 config.py:2386] Casting torch.bfloat16 to torch.float16.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 04-15 10:18:50 __init__.py:190] Automatically detected platform cuda. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 04-15 10:19:05 config.py:542] This model supports multiple tasks: {'embed', 'score', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 04-15 10:19:05 config.py:1556] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:06,348 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Getting the server ready ...
(pid=19245) INFO 04-15 10:19:10 __init__.py:190] Automatically detected platform cuda.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:11 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":64}, use_cached_outputs=True, 
(_EngineBackgroundProcess pid=19245) INFO 2025-04-15 10:19:11,502 serve 19245 -- Clearing the current platform cache ...
(_EngineBackgroundProcess pid=19245) WARNING 04-15 10:19:12 ray_utils.py:180] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(_EngineBackgroundProcess pid=19245) WARNING 04-15 10:19:12 ray_utils.py:180] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 06553f83498eebaad508fa44d5b1912cc9ec51c786558c089bbe92c7. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:12 ray_distributed_executor.py:149] use_ray_spmd_worker: False
(_EngineBackgroundProcess pid=19245) Connecting to existing Ray cluster at address: ████████████:3002...
(_EngineBackgroundProcess pid=19245) Calling ray.init() again after it has already been called.
(ServeController pid=18919) WARNING 2025-04-15 10:19:12,539 controller 18919 -- Deployment 'vLLM:meta-llama--Llama-3_1-8B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=18919) WARNING 2025-04-15 10:19:12,540 controller 18919 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(pid=2711, ip=████████████) INFO 04-15 10:19:15 __init__.py:190] Automatically detected platform cuda.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:16,399 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
(pid=19315) INFO 04-15 10:19:16 __init__.py:190] Automatically detected platform cuda.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:18 cuda.py:230] Using Flash Attention backend.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:18 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:18 cuda.py:227] Using XFormers backend.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:19 utils.py:950] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:19 pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=2711, ip=████████████) WARNING 04-15 10:19:19 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:19 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='████████████', local_reader_ranks=[], buffer_handle=None, local_subscribe_port=None, remote_subscribe_port=43343)
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:20 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:08,  2.78s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.81s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:01,  1.89s/it]
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:27,451 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:28 model_runner.py:1115] Loading model weights took 7.5123 GB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 utils.py:950] Found nccl from library libnccl.so.2
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 pynccl.py:69] vLLM is using nccl==2.21.5
(_EngineBackgroundProcess pid=19245) WARNING 04-15 10:19:19 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:19 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:20 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.31s/it]
(_EngineBackgroundProcess pid=19245) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:38,495 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:39 worker.py:267] Memory profiling takes 9.60 seconds
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:39 worker.py:267] the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.90) = 35.44GiB
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:39 worker.py:267] model weights take 7.51GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 0.52GiB; the rest of the memory reserved for KV Cache is 27.13GiB.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:29 model_runner.py:1115] Loading model weights took 7.5123 GB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:40 executor_base.py:110] # CUDA blocks: 20867, # CPU blocks: 4096
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:40 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 163.02x
(ServeController pid=18919) WARNING 2025-04-15 10:19:42,631 controller 18919 -- Deployment 'vLLM:meta-llama--Llama-3_1-8B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=18919) WARNING 2025-04-15 10:19:42,632 controller 18919 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=18919) This may be caused by a slow __init__ or reconfigure method.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:43 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:   0%|          | 0/11 [00:00<?, ?it/s]
Capturing CUDA graph shapes:   9%|▉         | 1/11 [00:01<00:11,  1.19s/it]
Capturing CUDA graph shapes:  18%|█▊        | 2/11 [00:01<00:08,  1.10it/s]
Capturing CUDA graph shapes:  27%|██▋       | 3/11 [00:02<00:06,  1.22it/s]
Capturing CUDA graph shapes:  36%|███▋      | 4/11 [00:03<00:05,  1.30it/s]
Capturing CUDA graph shapes:  45%|████▌     | 5/11 [00:04<00:04,  1.32it/s]
Capturing CUDA graph shapes:  55%|█████▍    | 6/11 [00:04<00:03,  1.38it/s]
Capturing CUDA graph shapes:  64%|██████▎   | 7/11 [00:05<00:02,  1.40it/s]
Capturing CUDA graph shapes:  73%|███████▎  | 8/11 [00:06<00:02,  1.45it/s]
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:49,547 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Waiting for engine process ...
Capturing CUDA graph shapes:  82%|████████▏ | 9/11 [00:06<00:01,  1.49it/s]
Capturing CUDA graph shapes:  91%|█████████ | 10/11 [00:07<00:00,  1.53it/s]
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:50 model_runner.py:1562] Graph capturing finished in 7 secs, took 0.17 GiB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:39 worker.py:267] Memory profiling takes 9.74 seconds
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:39 worker.py:267] the current vLLM instance can use total_gpu_memory (31.73GiB) x gpu_memory_utilization (0.90) = 28.56GiB
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:39 worker.py:267] model weights take 7.51GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 0.52GiB; the rest of the memory reserved for KV Cache is 20.38GiB.
(RayWorkerWrapper pid=2711, ip=████████████) INFO 04-15 10:19:43 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:51 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 21.23 seconds
Capturing CUDA graph shapes: 100%|██████████| 11/11 [00:07<00:00,  1.38it/s]
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:51,593 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- [STATUS] Server is ready.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:51,594 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 -- Started vLLM engine.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:52,160 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 fc373110-5b2a-4f6a-bd8d-e1da77d4bcf3 -- CALL llm_config OK 415.5ms
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:19:52,176 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 99ad0f04-9cd0-46ef-ac0a-2c0e59465dee -- CALL llm_config OK 429.5ms
INFO 2025-04-15 10:19:53,924 serve 27 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,618 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Received streaming request 6f652b14-4b89-4d4b-9ad4-cef817e8b260
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) INFO 2025-04-15 10:20:35,672 default_vLLM:meta-llama--Llama-3_1-8B-Instruct met6piw3 6f652b14-4b89-4d4b-9ad4-cef817e8b260 -- Request 6f652b14-4b89-4d4b-9ad4-cef817e8b260 started. Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Cutting Knowledge Date: December 2023
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Today Date: 26 Jul 2024
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) <|eot_id|><|start_header_id|>user<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(ServeReplica:default:vLLM:meta-llama--Llama-3_1-8B-Instruct pid=19144) 
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:20:35 engine.py:275] Added request 6f652b14-4b89-4d4b-9ad4-cef817e8b260.
(_EngineBackgroundProcess pid=19245) INFO 04-15 10:19:51 model_runner.py:1562] Graph capturing finished in 8 secs, took 0.17 GiB
(_EngineBackgroundProcess pid=19245) /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
(_EngineBackgroundProcess pid=19245) *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245)     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: *** SIGABRT received at time=1744737636 on cpu 27 ***
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497: PC: @     0x7fee3f25b9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=19245) [2025-04-15 10:20:36,222 E 19245 19245] logging.cc:497:     @     0x7fee3f207520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=19245) Fatal Python error: Aborted
(_EngineBackgroundProcess pid=19245) 
(_EngineBackgroundProcess pid=19245) Stack (most recent call first):
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 216 in make_llir
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 318 in <lambda>
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/compiler/compiler.py", line 282 in compile
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 662 in run
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 345 in <lambda>
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py", line 827 in context_attention_fwd
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/paged_attn.py", line 213 in forward_prefix
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 573 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 307 in unified_attention
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/_ops.py", line 1116 in __call__
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 201 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 203 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 279 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 365 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172 in __call__
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 541 in forward
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1719 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 413 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/utils.py", line 2220 in run_method
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 566 in execute_method
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 401 in _driver_execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 275 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 408 in execute_model
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1386 in step
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 209 in engine_step
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 200 in run_engine_loop
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 137 in start
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 242 in start
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 945 in main_loop
(_EngineBackgroundProcess pid=19245)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 320 in <module>
(_EngineBackgroundProcess pid=19245) 
(_EngineBackgroundProcess pid=19245) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, uvloop.loop, ray._raylet, grpc._cython.cygrpc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, markupsafe._speedups, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, msgspec._core, PIL._imagingft, _cffi_backend, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zstandard.backend_c, pyarrow._json, vllm.cumem_allocator, sentencepiece._sentencepiece, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, 
scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, lz4._version, lz4.frame._frame, cuda_utils, __triton_launcher (total: 159)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffc00c37315c503285c649d48b25000000 Worker ID: 0807e63b47fe35f390c29e38062af140187c0bd495c31ff2bb4e0610 Node ID: 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81 Worker IP address: ████████████ Worker port: 10203 Worker PID: 19245 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
This has blocked us from moving forward. I would really appreciate any help with this issue.
nitingoyal0996 (Author) commented:

@kouroshHakha, regarding "why are you setting up the placement group yourself?": I am not aware of any best practices around placement groups or Ray itself. We have multiple nodes with a heterogeneous, unevenly distributed mix of V100s and A100s, so I was trying to make sure the model gets parallelized on one kind of GPU rather than spread across different GPU types (rough sketch of the bundle spec below).
I also tried dropping the placement group configuration entirely, but that resulted in the same error.
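For reference, the bundle spec I had in mind looked roughly like this. It is only a sketch: the accelerator_type:V100 entry assumes Ray's auto-registered per-node accelerator_type:<type> resources, and the fractional amount is just there to express the constraint.

```python
# Sketch only: also request the accelerator_type:<TYPE> resource that Ray
# registers on each node, so every tensor-parallel bundle lands on the same
# GPU family instead of being spread across V100 and A100 nodes.
tensor_parallel_size = 2

bundles = [
    {"CPU": 1, "GPU": 1, "accelerator_type:V100": 0.001}
    for _ in range(tensor_parallel_size)
]
```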

kouroshHakha (Contributor) commented Apr 16, 2025

@nitingoyal0996 you don't need to set up the placement groups yourself when using Serve LLM. We automatically create the correct placement group according to the TP and PP settings.

You can specify accelerator_type to separate the A100s from the V100s and make sure the server is scheduled on the proper accelerator.

Can you first try the simple builder pattern (which handles both LLMRouter and LLMServer under the hood) and see what you get?

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A100-80G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```
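Once the app is up, you can sanity-check it through the OpenAI-compatible endpoint. A minimal sketch with the openai client, assuming Serve's default HTTP port (8000) and the model_id from the config above:

```python
# Minimal sketch: query the OpenAI-compatible server exposed by the router.
# Assumes Serve's default HTTP port (8000) and the model_id "qwen-0.5b" above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```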

kouroshHakha (Contributor) commented:

You said vLLM works on your setup with vllm serve ...? Can you also share the vllm serve ... command that works, plus the hardware setup and vLLM version you sanity-checked the deployment on?

nitingoyal0996 (Author) commented Apr 16, 2025

@kouroshHakha I tried running it with "accelerator_type": "V100" using the builder pattern and got the same issue. The deployment itself succeeded, but when I made a curl request it broke the deployment as well.

The vllm serve setup had the exact same configuration and hardware as ray[vllm]. Here is the command I was using:

```
serve run llm_serve:build_app \
  model="meta-llama/Llama-3.1-8B-Instruct" \
  dtype=half \
  gpu_memory_utilization=0.90 \
  enable_chunked_prefill=true \
  tensor_parallel_size=2 \
  max_model_len=2048 \
  max_num_seqs=64 \
  max_num_batched_tokens=8192
```
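For completeness, this is roughly how those same arguments map into engine_kwargs for the LLMConfig above (a sketch; the keys assume vLLM's engine argument names):

```python
# Sketch: the same engine arguments expressed as engine_kwargs for LLMConfig.
# Key names assume vLLM's EngineArgs (dtype, gpu_memory_utilization, ...).
engine_kwargs = dict(
    dtype="half",
    gpu_memory_utilization=0.90,
    enable_chunked_prefill=True,
    tensor_parallel_size=2,
    max_model_len=2048,
    max_num_seqs=64,
    max_num_batched_tokens=8192,
)
```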

For reference, here are the logs from the deployment using the builder pattern:

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2025-04-16 15:10:34,766 INFO scripts.py:494 -- Running import path: 'serve:build_app'.
INFO 04-16 15:10:38 __init__.py:190] Automatically detected platform cuda.
2025-04-16 15:10:38,934 INFO worker.py:1520 -- Using address ██████████:3002 set in the environment variable RAY_ADDRESS
2025-04-16 15:10:38,934 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: ██████████:3002...
2025-04-16 15:10:38,945 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at http://██████████:5678 
(raylet) [2025-04-16 15:10:38,949 E 296 296] (raylet) process.cc:307: Process 27 does not exist.
(ProxyActor pid=55720) INFO 2025-04-16 15:10:41,572 proxy ██████████ -- Proxy starting on node 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81 (HTTP port: 8000).
INFO 2025-04-16 15:10:41,688 serve 27 -- Started Serve in namespace "serve".
INFO 2025-04-16 15:10:41,708 serve 27 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ProxyActor pid=55720) INFO 2025-04-16 15:10:41,666 proxy ██████████ -- Got updated endpoints: {}.
(ServeController pid=55650) INFO 2025-04-16 15:10:41,753 controller 55650 -- Deploying new version of Deployment(name='LLMDeployment:meta-llama--Llama-3_1-8B-Instruct', app='default') (initial target replicas: 1).
(ServeController pid=55650) INFO 2025-04-16 15:10:41,756 controller 55650 -- Deploying new version of Deployment(name='LLMRouter', app='default') (initial target replicas: 2).
(ProxyActor pid=55720) INFO 2025-04-16 15:10:41,764 proxy ██████████ -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ServeController pid=55650) INFO 2025-04-16 15:10:41,867 controller 55650 -- Adding 1 replica to Deployment(name='LLMDeployment:meta-llama--Llama-3_1-8B-Instruct', app='default').
(ServeController pid=55650) INFO 2025-04-16 15:10:41,872 controller 55650 -- Adding 2 replicas to Deployment(name='LLMRouter', app='default').
(ProxyActor pid=55720) INFO 2025-04-16 15:10:41,778 proxy ██████████ -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7f0b7ddd9b20>.
(ServeReplica:default:LLMRouter pid=3312, ip=██████████) INFO 04-16 15:10:46 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 04-16 15:10:47 __init__.py:190] Automatically detected platform cuda.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:10:47,879 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- No cloud storage mirror configured
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:10:47,879 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- Downloading the tokenizer for meta-llama/Llama-3.1-8B-Instruct
(ProxyActor pid=3393, ip=██████████) INFO 2025-04-16 15:10:48,269 proxy ██████████ -- Proxy starting on node 06553f83498eebaad508fa44d5b1912cc9ec51c786558c089bbe92c7 (HTTP port: 8000).
(ProxyActor pid=3393, ip=██████████) INFO 2025-04-16 15:10:48,318 proxy ██████████ -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=3393, ip=██████████) INFO 2025-04-16 15:10:48,338 proxy ██████████ -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7f92cbd6a450>.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) WARNING 04-16 15:10:50 config.py:2386] Casting torch.bfloat16 to torch.float16.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 04-16 15:10:57 config.py:542] This model supports multiple tasks: {'classify', 'generate', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 04-16 15:10:57 config.py:1556] Chunked prefill is enabled with max_num_batched_tokens=8192.
(ServeReplica:default:LLMRouter pid=55791) INFO 04-16 15:10:47 __init__.py:190] Automatically detected platform cuda.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:10:58,572 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- [STATUS] Getting the server ready ...
(pid=55956) INFO 04-16 15:11:03 __init__.py:190] Automatically detected platform cuda.
(_EngineBackgroundProcess pid=55956) INFO 2025-04-16 15:11:03,740 serve 55956 -- Clearing the current platform cache ...
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:03 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":64}, use_cached_outputs=True, 
(_EngineBackgroundProcess pid=55956) Connecting to existing Ray cluster at address: ██████████:3002...
(_EngineBackgroundProcess pid=55956) Calling ray.init() again after it has already been called.
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:04 ray_distributed_executor.py:149] use_ray_spmd_worker: False
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:08,623 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- [STATUS] Waiting for engine process ...
(pid=56027) INFO 04-16 15:11:08 __init__.py:190] Automatically detected platform cuda.
(pid=56028) INFO 04-16 15:11:08 __init__.py:190] Automatically detected platform cuda.
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:10 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:10 cuda.py:227] Using XFormers backend.
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:11 utils.py:950] Found nccl from library libnccl.so.2
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:11 pynccl.py:69] vLLM is using nccl==2.21.5
(ServeController pid=55650) WARNING 2025-04-16 15:11:11,918 controller 55650 -- Deployment 'LLMDeployment:meta-llama--Llama-3_1-8B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=55650) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=55650) WARNING 2025-04-16 15:11:11,919 controller 55650 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=55650) This may be caused by a slow __init__ or reconfigure method.
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:11 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ray/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:11 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_bca3539a'), local_subscribe_port=36199, remote_subscribe_port=None)
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:12 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:12 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:04<00:13,  4.50s/it]
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:19,671 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- [STATUS] Waiting for engine process ...
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:07<00:06,  3.34s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.67s/it]
(_EngineBackgroundProcess pid=55956) 
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:23 model_runner.py:1115] Loading model weights took 7.5123 GB
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:10 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:10 cuda.py:227] Using XFormers backend.
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:11 utils.py:950] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:11 pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:11 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ray/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:12 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:12 weight_utils.py:252] Using model weights format ['*.safetensors']
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:27 worker.py:267] Memory profiling takes 2.98 seconds
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:27 worker.py:267] the current vLLM instance can use total_gpu_memory (31.73GiB) x gpu_memory_utilization (0.90) = 28.56GiB
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:27 worker.py:267] model weights take 7.51GiB; non_torch_memory takes 0.32GiB; PyTorch activation peak memory takes 0.52GiB; the rest of the memory reserved for KV Cache is 20.21GiB.
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:28 executor_base.py:110] # CUDA blocks: 20657, # CPU blocks: 4096
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:28 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 161.38x
Capturing CUDA graph shapes:   0%|          | 0/11 [00:00<?, ?it/s]
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:30 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:24 model_runner.py:1115] Loading model weights took 7.5123 GB
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:30,709 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- [STATUS] Waiting for engine process ...
Capturing CUDA graph shapes:   9%|▉         | 1/11 [00:00<00:07,  1.32it/s]
Capturing CUDA graph shapes:  18%|█▊        | 2/11 [00:01<00:05,  1.51it/s]
Capturing CUDA graph shapes:  27%|██▋       | 3/11 [00:01<00:05,  1.54it/s]
Capturing CUDA graph shapes:  36%|███▋      | 4/11 [00:02<00:04,  1.59it/s]
Capturing CUDA graph shapes:  45%|████▌     | 5/11 [00:03<00:03,  1.55it/s]
Capturing CUDA graph shapes:  55%|█████▍    | 6/11 [00:03<00:03,  1.59it/s]
Capturing CUDA graph shapes:  64%|██████▎   | 7/11 [00:04<00:02,  1.59it/s]
Capturing CUDA graph shapes:  73%|███████▎  | 8/11 [00:05<00:01,  1.62it/s]
Capturing CUDA graph shapes:  82%|████████▏ | 9/11 [00:05<00:01,  1.64it/s]
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:36 custom_all_reduce.py:226] Registering 715 cuda graph addresses
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:27 worker.py:267] Memory profiling takes 3.19 seconds
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:27 worker.py:267] the current vLLM instance can use total_gpu_memory (31.73GiB) x gpu_memory_utilization (0.90) = 28.56GiB
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:27 worker.py:267] model weights take 7.51GiB; non_torch_memory takes 0.35GiB; PyTorch activation peak memory takes 0.52GiB; the rest of the memory reserved for KV Cache is 20.17GiB.
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:30 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:  91%|█████████ | 10/11 [00:06<00:00,  1.65it/s]
Capturing CUDA graph shapes: 100%|██████████| 11/11 [00:06<00:00,  1.58it/s]
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:37 model_runner.py:1562] Graph capturing finished in 7 secs, took 0.10 GiB
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:37 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 13.30 seconds
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:37,913 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- [STATUS] Server is ready.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:37,913 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 -- Started vLLM engine.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:38,330 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 6039c150-3b82-4d5f-85d3-bdca147954c3 -- CALL llm_config OK 229.1ms
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:11:38,541 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 2c58683e-c2ef-4398-913b-18a87f6af159 -- CALL llm_config OK 215.6ms
INFO 2025-04-16 15:11:40,132 serve 27 -- Application 'default' is ready at http://127.0.0.1:8000/.
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:12:06,264 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 -- Received streaming request ccfbe2fe-b308-4e9d-9b74-fd38516cffb5
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:12:06,321 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 -- Request ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 started. Prompt: <|begin_of_text|><|start_header_id|>system<|end_header_id|>
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) 
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) Cutting Knowledge Date: December 2023
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) Today Date: 26 Jul 2024
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) 
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) <|eot_id|><|start_header_id|>user<|end_header_id|>
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) 
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) How do you suggest navigating the tough job market in software engineering?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) 
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) 
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:12:06 engine.py:275] Added request ccfbe2fe-b308-4e9d-9b74-fd38516cffb5.
(_EngineBackgroundProcess pid=55956) INFO 04-16 15:11:37 custom_all_reduce.py:226] Registering 715 cuda graph addresses
(RayWorkerWrapper pid=56028) INFO 04-16 15:11:37 model_runner.py:1562] Graph capturing finished in 7 secs, took 0.10 GiB
(RayWorkerWrapper pid=56028) /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
(RayWorkerWrapper pid=56028) *** SIGABRT received at time=1744841526 on cpu 16 ***
(RayWorkerWrapper pid=56028) PC: @     0x7fae5c1049fc  (unknown)  pthread_kill
(RayWorkerWrapper pid=56028)     @     0x7fae5c0b0520  (unknown)  (unknown)
(RayWorkerWrapper pid=56028) [2025-04-16 15:12:06,448 E 56028 56028] logging.cc:497: *** SIGABRT received at time=1744841526 on cpu 16 ***
(RayWorkerWrapper pid=56028) [2025-04-16 15:12:06,448 E 56028 56028] logging.cc:497: PC: @     0x7fae5c1049fc  (unknown)  pthread_kill
(RayWorkerWrapper pid=56028) [2025-04-16 15:12:06,448 E 56028 56028] logging.cc:497:     @     0x7fae5c0b0520  (unknown)  (unknown)
(RayWorkerWrapper pid=56028) Fatal Python error: Aborted
(RayWorkerWrapper pid=56028) 
(RayWorkerWrapper pid=56028) Stack (most recent call first):
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 216 in make_llir
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 318 in <lambda>
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/compiler/compiler.py", line 282 in compile
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 662 in run
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 345 in <lambda>
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py", line 827 in context_attention_fwd
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/paged_attn.py", line 213 in forward_prefix
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 573 in forward
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 307 in unified_attention
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/_ops.py", line 1116 in __call__
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 201 in forward
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 203 in forward
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 279 in forward
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 365 in forward
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172 in __call__
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 541 in forward
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1719 in execute_model
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 413 in execute_model
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 93 in start_worker_execution_loop
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/utils.py", line 2220 in run_method
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 566 in execute_method
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 945 in main_loop
(RayWorkerWrapper pid=56028)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 320 in <module>
(RayWorkerWrapper pid=56028) 
(RayWorkerWrapper pid=56028) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, PIL._imagingft, _cffi_backend, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zstandard.backend_c, pyarrow.lib, pyarrow._hdfsio, pyarrow._json, vllm.cumem_allocator, sentencepiece._sentencepiece, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, lz4._version, lz4.frame._frame, cuda_utils, __triton_launcher (total: 113)
(_EngineBackgroundProcess pid=55956) 
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 401 in _driver_execute_model
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1386 in step
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 209 in engine_step
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 200 in run_engine_loop
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 137 in start
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 242 in start
(_EngineBackgroundProcess pid=55956) 
(_EngineBackgroundProcess pid=55956) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, uvloop.loop, ray._raylet, grpc._cython.cygrpc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, markupsafe._speedups, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, msgspec._core, PIL._imagingft, _cffi_backend, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zstandard.backend_c, pyarrow._json, vllm.cumem_allocator, sentencepiece._sentencepiece, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, 
scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, lz4._version, lz4.frame._frame, cuda_utils, __triton_launcher (total: 159)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff4b55232facb20f60cdf4654829000000 Worker ID: 1cd9d454697ad0085230346e3901d71bb8ca355053b1c58e76446727 Node ID: 348553707a378daeb5edea16f5fe9c5aafbdcd7ff550361a6988df81 Worker IP address: ██████████ Worker port: 10232 Worker PID: 56028 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(_EngineBackgroundProcess pid=55956) /home/ray/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
(_EngineBackgroundProcess pid=55956)   warnings.warn('resource_tracker: There appear to be %d '
(ProxyActor pid=55720) INFO 2025-04-16 15:12:17,295 proxy ██████████ ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 -- Client for request ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 disconnected, cancelling request.
(_EngineBackgroundProcess pid=55956) /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
(_EngineBackgroundProcess pid=55956) *** SIGABRT received at time=1744841526 on cpu 2 ***
(_EngineBackgroundProcess pid=55956) PC: @     0x7f65f0d2e9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=55956)     @     0x7f65f0cda520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=55956) [2025-04-16 15:12:06,596 E 55956 55956] logging.cc:497: *** SIGABRT received at time=1744841526 on cpu 2 ***
(_EngineBackgroundProcess pid=55956) [2025-04-16 15:12:06,596 E 55956 55956] logging.cc:497: PC: @     0x7f65f0d2e9fc  (unknown)  pthread_kill
(_EngineBackgroundProcess pid=55956) [2025-04-16 15:12:06,596 E 55956 55956] logging.cc:497:     @     0x7f65f0cda520  (unknown)  (unknown)
(_EngineBackgroundProcess pid=55956) Fatal Python error: Aborted
(_EngineBackgroundProcess pid=55956) Stack (most recent call first):
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 216 in make_llir
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 345 in <lambda> [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/compiler/compiler.py", line 282 in compile
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 662 in run
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py", line 827 in context_attention_fwd
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context [repeated 2x across cluster]
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/ops/paged_attn.py", line 213 in forward_prefix
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 541 in forward [repeated 6x across cluster]
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 307 in unified_attention
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172 in __call__ [repeated 2x across cluster]
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747 in _call_impl [repeated 4x across cluster]
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl [repeated 4x across cluster]
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/executor/ray_distributed_executor.py", line 408 in execute_model [repeated 4x across cluster]
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/utils.py", line 2220 in run_method
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 566 in execute_method
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 945 in main_loop
(_EngineBackgroundProcess pid=55956)   File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 320 in <module>
(ServeReplica:default:LLMRouter pid=55791) INFO 2025-04-16 15:12:17,296 default_LLMRouter 824re4au ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 -- POST /v1/chat/completions CANCELLED 11052.6ms
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) INFO 2025-04-16 15:12:17,296 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 -- CALL /v1/chat/completions CANCELLED 11034.1ms
(ServeReplica:default:LLMDeployment:meta-llama--Llama-3_1-8B-Instruct pid=55790) WARNING 2025-04-16 15:12:17,297 default_LLMDeployment:meta-llama--Llama-3_1-8B-Instruct hw8qbem0 ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 -- Request ccfbe2fe-b308-4e9d-9b74-fd38516cffb5 has been cancelled

kouroshHakha self-assigned this Apr 17, 2025
kouroshHakha added the P0 (Issues that should be fixed in short order) and llm labels Apr 17, 2025
@kouroshHakha
Contributor

kouroshHakha commented Apr 17, 2025

Hi @nitingoyal0996, I believe this is an issue with vLLM + V100 rather than with Ray. I was first able to reproduce it through ray[serve]; I then ran vLLM directly on V100s and saw the same failure. You can try an A10G or A100 (Ampere) and check whether the issue persists. A quick way to verify the GPU generation is sketched after the log output below.

The vLLM command used to reproduce the failure:

USE_VLLM_V1="0" vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype half --enable-chunked-prefill --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --max-model-len 2048 --max-num-seqs 64 --max-num-batched-tokens 8192
INFO 04-17 09:31:26 [chat_utils.py:379] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 04-17 09:31:26 [logger.py:39] Received request chatcmpl-78bf7db777334fc9b4bea7dfe76f6e6f: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant that outputs JSON.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nList three colors in JSON format<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.9, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1998, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     127.0.0.1:54546 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-17 09:31:26 [engine.py:310] Added request chatcmpl-78bf7db777334fc9b4bea7dfe76f6e6f.
LLVM ERROR: Failed to compute parent layout for slice layout.
LLVM ERROR: Failed to compute parent layout for slice layout.
ERROR 04-17 09:31:29 [client.py:305] RuntimeError('Engine process (pid 10381) died.')
ERROR 04-17 09:31:29 [client.py:305] NoneType: None
ERROR 04-17 09:31:33 [serving_chat.py:757] Error in chat completion stream generator.
ERROR 04-17 09:31:33 [serving_chat.py:757] Traceback (most recent call last):
ERROR 04-17 09:31:33 [serving_chat.py:757]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 376, in chat_completion_stream_generator
ERROR 04-17 09:31:33 [serving_chat.py:757]     async for res in result_generator:
ERROR 04-17 09:31:33 [serving_chat.py:757]   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 664, in _process_request
ERROR 04-17 09:31:33 [serving_chat.py:757]     raise request_output
ERROR 04-17 09:31:33 [serving_chat.py:757] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: RuntimeError('Engine process (pid 10381) died.').
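
For anyone hitting the same assertion: a minimal sketch of checking the GPU generation up front, assuming PyTorch is available in the serving environment. The helper name is illustrative only (not a vLLM or Ray API); the assertion in the log points at Ampere-only Triton layout conversions, while the V100 (Volta) reports compute capability 7.0.

import torch

def is_ampere_or_newer(device_index: int = 0) -> bool:
    # torch.cuda.get_device_capability returns (major, minor); Ampere is major == 8,
    # Hopper is 9, and the V100 (Volta) reports (7, 0).
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major >= 8

if __name__ == "__main__":
    capability = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {capability}, "
          f"Ampere or newer: {is_ampere_or_newer(0)}")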

kouroshHakha removed the P0 (Issues that should be fixed in short order) label Apr 17, 2025
@nitingoyal0996
Author

My bad, this was because the V100 doesn't support chunked prefill and I was trying to run with that argument enabled, so vLLM wouldn't let me. Here is the change PR: kaito-project/kaito#971.

Thank you for the support @kouroshHakha, appreciate it.
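
To make the workaround concrete, here is a minimal, untested sketch of the kind of config change involved. It assumes the ray.serve.llm LLMConfig / build_openai_app API and that engine_kwargs are forwarded to vLLM unchanged, so treat the exact field names and values as illustrative rather than authoritative.

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Sketch: keep chunked prefill disabled on pre-Ampere GPUs such as the V100.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="meta-llama/Llama-3.1-8B-Instruct",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # These kwargs are assumed to be passed straight through to the vLLM engine.
    engine_kwargs=dict(
        tensor_parallel_size=2,
        dtype="half",
        max_model_len=2048,
        enable_chunked_prefill=False,  # the flag that triggered the crash on V100
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)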
