Ray[LLM] Assertion: "mma -> mma layout conversion is only supported on Ampere" failed. #52377
@kouroshHakha why are you setting up the placement group yourself? -> I am not aware of any best practices around placement groups or Ray itself. We have multiple nodes with a heterogeneous mix of V100s and A100s, unevenly distributed, so I was trying to make sure the model is parallelized on one kind of GPU rather than split across different GPU types.
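For reference, a minimal sketch of the kind of manual pinning being attempted here, assuming Ray exposes the GPU model as an accelerator_type:<NAME> node resource and using two tensor-parallel workers as an example (resource names and amounts are illustrative):

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One bundle per tensor-parallel worker, each pinned to A100 nodes via the
# accelerator_type node resource (resource name assumed; verify with `ray status`).
bundles = [{"GPU": 1, "accelerator_type:A100": 0.01} for _ in range(2)]
pg = placement_group(bundles, strategy="STRICT_PACK")
ray.get(pg.ready())

# Actors/tasks would then be scheduled into the group, e.g.:
# MyActor.options(
#     scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
# ).remote()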
@nitingoyal0996 you don't need to set up the placement groups yourself when using the serve llm stuff. We automatically handle the correct placement group according to the TP and PP settings. You can specify the accelerator type directly in the LLMConfig. Can you first try the simple builder pattern (which handles both LLMRouter and LLMServer under the hood) and see what you get?
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A100-80G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
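Once that app is running, the OpenAI-compatible endpoint can be exercised with the standard client; this sketch assumes Serve's default HTTP port 8000 and the model_id from the config above:

from openai import OpenAI

# Point the standard OpenAI client at the Serve ingress (default port assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen-0.5b",  # must match model_id in the LLMConfig
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)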
You said vllm works on your setup?
@kouroshHakha I tried running it with vllm serve directly. The vllm serve setup had the exact same configuration and hardware as the Ray[Serve,LLM] deployment. Here is the command I was using -
Here are the logs from the builder-pattern deployment, for reference:
Hi @nitingoyal0996, I think this is an issue with vLLM + V100 rather than Ray. I first reproduced the issue with ray[serve], then ran vLLM directly on V100s and saw the same problem. You can try on A10G or A100 (Ampere) and see if the issue persists. The vllm cmd: USE_VLLM_V1="0" vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype half --enable-chunked-prefill --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --max-model-len 2048 --max-num-seqs 64 --max-num-batched-tokens 8192
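As a quick way to confirm which architecture a node actually has (the assertion in the title appears to be a Triton-level check that only supports the Ampere MMA layout), a small PyTorch snippet like this can be run on each worker:

import torch

# V100 reports compute capability (7, 0) (Volta); A100 reports (8, 0) (Ampere).
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")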
My bad, this was because V100 doesn't support chunked prefill, and I was passing that argument, so vLLM refused to run. Here is the fix PR - kaito-project/kaito#971. Thank you for the support @kouroshHakha, appreciate it.
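For anyone else landing here with V100s, one workaround sketch is to drop chunked prefill (and stick to fp16) in the engine arguments; the model id below is illustrative, and the exact vLLM kwargs should be checked against your vLLM version:

from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",  # illustrative id
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    accelerator_type="V100",
    engine_kwargs=dict(
        tensor_parallel_size=2,
        dtype="half",                  # V100 has no bfloat16 support
        enable_chunked_prefill=False,  # avoid the Ampere-only code path
        max_model_len=2048,
    ),
)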
I was also following the documentation quick start example to refactor our Ray[Serve] + vLLM implementation to Ray[Serve,LLM], and so far it has been challenging.
After working my way through #51242 I was able to complete the deployment without LLMRouter, but to use the OpenAI-compatible server I tried adding LLMRouter (roughly the docs-style composition sketched below), which failed the deployment with the following error -
Here are my complete configs and terminal output -
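For context, the manual composition being attempted was roughly the LLMServer + LLMRouter pattern from the docs; the exact signatures (get_serve_options, the as_deployment arguments) may differ across Ray versions, so treat this as a sketch rather than my actual config:

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    engine_kwargs=dict(tensor_parallel_size=2),
)

# Wrap the engine in an LLMServer deployment, then front it with the
# OpenAI-compatible LLMRouter ingress.
server = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
app = LLMRouter.as_deployment(llm_configs=[llm_config]).bind([server])
serve.run(app, blocking=True)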