Is it normal for a 1.5B model on an H100 80G to require several hundred hours for LiveCodeBench? #466

Open
wccccp opened this issue Mar 4, 2025 · 1 comment

Comments

@wccccp

wccccp commented Mar 4, 2025

 PyTorch version 2.5.1 available. (config.py:54)
[2025-03-04 07:33:00,913] [    INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-03-04 07:33:01,637] [    INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-03-04 07:33:14,735] [    INFO]: This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'. (config.py:549)
[2025-03-04 07:33:14,739] [    INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  (llm_engine.py:234)
[2025-03-04 07:33:16,207] [    INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-03-04 07:33:16,667] [    INFO]: Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B... (model_runner.py:1110)
[2025-03-04 07:33:17,100] [    INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
[2025-03-04 07:33:17,430] [    INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.57it/s]

[2025-03-04 07:33:18,357] [    INFO]: Loading model weights took 3.3460 GB (model_runner.py:1115)
[2025-03-04 07:33:19,149] [    INFO]: Memory profiling takes 0.65 seconds
the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.80) = 63.28GiB
model weights take 3.35GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 2.05GiB; the rest of the memory reserved for KV Cache is 57.73GiB. (worker.py:267)
[2025-03-04 07:33:19,288] [    INFO]: # cuda blocks: 135124, # CPU blocks: 9362 (executor_base.py:111)
[2025-03-04 07:33:19,288] [    INFO]: Maximum concurrency for 32768 tokens per request: 65.98x (executor_base.py:116)
[2025-03-04 07:33:21,437] [    INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00,  3.20it/s]
[2025-03-04 07:33:32,388] [    INFO]: Graph capturing finished in 11 secs, took 0.29 GiB (model_runner.py:1562)
[2025-03-04 07:33:32,389] [    INFO]: init engine (profile, create kv cache, warmup model) took 14.03 seconds (llm_engine.py:436)
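A quick sanity check of the KV-cache figures in the init log above, assuming vLLM's default block size of 16 tokens per KV-cache block (the block size is not shown in the log, so that value is an assumption):

awk 'BEGIN {
  # Figures taken from the log: 79.10 GiB GPU x 0.80 utilization = 63.28 GiB budget,
  # minus 3.35 GiB weights, 0.15 GiB non-torch memory and 2.05 GiB activation peak.
  kv_gib = 63.28 - 3.35 - 0.15 - 2.05        # -> 57.73 GiB reserved for KV cache
  blocks = 135124                            # "# cuda blocks" from the log
  tokens = blocks * 16                       # assumed 16-token block size
  printf "KV cache: %.2f GiB, max concurrency: %.2fx\n", kv_gib, tokens / 32768
}'
# -> KV cache: 57.73 GiB, max concurrency: 65.98x, matching the log

So at most roughly 66 full-length 32768-token sequences fit in the KV cache at once; running more concurrent long generations than that is what triggers the preemption warnings further down.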
[2025-03-04 07:33:33,591] [    INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-03-04 07:33:33,591] [    INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/ifeval/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 6 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/tiny_benchmarks/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/mt_bench/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 4 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/mix_eval/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 5 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/olympiade_bench/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/hle/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 21 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/lcb/main.py (registry.py:141)
[2025-03-04 07:33:33,594] [    INFO]: livecodebench/code_generation_lite v4_v5 (lighteval_task.py:187)
[2025-03-04 07:33:33,594] [ WARNING]: Careful, the task extended|lcb:codegeneration is using evaluation data to build the few shot examples. (lighteval_task.py:260)
[2025-03-04 07:33:57,702] [    INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-03-04 07:33:57,702] [    INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-03-04 07:33:57,702] [    INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:468)
[2025-03-04 07:33:57,871] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-03-04 07:33:58,037] [ WARNING]: context_size + max_new_tokens=34588 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:272)
[2025-03-04 07:39:28,396] [ WARNING]: Sequence group 16_parallel_sample_4 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1 (scheduler.py:1754)
[2025-03-04 07:42:22,800] [ WARNING]: Sequence group 13_parallel_sample_2 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51 (scheduler.py:1754)
[2025-03-04 07:54:40,640] [ WARNING]: Sequence group 24_parallel_sample_14 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101 (scheduler.py:1754)

Processed prompts:   0%|          | 2/4288 [21:06<698:35:42, 586.78s/it, est. speed input: 2.09 toks/s, output: 305.16 toks/s]
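The several-hundred-hour figure follows directly from the per-prompt rate in the progress bar: the run contains 4288 prompts and the first two took roughly 587 seconds each. A rough extrapolation, assuming the observed rate stays constant:

awk 'BEGIN {
  prompts        = 4288     # total reported by the progress bar
  sec_per_prompt = 586.78   # s/it after the first 2 prompts
  printf "estimated total: %.0f hours\n", prompts * sec_per_prompt / 3600
}'
# -> estimated total: 699 hours, consistent with the 698:35:42 remaining-time estimate

So on a single H100 with these settings (multiple parallel samples per problem, as the sequence-group names above show, each generating up to 32768 new tokens), an ETA in the hundreds of hours is what the log itself predicts.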
@wccccp

wccccp commented Mar 4, 2025

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
# TASK=aime24
# lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
#     --custom-tasks src/open_r1/evaluate.py \
#     --use-chat-template \
#     --output-dir $OUTPUT_DIR

# MATH-500
# TASK=math_500
# lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
#     --custom-tasks src/open_r1/evaluate.py \
#     --use-chat-template \
#     --output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# LiveCodeBench
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
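For reference, the preemption warnings in the log suggest giving vLLM more KV-cache memory or more parallelism. Below is a minimal sketch of the same LiveCodeBench command spread over several GPUs via data parallelism, assuming the lighteval vLLM backend accepts a data_parallel_size model argument as in the open-r1 evaluation instructions; the argument name and the NUM_GPUS value are assumptions, not taken from this issue:

NUM_GPUS=8   # assumed: number of H100s available on the node
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

# LiveCodeBench, sharded across $NUM_GPUS independent vLLM replicas
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

This only divides the wall-clock time by the number of replicas; generating up to 32768 reasoning tokens per sample remains the dominant cost.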
