Is it normal for a 1.5B model on an H100 80G to require several hundred hours for LiveCodeBench? #466

Open
wccccp opened this issue Mar 4, 2025 · 1 comment

Comments

@wccccp

wccccp commented Mar 4, 2025

 PyTorch version 2.5.1 available. (config.py:54)
[2025-03-04 07:33:00,913] [    INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-03-04 07:33:01,637] [    INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-03-04 07:33:14,735] [    INFO]: This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'. (config.py:549)
[2025-03-04 07:33:14,739] [    INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  (llm_engine.py:234)
[2025-03-04 07:33:16,207] [    INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-03-04 07:33:16,667] [    INFO]: Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B... (model_runner.py:1110)
[2025-03-04 07:33:17,100] [    INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
[2025-03-04 07:33:17,430] [    INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.57it/s]

[2025-03-04 07:33:18,357] [    INFO]: Loading model weights took 3.3460 GB (model_runner.py:1115)
[2025-03-04 07:33:19,149] [    INFO]: Memory profiling takes 0.65 seconds
the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.80) = 63.28GiB
model weights take 3.35GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 2.05GiB; the rest of the memory reserved for KV Cache is 57.73GiB. (worker.py:267)
[2025-03-04 07:33:19,288] [    INFO]: # cuda blocks: 135124, # CPU blocks: 9362 (executor_base.py:111)
[2025-03-04 07:33:19,288] [    INFO]: Maximum concurrency for 32768 tokens per request: 65.98x (executor_base.py:116)
[2025-03-04 07:33:21,437] [    INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00,  3.20it/s]
[2025-03-04 07:33:32,388] [    INFO]: Graph capturing finished in 11 secs, took 0.29 GiB (model_runner.py:1562)
[2025-03-04 07:33:32,389] [    INFO]: init engine (profile, create kv cache, warmup model) took 14.03 seconds (llm_engine.py:436)
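A quick sanity check of the KV-cache figures in the init log above, assuming vLLM's default block size of 16 tokens per KV-cache block (the block size is not shown in the log, so that value is an assumption):

awk 'BEGIN {
  # Figures taken from the log: 79.10 GiB GPU x 0.80 utilization = 63.28 GiB budget,
  # minus 3.35 GiB weights, 0.15 GiB non-torch memory and 2.05 GiB activation peak.
  kv_gib = 63.28 - 3.35 - 0.15 - 2.05        # -> 57.73 GiB reserved for KV cache
  blocks = 135124                            # "# cuda blocks" from the log
  tokens = blocks * 16                       # assumed 16-token block size
  printf "KV cache: %.2f GiB, max concurrency: %.2fx\n", kv_gib, tokens / 32768
}'
# -> KV cache: 57.73 GiB, max concurrency: 65.98x, matching the log

So at most roughly 66 full-length 32768-token sequences fit in the KV cache at once; running more concurrent long generations than that is what triggers the preemption warnings further down.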
[2025-03-04 07:33:33,591] [    INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-03-04 07:33:33,591] [    INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/ifeval/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 6 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/tiny_benchmarks/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/mt_bench/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 4 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/mix_eval/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 5 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/olympiade_bench/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/hle/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [    INFO]: Found 21 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/lcb/main.py (registry.py:141)
[2025-03-04 07:33:33,594] [    INFO]: livecodebench/code_generation_lite v4_v5 (lighteval_task.py:187)
[2025-03-04 07:33:33,594] [ WARNING]: Careful, the task extended|lcb:codegeneration is using evaluation data to build the few shot examples. (lighteval_task.py:260)
[2025-03-04 07:33:57,702] [    INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-03-04 07:33:57,702] [    INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-03-04 07:33:57,702] [    INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:468)
[2025-03-04 07:33:57,871] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-03-04 07:33:58,037] [ WARNING]: context_size + max_new_tokens=34588 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:272)
[2025-03-04 07:39:28,396] [ WARNING]: Sequence group 16_parallel_sample_4 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1 (scheduler.py:1754)
[2025-03-04 07:42:22,800] [ WARNING]: Sequence group 13_parallel_sample_2 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51 (scheduler.py:1754)
[2025-03-04 07:54:40,640] [ WARNING]: Sequence group 24_parallel_sample_14 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101 (scheduler.py:1754)

Processed prompts:   0%|          | 2/4288 [21:06<698:35:42, 586.78s/it, est. speed input: 2.09 toks/s, output: 305.16 toks/s]
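The several-hundred-hour figure follows directly from the per-prompt rate in the progress bar: the run contains 4288 prompts and the first two took roughly 587 seconds each. A rough extrapolation, assuming the observed rate stays constant:

awk 'BEGIN {
  prompts        = 4288     # total reported by the progress bar
  sec_per_prompt = 586.78   # s/it after the first 2 prompts
  printf "estimated total: %.0f hours\n", prompts * sec_per_prompt / 3600
}'
# -> estimated total: 699 hours, consistent with the 698:35:42 remaining-time estimate

So on a single H100 with these settings (multiple parallel samples per problem, as the sequence-group names above show, each generating up to 32768 new tokens), an ETA in the hundreds of hours is what the log itself predicts.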
@wccccp

wccccp commented Mar 4, 2025

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
# TASK=aime24
# lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
#     --custom-tasks src/open_r1/evaluate.py \
#     --use-chat-template \
#     --output-dir $OUTPUT_DIR

# MATH-500
# TASK=math_500
# lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
#     --custom-tasks src/open_r1/evaluate.py \
#     --use-chat-template \
#     --output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# LiveCodeBench
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
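For reference, the preemption warnings in the log suggest giving vLLM more KV-cache memory or more parallelism. Below is a minimal sketch of the same LiveCodeBench command spread over several GPUs via data parallelism, assuming the lighteval vLLM backend accepts a data_parallel_size model argument as in the open-r1 evaluation instructions; the argument name and the NUM_GPUS value are assumptions, not taken from this issue:

NUM_GPUS=8   # assumed: number of H100s available on the node
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

# LiveCodeBench, sharded across $NUM_GPUS independent vLLM replicas
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

This only divides the wall-clock time by the number of replicas; generating up to 32768 reasoning tokens per sample remains the dominant cost.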
