PyTorch version 2.5.1 available. (config.py:54)
[2025-03-04 07:33:00,913] [ INFO]: --- LOADING MODEL --- (pipeline.py:186)
[2025-03-04 07:33:01,637] [ INFO]: Automatically detected platform cuda. (__init__.py:207)
[2025-03-04 07:33:14,735] [ INFO]: This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'. (config.py:549)
[2025-03-04 07:33:14,739] [ INFO]: Initializing a V0 LLM engine (v0.7.3) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, (llm_engine.py:234)
[2025-03-04 07:33:16,207] [ INFO]: Using Flash Attention backend. (cuda.py:229)
[2025-03-04 07:33:16,667] [ INFO]: Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B... (model_runner.py:1110)
[2025-03-04 07:33:17,100] [ INFO]: Using model weights format ['*.safetensors'] (weight_utils.py:254)
[2025-03-04 07:33:17,430] [ INFO]: No model.safetensors.index.json found in remote. (weight_utils.py:304)
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.57it/s]
[2025-03-04 07:33:18,357] [ INFO]: Loading model weights took 3.3460 GB (model_runner.py:1115)
[2025-03-04 07:33:19,149] [ INFO]: Memory profiling takes 0.65 seconds
the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.80) = 63.28GiB
model weights take 3.35GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 2.05GiB; the rest of the memory reserved for KV Cache is 57.73GiB. (worker.py:267)
[2025-03-04 07:33:19,288] [ INFO]: # cuda blocks: 135124, # CPU blocks: 9362 (executor_base.py:111)
[2025-03-04 07:33:19,288] [ INFO]: Maximum concurrency for 32768 tokens per request: 65.98x (executor_base.py:116)
[2025-03-04 07:33:21,437] [ INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:10<00:00, 3.20it/s]
[2025-03-04 07:33:32,388] [ INFO]: Graph capturing finished in 11 secs, took 0.29 GiB (model_runner.py:1562)
[2025-03-04 07:33:32,389] [ INFO]: init engine (profile, create kv cache, warmup model) took 14.03 seconds (llm_engine.py:436)
[2025-03-04 07:33:33,591] [ INFO]: --- LOADING TASKS --- (pipeline.py:213)
[2025-03-04 07:33:33,591] [ INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/ifeval/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [ INFO]: Found 6 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/tiny_benchmarks/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [ INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/mt_bench/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [ INFO]: Found 4 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/mix_eval/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [ INFO]: Found 5 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/olympiade_bench/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [ INFO]: Found 1 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/hle/main.py (registry.py:141)
[2025-03-04 07:33:33,592] [ INFO]: Found 21 custom tasks in /workspace/tools/miniconda3/envs/openr1/lib/python3.11/site-packages/lighteval/tasks/extended/lcb/main.py (registry.py:141)
[2025-03-04 07:33:33,594] [ INFO]: livecodebench/code_generation_lite v4_v5 (lighteval_task.py:187)
[2025-03-04 07:33:33,594] [ WARNING]: Careful, the task extended|lcb:codegeneration is using evaluation data to build the few shot examples. (lighteval_task.py:260)
[2025-03-04 07:33:57,702] [ INFO]: --- INIT SEEDS --- (pipeline.py:259)
[2025-03-04 07:33:57,702] [ INFO]: --- RUNNING MODEL --- (pipeline.py:464)
[2025-03-04 07:33:57,702] [ INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:468)
[2025-03-04 07:33:57,871] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 0%| | 0/1 [00:00<?, ?it/s]
[2025-03-04 07:33:58,037] [ WARNING]: context_size + max_new_tokens=34588 which is greater than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:272)
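The truncation warning above is the first real problem: with max_length=32768 and max_new_tokens set close to 32768, the prompt budget collapses and the context is truncated to 0 tokens, so generation effectively starts from an empty prompt. A minimal sketch of the trade-off using the raw vLLM Python API (not the lighteval pipeline itself; only the model name and the 32768/34588 numbers come from the log, everything else is an illustrative assumption):

```python
from vllm import LLM, SamplingParams

# Numbers taken from the warning above; everything else is illustrative.
MAX_MODEL_LEN = 32768           # self.max_length in the warning
PROMPT_TOKENS = 34588 - 32768   # context size implied by the warning (~1820 tokens)

# If prompt length + max_tokens exceeds max_model_len, the context gets truncated
# (here all the way to 0 tokens). Capping max_tokens leaves room for the prompt.
sampling = SamplingParams(
    temperature=0.6,                           # illustrative sampling settings
    top_p=0.95,
    max_tokens=MAX_MODEL_LEN - PROMPT_TOKENS,  # keep the prompt intact
)

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    dtype="bfloat16",
    max_model_len=MAX_MODEL_LEN,  # raising this (if the model supports longer contexts) is the other option
    seed=1234,
)
```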
[2025-03-04 07:39:28,396] [ WARNING]: Sequence group 16_parallel_sample_4 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1 (scheduler.py:1754)
[2025-03-04 07:42:22,800] [ WARNING]: Sequence group 13_parallel_sample_2 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51 (scheduler.py:1754)
[2025-03-04 07:54:40,640] [ WARNING]: Sequence group 24_parallel_sample_14 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101 (scheduler.py:1754)
Processed prompts: 0%| | 2/4288 [21:06<698:35:42, 586.78s/it, est. speed input: 2.09 toks/s, output: 305.16 toks/s]
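The repeated RECOMPUTE preemptions together with the ~698 h ETA point at KV-cache pressure: each of the 4288 prompts is sampled many times in parallel and can generate up to ~32k tokens, so the 57.73 GiB cache fills up and sequences keep getting recomputed. A hedged sketch of the engine-level knobs the scheduler warning refers to, again via the raw vLLM API rather than lighteval (the specific values are assumptions to show the direction of the change, not a recommendation):

```python
from vllm import LLM

# Knobs the scheduler warning points at. The values below are assumptions;
# they only illustrate which direction to move each setting.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    dtype="bfloat16",
    max_model_len=32768,
    gpu_memory_utilization=0.9,  # the log used 0.80; higher leaves more room for the KV cache
    tensor_parallel_size=2,      # shards the KV cache across GPUs, if more than one is available
    max_num_seqs=64,             # fewer concurrent sequences -> less cache pressure
    enforce_eager=True,          # optional: skips CUDA graph capture (0.29 GiB in this log)
    seed=1234,
)
```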