Bug: BigLlama-3.1-681B-Instruct requires llama_model_max_nodes to return a higher value #8950

Closed · nicoboss opened this issue Aug 9, 2024 · 2 comments · Fixed by #8970
Labels: bug-unconfirmed, medium severity

Comments

nicoboss (Contributor) commented Aug 9, 2024

What happened?

This issue is a reappearance of issue #8615 (fixed by PR #8622); I recommend reading those for more background on the problem.

For Meta-Llama-3.1-405B-Instruct, the default llama_model_max_nodes value of 8192 turned out to still be enough. For its self-merge, available at https://huggingface.co/mlabonne/BigLlama-3.1-681B-Instruct (GGUFs at https://huggingface.co/mradermacher/BigLlama-3.1-681B-Instruct-GGUF/tree/main), it unfortunately is not.

To fix this, the commented-out logic in src/llama.cpp (lines 3571 to 3573 at commit 3071c0a):

```cpp
//if (model.arch == LLM_ARCH_LLAMA && model.hparams.n_layer > ??) { // llama-3 405B
//    return 32768;
//}
```

needs to be re-enabled with a condition like model.hparams.n_layer > 200 for this model to work. A good approach might be to return 8192 for up to 200 layers, 16384 above 200, and 32768 above 400; see the sketch below. To play around with this model I made the llama_model_max_nodes function always return 16384, which fixed the issue.
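
For reference, a sketch of what that tiered version could look like (the signature mirrors the snippet above; the cutoffs are only my suggestion, not upstream code):

```cpp
// Hypothetical tiering for llama_model_max_nodes: pick the graph-node
// budget from the layer count. The >200 tier covers this 210-layer
// self-merge; the >400 tier is headroom for even larger merges.
static size_t llama_model_max_nodes(const llama_model & model) {
    if (model.hparams.n_layer > 400) {
        return 32768;
    }
    if (model.hparams.n_layer > 200) {
        return 16384;
    }
    return 8192; // current default
}
```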

Name and Version

version: 3557 (3071c0a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

./llama-cli -m /bpool/BigLlama-3.1-681B-Instruct.Q4_K_S.gguf -p "I believe the meaning of life is" -c 128 -n 64 -ngl 0

main: build = 3557 (3071c0a5)                                                                                                                                
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu                                                                                           
main: seed  = 1723208630                                                                                                                                     
llama_model_loader: loaded meta data with 32 key-value pairs and 1894 tensors from /bpool/BigLlama-3.1-681B-Instruct.Q4_K_S.gguf (version GGUF V3 (latest))  
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                                                            
llama_model_loader: - kv   0:                       general.architecture str              = llama                                                            
llama_model_loader: - kv   1:                               general.type str              = model                                                            
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 405B Instruct                                     
llama_model_loader: - kv   3:                       general.organization str              = Meta Llama                                                       
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Meta-Llama-3.1                                                   
llama_model_loader: - kv   6:                         general.size_label str              = 405B                                                             
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1                                                                
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Meta Llama 3.1 405B Instruct                                     
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Meta Llama                                                       
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Met...                         
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["mergekit", "merge"]                                            
llama_model_loader: - kv  12:                          llama.block_count u32              = 210                                                              
llama_model_loader: - kv  13:                       llama.context_length u32              = 131072                                                           
llama_model_loader: - kv  14:                     llama.embedding_length u32              = 16384                                                            
llama_model_loader: - kv  15:                  llama.feed_forward_length u32              = 53248                                                            
llama_model_loader: - kv  16:                 llama.attention.head_count u32              = 128                                                              
llama_model_loader: - kv  17:              llama.attention.head_count_kv u32              = 16                                                               
llama_model_loader: - kv  18:                       llama.rope.freq_base f32              = 500000.000000                                                    
llama_model_loader: - kv  19:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                                                         
llama_model_loader: - kv  20:                          general.file_type u32              = 14                                                               
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 128256                                                           
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128                                                              
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2                                                             
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe                                                        
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...                         
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...                         
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", ...                                   
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000                                                           
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009                                                           
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...                        
llama_model_loader: - kv  31:               general.quantization_version u32              = 2                                                                
llama_model_loader: - type  f32:  422 tensors                                                                                                                
llama_model_loader: - type q4_K: 1441 tensors                                                                                                                
llama_model_loader: - type q5_K:   30 tensors                                                                                                                
llama_model_loader: - type q6_K:    1 tensors                                                                                                                
llm_load_vocab: special tokens cache size = 256                                                                                                              
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)                                                                                                     
llm_load_print_meta: arch             = llama                                                                                                                
llm_load_print_meta: vocab type       = BPE                                                                                                                  
llm_load_print_meta: n_vocab          = 128256                                                                                                               
llm_load_print_meta: n_merges         = 280147                                                                                                               
llm_load_print_meta: vocab_only       = 0                                                                                                                    
llm_load_print_meta: n_ctx_train      = 131072                                                                                                               
llm_load_print_meta: n_embd           = 16384                                                                                                                
llm_load_print_meta: n_layer          = 210                                                                                                                  
llm_load_print_meta: n_head           = 128                                                                                                                  
llm_load_print_meta: n_head_kv        = 16                                                                                                                   
llm_load_print_meta: n_rot            = 128                                                                                                                  
llm_load_print_meta: n_swa            = 0                                                                                                                    
llm_load_print_meta: n_embd_head_k    = 128                                                                                                                  
llm_load_print_meta: n_embd_head_v    = 128                                                                                                                  
llm_load_print_meta: n_gqa            = 8                                                                                                                    
llm_load_print_meta: n_embd_k_gqa     = 2048                                                                                                                 
llm_load_print_meta: n_embd_v_gqa     = 2048                                                                                                                 
llm_load_print_meta: f_norm_eps       = 0.0e+00                                                                                                              
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05                                                                                                              
llm_load_print_meta: f_clamp_kqv      = 0.0e+00                                                                                                              
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                              
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                              
llm_load_print_meta: n_ff             = 53248                                                                                                                
llm_load_print_meta: n_expert         = 0                                                                                                                    
llm_load_print_meta: n_expert_used    = 0                                                                                                                    
llm_load_print_meta: causal attn      = 1                                                                                                                    
llm_load_print_meta: pooling type     = 0                                                                                                                    
llm_load_print_meta: rope type        = 0                                                                                                                    
llm_load_print_meta: rope scaling     = linear                                                                                                               
llm_load_print_meta: freq_base_train  = 500000.0                                                                                                             
llm_load_print_meta: freq_scale_train = 1                                                                                                                    
llm_load_print_meta: n_ctx_orig_yarn  = 131072                                                                                                               
llm_load_print_meta: rope_finetuned   = unknown                                                                                                              
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 680.67 B
llm_load_print_meta: model size       = 359.76 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 405B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.89 MiB
llm_load_tensors:        CPU buffer size = 368397.47 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   210.00 MiB
llama_new_context_with_model: KV self size  =  210.00 MiB, K (f16):  105.00 MiB, V (f16):  105.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml/src/ggml-backend.c:1936: GGML_ASSERT((int)sched->hash_set.size >= measure_graph->n_nodes + measure_graph->n_leafs) failed
./llama-cli(+0x5e208)[0x5b4ba469d208]
./llama-cli(+0x60655)[0x5b4ba469f655]
./llama-cli(+0xad7c6)[0x5b4ba46ec7c6]
./llama-cli(+0x10a400)[0x5b4ba4749400]
./llama-cli(+0x1d160c)[0x5b4ba481060c]
./llama-cli(+0x3ec0e)[0x5b4ba467dc0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7b12ad92f24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7b12ad92f305]
./llama-cli(+0x45521)[0x5b4ba4684521]
nicoboss added the bug-unconfirmed and medium severity labels on Aug 9, 2024
slaren (Collaborator) commented Aug 9, 2024

Something like model.tensors_by_name.size()*5 would likely work well with every model.
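
In code, that could look something like this (a sketch; the std::max floor preserving the old 8192 default is an extra safeguard, not part of the one-liner above):

```cpp
// Sketch of a tensor-count-based budget: roughly 5 graph nodes per
// weight tensor, never dropping below the previous 8192 default.
static size_t llama_model_max_nodes(const llama_model & model) {
    return std::max<size_t>(8192, model.tensors_by_name.size()*5);
}
```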

eriktrom commented

> Maybe a good approach would be having 0-200 return 8192, >200 return 16384 and >400 return 32768. To play around with this model I made the llama_model_max_nodes function always return 16384 which fixed the issue.

thanks for sharing, saved a ton of time 👍
