Bug: BigLlama-3.1-681B-Instruct requires llama_model_max_nodes to return a higher value #8950

Closed · nicoboss opened this issue Aug 9, 2024 · 2 comments · Fixed by #8970
Labels: bug-unconfirmed, medium severity

Comments

nicoboss (Contributor) commented Aug 9, 2024

What happened?

This issue is a reappearance of issue #8615 (fixed by PR #8622); I recommend reading those for more background on the problem.

For Meta-Llama-3.1-405B-Instruct, the default llama_model_max_nodes value of 8192 turned out to still be enough. For its self-merge, available at https://huggingface.co/mlabonne/BigLlama-3.1-681B-Instruct (GGUFs at https://huggingface.co/mradermacher/BigLlama-3.1-681B-Instruct-GGUF/tree/main), it unfortunately is not.

To fix this, the commented-out logic in src/llama.cpp (lines 3571 to 3573 at commit 3071c0a):

```cpp
//if (model.arch == LLM_ARCH_LLAMA && model.hparams.n_layer > ??) { // llama-3 405B
//    return 32768;
//}
```

needs to be re-enabled with a condition like model.hparams.n_layer > 200 for this model to work. A good approach might be to return 8192 for up to 200 layers, 16384 above 200, and 32768 above 400; see the sketch below. To play around with this model I made the llama_model_max_nodes function always return 16384, which fixed the issue.
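
For reference, a sketch of what that tiered version could look like (the signature mirrors the snippet above; the cutoffs are only my suggestion, not upstream code):

```cpp
// Hypothetical tiering for llama_model_max_nodes: pick the graph-node
// budget from the layer count. The >200 tier covers this 210-layer
// self-merge; the >400 tier is headroom for even larger merges.
static size_t llama_model_max_nodes(const llama_model & model) {
    if (model.hparams.n_layer > 400) {
        return 32768;
    }
    if (model.hparams.n_layer > 200) {
        return 16384;
    }
    return 8192; // current default
}
```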

Name and Version

version: 3557 (3071c0a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

./llama-cli -m /bpool/BigLlama-3.1-681B-Instruct.Q4_K_S.gguf -p "I believe the meaning of life is" -c 128 -n 64 -ngl 0

main: build = 3557 (3071c0a5)                                                                                                                                
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu                                                                                           
main: seed  = 1723208630                                                                                                                                     
llama_model_loader: loaded meta data with 32 key-value pairs and 1894 tensors from /bpool/BigLlama-3.1-681B-Instruct.Q4_K_S.gguf (version GGUF V3 (latest))  
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                                                            
llama_model_loader: - kv   0:                       general.architecture str              = llama                                                            
llama_model_loader: - kv   1:                               general.type str              = model                                                            
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 405B Instruct                                     
llama_model_loader: - kv   3:                       general.organization str              = Meta Llama                                                       
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Meta-Llama-3.1                                                   
llama_model_loader: - kv   6:                         general.size_label str              = 405B                                                             
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1                                                                
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Meta Llama 3.1 405B Instruct                                     
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Meta Llama                                                       
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Met...                         
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["mergekit", "merge"]                                            
llama_model_loader: - kv  12:                          llama.block_count u32              = 210                                                              
llama_model_loader: - kv  13:                       llama.context_length u32              = 131072                                                           
llama_model_loader: - kv  14:                     llama.embedding_length u32              = 16384                                                            
llama_model_loader: - kv  15:                  llama.feed_forward_length u32              = 53248                                                            
llama_model_loader: - kv  16:                 llama.attention.head_count u32              = 128                                                              
llama_model_loader: - kv  17:              llama.attention.head_count_kv u32              = 16                                                               
llama_model_loader: - kv  18:                       llama.rope.freq_base f32              = 500000.000000                                                    
llama_model_loader: - kv  19:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                                                         
llama_model_loader: - kv  20:                          general.file_type u32              = 14                                                               
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 128256                                                           
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128                                                              
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2                                                             
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe                                                        
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...                         
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...                         
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", ...                                   
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000                                                           
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009                                                           
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...                        
llama_model_loader: - kv  31:               general.quantization_version u32              = 2                                                                
llama_model_loader: - type  f32:  422 tensors                                                                                                                
llama_model_loader: - type q4_K: 1441 tensors                                                                                                                
llama_model_loader: - type q5_K:   30 tensors                                                                                                                
llama_model_loader: - type q6_K:    1 tensors                                                                                                                
llm_load_vocab: special tokens cache size = 256                                                                                                              
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)                                                                                                     
llm_load_print_meta: arch             = llama                                                                                                                
llm_load_print_meta: vocab type       = BPE                                                                                                                  
llm_load_print_meta: n_vocab          = 128256                                                                                                               
llm_load_print_meta: n_merges         = 280147                                                                                                               
llm_load_print_meta: vocab_only       = 0                                                                                                                    
llm_load_print_meta: n_ctx_train      = 131072                                                                                                               
llm_load_print_meta: n_embd           = 16384                                                                                                                
llm_load_print_meta: n_layer          = 210                                                                                                                  
llm_load_print_meta: n_head           = 128                                                                                                                  
llm_load_print_meta: n_head_kv        = 16                                                                                                                   
llm_load_print_meta: n_rot            = 128                                                                                                                  
llm_load_print_meta: n_swa            = 0                                                                                                                    
llm_load_print_meta: n_embd_head_k    = 128                                                                                                                  
llm_load_print_meta: n_embd_head_v    = 128                                                                                                                  
llm_load_print_meta: n_gqa            = 8                                                                                                                    
llm_load_print_meta: n_embd_k_gqa     = 2048                                                                                                                 
llm_load_print_meta: n_embd_v_gqa     = 2048                                                                                                                 
llm_load_print_meta: f_norm_eps       = 0.0e+00                                                                                                              
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05                                                                                                              
llm_load_print_meta: f_clamp_kqv      = 0.0e+00                                                                                                              
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                              
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                              
llm_load_print_meta: n_ff             = 53248                                                                                                                
llm_load_print_meta: n_expert         = 0                                                                                                                    
llm_load_print_meta: n_expert_used    = 0                                                                                                                    
llm_load_print_meta: causal attn      = 1                                                                                                                    
llm_load_print_meta: pooling type     = 0                                                                                                                    
llm_load_print_meta: rope type        = 0                                                                                                                    
llm_load_print_meta: rope scaling     = linear                                                                                                               
llm_load_print_meta: freq_base_train  = 500000.0                                                                                                             
llm_load_print_meta: freq_scale_train = 1                                                                                                                    
llm_load_print_meta: n_ctx_orig_yarn  = 131072                                                                                                               
llm_load_print_meta: rope_finetuned   = unknown                                                                                                              
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 680.67 B
llm_load_print_meta: model size       = 359.76 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 405B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.89 MiB
llm_load_tensors:        CPU buffer size = 368397.47 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   210.00 MiB
llama_new_context_with_model: KV self size  =  210.00 MiB, K (f16):  105.00 MiB, V (f16):  105.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml/src/ggml-backend.c:1936: GGML_ASSERT((int)sched->hash_set.size >= measure_graph->n_nodes + measure_graph->n_leafs) failed
./llama-cli(+0x5e208)[0x5b4ba469d208]
./llama-cli(+0x60655)[0x5b4ba469f655]
./llama-cli(+0xad7c6)[0x5b4ba46ec7c6]
./llama-cli(+0x10a400)[0x5b4ba4749400]
./llama-cli(+0x1d160c)[0x5b4ba481060c]
./llama-cli(+0x3ec0e)[0x5b4ba467dc0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7b12ad92f24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7b12ad92f305]
./llama-cli(+0x45521)[0x5b4ba4684521]
nicoboss added the bug-unconfirmed and medium severity labels on Aug 9, 2024
slaren (Collaborator) commented Aug 9, 2024

Something like model.tensors_by_name.size()*5 would likely work well with every model.
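
In code, that could look something like this (a sketch; the std::max floor preserving the old 8192 default is an extra safeguard, not part of the one-liner above):

```cpp
// Sketch of a tensor-count-based budget: roughly 5 graph nodes per
// weight tensor, never dropping below the previous 8192 default.
static size_t llama_model_max_nodes(const llama_model & model) {
    return std::max<size_t>(8192, model.tensors_by_name.size()*5);
}
```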

eriktrom commented

> Maybe a good approach would be having 0-200 return 8192, >200 return 16384 and >400 return 32768. To play around with this model I made the llama_model_max_nodes function always return 16384 which fixed the issue.

thanks for sharing, saved a ton of time 👍
