Exllamav2 tokenizer kwargs are not used #1322

Open
cpfiffer opened this issue Dec 5, 2024 · 0 comments
cpfiffer commented Dec 5, 2024

Several of the kwargs documented in the docstring for the exllamav2 inference engine do not appear to be in use:

def exl2(
    model_path: str,
    draft_model_path: Optional[str] = None,
    max_seq_len: Optional[int] = None,
    cache_q4: bool = False,
    paged: bool = True,
    max_chunk_size: Optional[int] = None,
) -> ExLlamaV2Model:
    """
    Load an ExLlamaV2 model.

    Parameters
    ----------
    model_path (str)
        Path to the model directory.
    device (str)
        Device to load the model on. Pass in 'cuda' for GPU or 'cpu' for CPU.
    max_seq_len (Optional[int], optional)
        Maximum sequence length. Defaults to None.
    scale_pos_emb (Optional[float], optional)
        Scale factor for positional embeddings. Defaults to None.
    scale_alpha_value (Optional[float], optional)
        Scale alpha value. Defaults to None.
    no_flash_attn (Optional[bool], optional)
        Disable flash attention. Defaults to None.
    num_experts_per_token (Optional[int], optional)
        Number of experts per token. Defaults to None.
    cache_q4 (bool, optional)
        Use Q4 cache. Defaults to False.
    tokenizer_kwargs (dict, optional)
        Additional keyword arguments for the tokenizer. Defaults to {}.
    gpu_split (str)
        \"auto\", or VRAM allocation per GPU in GB. Auto will use exllama's autosplit feature.
    low_mem (bool, optional)
        Enable VRAM optimizations, potentially trading off speed.
    verbose (bool, optional)
        Enable if you want debugging statements.

    Returns
    -------
    An `ExLlamaV2Model` instance.

    Raises
    ------
    `ImportError` if the `exllamav2` library is not installed.
    """

The following kwargs are mentioned in the docstring but do not seem to be used:

  • tokenizer_kwargs
  • scale_pos_emb
  • scale_alpha_value
  • no_flash_attn
  • num_experts_per_token
  • gpu_split
  • low_mem
  • verbose

Used but not documented:

  • draft_model_path
  • paged
  • max_chunk_size
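
This kind of docstring/signature drift can be caught mechanically by diffing the parameters the docstring names against the function's real signature via `inspect.signature`. A minimal sketch, using a hypothetical stand-in function with the same signature as the `exl2` loader above (not the real outlines import):

```python
import inspect
from typing import Optional

# Hypothetical stand-in mirroring the exl2 signature quoted above.
def exl2(
    model_path: str,
    draft_model_path: Optional[str] = None,
    max_seq_len: Optional[int] = None,
    cache_q4: bool = False,
    paged: bool = True,
    max_chunk_size: Optional[int] = None,
):
    ...

# Parameters the docstring claims to accept (transcribed from the docstring).
documented = {
    "model_path", "device", "max_seq_len", "scale_pos_emb",
    "scale_alpha_value", "no_flash_attn", "num_experts_per_token",
    "cache_q4", "tokenizer_kwargs", "gpu_split", "low_mem", "verbose",
}

# Parameters the function actually declares.
actual = set(inspect.signature(exl2).parameters)

# Set differences expose the two mismatch directions reported above.
documented_but_unused = sorted(documented - actual)
used_but_undocumented = sorted(actual - documented)

print("documented but unused:", documented_but_unused)
print("used but undocumented:", used_but_undocumented)
```

Running this reproduces both lists in this report, which suggests a small test like it could guard the docstrings going forward.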