So, I just created an exllamav2 wrapper in HF format and it works well with batching. My code is in #606. Now I have a new problem: a bigger batch means higher memory usage, and most of it goes to padding, especially when the token sequences have different lengths. Could you explain how exllamav2's paged attention works in the code? I checked the code in exllamav2/model.py, where PagedParams is used, but I don't know what to fill into its parameters.
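For context, here is a minimal sketch of why padding dominates memory in a uniform (padded) batch and how a paged cache reduces that waste. This is an illustration only, not exllamav2's actual PagedParams API; the page size of 256 and the helper function names are assumptions made for the example.

```python
# Illustrative sketch only -- not exllamav2's real PagedParams API.
# Compares KV-cache slots needed for a padded batch vs. a paged cache.

def padded_cache_slots(seq_lens: list[int]) -> int:
    # A padded batch reserves max_seq_len slots for every sequence,
    # so short sequences waste most of their allocation on padding.
    return len(seq_lens) * max(seq_lens)

def paged_cache_slots(seq_lens: list[int], page_size: int = 256) -> int:
    # A paged cache allocates whole pages per sequence, so the waste is
    # at most (page_size - 1) tokens per sequence, regardless of the
    # longest sequence in the batch.
    pages = sum((length + page_size - 1) // page_size for length in seq_lens)
    return pages * page_size

if __name__ == "__main__":
    seq_lens = [120, 900, 3500, 40]  # uneven sequence lengths in one batch
    print("padded:", padded_cache_slots(seq_lens))  # 4 * 3500 = 14000 slots
    print("paged: ", paged_cache_slots(seq_lens))   # 20 pages * 256 = 5120 slots
```

In a paged scheme each sequence only references the pages it actually uses (via a block table), which is why the paged total stays close to the sum of the real sequence lengths instead of scaling with the longest one.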