How to implement paged attention in HF format? #616

I just implemented exllamav2 in HF format and it works well with batching. My code is in #606. Now I have a new problem: a bigger batch means higher memory usage, and most of it goes to padding, especially when the token sequences differ in length. Could you explain how exllamav2's paged attention works in code? I checked [exllamav2/model.py](https://github.com/turboderp/exllamav2/blob/40e37f494488d930bb196b6e01d9c5c8a64456e8/exllamav2/model.py#L942), where `PagedParams` is used, but I don't know what to pass for its parameters.
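To make the padding problem concrete, here is a rough sketch in plain Python of what I understand paged attention to be doing. It is not exllamav2 code: the 256-token page size and the helper names `padded_slots`, `paged_slots` and `build_block_table` are my own assumptions for illustration. It compares the cache slots needed when every sequence is padded to the longest one against the slots needed when each sequence only holds whole pages from a shared pool, and builds a per-sequence table of page indices.

```python
import math

PAGE_SIZE = 256  # assumed page size, for illustration only


def padded_slots(seq_lens):
    """Cache slots used when every sequence is padded to the longest one."""
    return len(seq_lens) * max(seq_lens)


def paged_slots(seq_lens, page_size=PAGE_SIZE):
    """Cache slots used when each sequence holds only the whole pages it needs."""
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)


def build_block_table(seq_lens, page_size=PAGE_SIZE):
    """Hand out page indices from one shared pool, one row per sequence.

    Rows are right-padded with -1 so the table is rectangular.
    """
    next_page = 0
    rows = []
    for n in seq_lens:
        count = math.ceil(n / page_size)
        rows.append(list(range(next_page, next_page + count)))
        next_page += count
    width = max(len(row) for row in rows)
    return [row + [-1] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    seq_lens = [100, 900, 2000, 300]
    print("padded:", padded_slots(seq_lens))  # 4 * 2000 = 8000
    print("paged: ", paged_slots(seq_lens))   # 256 + 1024 + 2048 + 512 = 3840
    for row in build_block_table(seq_lens):
        print(row)
```

If this picture is right, what I am missing is how these page indices and per-sequence lengths map onto the fields of `PagedParams`.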
