How to implement paged attention in HF format? #616

Open
@fahadh4ilyas

I've built an HF-format wrapper around exllamav2, and it works well with batches (my code is in #606). Now I have a new problem: a bigger batch means higher memory usage, and most of that memory goes to padding, especially when the token sequences in the batch differ in length. Could you explain how exllamav2's paged attention works in code? I checked exllamav2/model.py and saw that PagedParams is used, but I don't know what to pass for its parameters.
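
To make the padding cost concrete, here is a rough sketch of the arithmetic as I understand it. It is not exllamav2 code: `padded_slots`, `paged_slots`, and `block_table` are hypothetical helper names, and the page size of 256 tokens is an assumption. It only compares how many cache slots a padded batch reserves against a paged layout in which each sequence takes just the fixed-size pages it needs from a shared pool.

```python
import math

def padded_slots(seq_lens):
    """Cache slots reserved when every sequence is padded to the longest one."""
    return len(seq_lens) * max(seq_lens)

def paged_slots(seq_lens, page_size=256):
    """Cache slots reserved when each sequence gets only the pages it needs.

    page_size=256 is an assumption for illustration.
    """
    return sum(math.ceil(n / page_size) for n in seq_lens) * page_size

def block_table(seq_lens, page_size=256):
    """Hand out page indices from a shared pool, one list per sequence."""
    table, next_page = [], 0
    for n in seq_lens:
        pages = math.ceil(n / page_size)
        table.append(list(range(next_page, next_page + pages)))
        next_page += pages
    return table

seq_lens = [100, 900, 2048, 300]
print(padded_slots(seq_lens))  # 4 sequences * 2048 slots each
print(paged_slots(seq_lens))   # 15 pages * 256 slots each
print(block_table(seq_lens))   # which pages belong to which sequence
```

With these lengths the padded batch reserves 8192 slots while the paged layout reserves 3840, so the question is really how to build something like this per-sequence page table and hand it to PagedParams.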
