LLMA is an end-to-end optimization framework for large language models.
The goal of LLMA is to accelerate the inference of large language models in both cloud and embedded environments.
With LLMA, different large language models can be deployed to different platforms easily, flexibly, and with high performance.
For example, LLMA can deploy a large language model such as LLaMA-7B on different hardware, including NVIDIA GPUs and the Cloudblazer Yunsui t20.
LLMA supports serving inference over client requests: the client sends an inference request, and LLMA returns the inference result to the client.
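As a concrete illustration, the following Python snippet sketches such a round-trip. The endpoint URL, request schema, and response format are assumptions for illustration only; LLMA's actual serving API is not documented in this section.

```python
# A minimal client sketch, assuming LLMA exposes an HTTP inference
# endpoint. The URL, JSON fields, and response shape below are
# illustrative assumptions, not LLMA's documented API.
import requests

response = requests.post(
    "http://localhost:8000/v1/inference",  # assumed server address and route
    json={
        "model": "llama-7b",               # assumed model identifier
        "prompt": "Explain attention in one sentence.",
        "max_new_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())                     # assumed: inference result as JSON
```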
LLMA supports several optimization techniques, such as model fine-tuning and model quantization.
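LLMA's own quantization interface is not shown here; as a generic sketch of the technique, the standalone PyTorch example below applies post-training dynamic quantization to a toy model, converting its Linear weights to int8.

```python
# Standalone illustration of post-training dynamic quantization using
# PyTorch's public API; this sketches the general technique, not LLMA's
# quantization interface.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Weights of the Linear layers are quantized to int8; activations are
# quantized dynamically at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(quantized(x).shape)  # same call interface, smaller weights
```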
This example demonstrates how to use LLMA to deploy LLaMA-7B on the Cloudblazer Yunsui t20.
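A minimal sketch of what such a deployment could look like follows; the `llma` package name and every function and parameter in it are hypothetical placeholders, since the actual API is not documented in this section.

```python
# Hypothetical deployment sketch: the `llma` module and all names and
# parameters below are placeholders for illustration, not LLMA's real API.
import llma  # hypothetical package name

# Compile LLaMA-7B for the target backend (identifiers assumed).
engine = llma.compile(
    model="LLaMA-7B",
    target="yunsui-t20",  # assumed backend name for Cloudblazer Yunsui t20
    quantize="int8",      # assumed optimization option
)

# Serve the compiled engine so clients can send inference requests.
engine.serve(host="0.0.0.0", port=8000)
```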
Apache License 2.0