Skip to content
/ llmaz Public

☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!

License

Notifications You must be signed in to change notification settings

InftyAI/llmaz

Repository files navigation

llmaz

Easy, advanced inference platform for large language models on Kubernetes

stability-alpha GoReport Widget Latest Release

llmaz (pronounced /lima:z/), aims to provide a Production-Ready inference platform for large language models on Kubernetes. It closely integrates with the state-of-the-art inference backends to bring the leading-edge researches to cloud.

🌱 llmaz is alpha now, so API may change before graduating to Beta.

Architecture

architecture

Features Overview

  • Easy of Use: People can quick deploy a LLM service with minimal configurations.
  • Broad Backends Support: llmaz supports a wide range of advanced inference backends for different scenarios, like vLLM, Text-Generation-Inference, SGLang, llama.cpp. Find the full list of supported backends here.
  • Efficient Model Distribution (WIP): Out-of-the-box model cache system support with Manta, still under development right now with architecture reframing.
  • Accelerator Fungibility: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
  • SOTA Inference: llmaz supports the latest cutting-edge researches like Speculative Decoding or Splitwise(WIP) to run on Kubernetes.
  • Various Model Providers: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
  • Multi-hosts Support: llmaz supports both single-host and multi-hosts scenarios with LWS from day 0.
  • Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster-Autoscaler or Karpenter to satisfy elastic needs.

Quick Start

Installation

Read the Installation for guidance.

Deploy

Here's a toy example for deploying facebook/opt-125m, all you need to do is to apply a Model and a Playground.

If you're running on CPUs, you can refer to llama.cpp, or more examples here.

Note: if your model needs Huggingface token for weight downloads, please run kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token> ahead.

Model

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1

Inference Playground

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m

Test

Expose the service

kubectl port-forward pod/opt-125m-0 8080:8080

Get registered models

curl http://localhost:8080/v1/models

Request a query

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
}'

More than quick-start

If you want to learn more about this project, please refer to develop.md.

Roadmap

  • Gateway support for traffic routing
  • Metrics support
  • Serverless support for cloud-agnostic users
  • CLI tool support
  • Model training, fine tuning in the long-term

Community

Join us for more discussions:

Contributions

All kinds of contributions are welcomed ! Please following CONTRIBUTING.md.