Krasis is a hybrid LLM runtime focused on running larger models efficiently on consumer-grade, VRAM-limited hardware
Updated May 28, 2026 - C++
A comprehensive toolkit for deploying production-ready Generative AI infrastructure on Amazon EKS. Includes pre-configured components for: 🚀 AI Gateway (LiteLLM), 🤖 LLM Serving (vLLM, SGLang, Ollama), 📊 Vector Databases, 🔍 Embedding Models (TEI), 📈 Observability (Langfuse, Phoenix), etc. Fast-track your GenAI deployment with Kubernetes.
GPU-aware inference mesh for large-scale AI serving
GPU-accelerated LLaMA inference wrapper for legacy Vulkan-capable systems: a Pythonic way to run AI with knowledge (Ilm) on fire (Vulkan).
Mixed-vendor GPU inference cluster manager with speculative decoding
🚀 ClipServe: A fast API server for embedding text, images, and performing zero-shot classification using OpenAI’s CLIP model. Powered by FastAPI, Redis, and CUDA for lightning-fast, scalable AI applications. Transform texts and images into embeddings or classify images with custom labels—all through easy-to-use endpoints. 🌐📊
A FastAPI server for querying Google's Gemma Translate AI models for translations
A high-performance deep learning model inference server based on TensorRT, supporting fast inference for Embedding, Reranker, and NLI models.
Open-source developer tool for testing deAPI.ai endpoints — unified AI inference API for image, video, audio, transcription, OCR and more
Docker based GPU inference of machine learning models
Continuous batching for TTS — like vLLM, but for voice. Serve 10+ simultaneous text-to-speech requests on a single GPU.
Generating images with diffusion models on a mobile device, with an intranet GPU box as backend
End-to-end scalable ML inference on EKS: KEDA-driven pod autoscaling with Prometheus custom metrics, Cluster Autoscaler for GPU node scaling, and NVIDIA GPU time-slicing to run multiple pods per GPU.
Making TLB invalidation observable, attributable, and measurable in modern AI workloads.
ModelSpec is an open, declarative specification for describing how AI models, especially LLMs, are deployed, served, and operated in production. It captures execution, serving, and orchestration intent to enable validation, reasoning, and automation across modern AI infrastructure.
GPU-Accelerated Brain Image Processing Pipeline for OpenNeuro Datasets
A project to build GPU acceleration for LLaMA models on local computers and AWS, leveraging GPU resources for efficient inference and training.
HAL 9000 voice cloning with Qwen3-TTS streaming - OpenAI-compatible REST API
Personal text-to-speech webapp powered by VoxCPM2 — voice design, controllable cloning, and ultimate cloning. Next.js on Vercel + Modal GPU.
4-tier asynchronous LLM cascade system achieving 120 tokens/sec on constrained hardware using llama.cpp, speculative decoding, and GPU+CPU parallel inference