brontoguana / krasis

Krasis is a hybrid LLM runtime that focuses on running larger models efficiently on consumer-grade, VRAM-limited hardware.

transformer inference-engine inference-optimization mixture-of-experts cpu-inference large-language-models gpu-inference llm-inference high-performance-inference hybrid-inference gguf-model-support llama-cpp-alternative

Updated May 28, 2026
C++

aws-samples / sample-genai-on-eks-starter-kit

A comprehensive toolkit for deploying production-ready Generative AI infrastructure on Amazon EKS. Includes pre-configured components for: 🚀 AI Gateway (LiteLLM), 🤖 LLM Serving (vLLM, SGLang, Ollama), 📊 Vector Databases, 🔍 Embedding Models (TEI), 📈 Observability (Langfuse, Phoenix), etc. Fast-track your GenAI deployment with Kubernetes.

kubernetes aws terraform ai-agents amazon-eks ai-platform vector-database ai-engineering generative-ai llmops llm-serving gpu-inference vllm llm-inference langfuse ai-gateway agentic-ai mcp-server nvidia-dynamo

Updated May 26, 2026
JavaScript

redbco / infermesh

GPU-aware inference mesh for large-scale AI serving

rust distributed-systems fault-tolerance high-availability service-mesh observability inference-engine model-serving ml-infrastructure ai-inference gpu-inference ai-infrastructure gpu-mesh

Updated Sep 25, 2025
Rust

Talnz007 / VulkanIlm

GPU-accelerated LLaMA inference wrapper for legacy Vulkan-capable systems: a Pythonic way to run AI with knowledge (Ilm) on fire (Vulkan).

machine-learning vulkan python-wrapper fastai amd-gpu intel-gpu llama-cpp gpu-inference llm-inference localllm local-ai open-source-llm llama-cpp-python gguf legacy-gpus

Updated Oct 14, 2025
Python

youngharold / tightwad

Mixed-vendor GPU inference cluster manager with speculative decoding

Updated Apr 28, 2026
Python

Armaggheddon / ClipServe

🚀 ClipServe: A fast API server for embedding text, images, and performing zero-shot classification using OpenAI’s CLIP model. Powered by FastAPI, Redis, and CUDA for lightning-fast, scalable AI applications. Transform texts and images into embeddings or classify images with custom labels—all through easy-to-use endpoints. 🌐📊

docker redis docker-compose cuda transformers python3 gradio fastapi openai-clip gpu-inference

Updated Sep 29, 2024
Python

grctest / fastapi-gemma-translate

A FastAPI server for querying Google's Gemma Translate AI models for translations

docker google cuda translate gemma ai-api fastapi cpu-inference gpu-inference translategemma gemmatranslate

Updated Apr 26, 2026
Python

LLMSystems / TensorrtServer

A high-performance deep learning model inference server based on TensorRT, supporting fast inference for Embedding, Reranker, and NLI models.

high-performance cuda embeddings inference-server tensorrt reranking model-serving nli openai-api dynamic-batching gpu-inference

Updated Mar 18, 2026
Python

deapi-ai / deapi-tester

Open-source developer tool for testing deAPI.ai endpoints — unified AI inference API for image, video, audio, transcription, OCR and more

flux text-to-speech typescript nextjs api-client developer-tools image-generation unified-api ai-api video-generation ai-models ai-inference stable-diffusion gpu-inference deapi

Updated Apr 17, 2026
TypeScript

mpaepper / docker-cifar

Docker based GPU inference of machine learning models

deep-learning pytorch gpu-inference

Updated May 9, 2019
Python

caimari / vtts

Continuous batching for TTS — like vLLM, but for voice. Serve 10+ simultaneous text-to-speech requests on a single GPU.

text-to-speech pytorch tts speech-synthesis voice-synthesis voice-cloning voice-agent gpu-inference vllm continuous-batching real-time-tts qwen3-tts

Updated Mar 15, 2026
Python

jozsefszalma / intranet_image_generator

Generating images with diffusion models on a mobile device, with an intranet GPU box as backend

python fun react-native backend rest-api mobile-app pytorch image-generation diffusion prompt-engineering huggingface-diffusers gpu-inference

Updated Dec 3, 2025
Jupyter Notebook

FurkanAtass / Scalable-ML-Inference-Eks

End-to-end scalable ML inference on EKS: KEDA-driven pod autoscaling with Prometheus custom metrics, Cluster Autoscaler for GPU node scaling, and NVIDIA GPU time-slicing to run multiple pods per GPU.

kubernetes machine-learning terraform scalability prometheus aws-eks mlops keda gpu-inference

Updated Aug 29, 2025
HCL

manishklach / tlb-invalidation-lab

Making TLB invalidation observable, attributable, and measurable in modern AI workloads.

performance x86-64 linux-kernel arm64 mmu observability tlb systems-programming ftrace kernel-tracing gpu-inference ai-infrastructure

Updated Apr 28, 2026
Python

paralleliq / modelspec

ModelSpec is an open, declarative specification for describing how AI models, especially LLMs, are deployed, served, and operated in production. It captures execution, serving, and orchestration intent to enable validation, reasoning, and automation across modern AI infrastructure.

kubernetes runtime json-schema inference autoscaling observability model-deployment model-serving declarative-config production-ai mlops ai-systems llm ai-ops gpu-inference vllm llm-deployment ai-infrastructure modelspec

Updated Apr 27, 2026
Python

hkevin01 / NeuroAccel

GPU-Accelerated Brain Image Processing Pipeline for OpenNeuro Datasets

Updated May 19, 2026
Python

hkevin01 / Llama-GPU

A project to build GPU acceleration for LLaMA models on local computers and AWS, leveraging GPU resources for efficient inference and training.

gpu-computing status-active lang-python generative-ai gpu-inference ai-ml-portfolio scope-medium framework-numpy framework-matplotlib framework-pandas framework-huggingface framework-pytorch framework-scikit-learn framework-cuda ai-ml-portfolio-llm gpu-scientific gpu-deep-learning gpu-llm

Updated May 19, 2026
Python

jkoenig72 / hal9000-tts

HAL 9000 voice cloning with Qwen3-TTS streaming - OpenAI-compatible REST API

python text-to-speech cuda tts speech-synthesis voice-assistant privacy-first hal9000 voice-cloning fastapi german-tts streaming-tts gpu-inference openai-compatible local-tts real-time-tts qwen3-tts

Updated Apr 12, 2026
Python

pradhankukiran / vox-populi

Personal text-to-speech webapp powered by VoxCPM2 — voice design, controllable cloning, and ultimate cloning. Next.js on Vercel + Modal GPU.

python text-to-speech typescript ai nextjs modal tts voice-synthesis tailwindcss voice-cloning fastapi huggingface shadcn-ui gpu-inference nextjs16 nano-vllm openbmb voxcpm2

Updated May 14, 2026
TypeScript

rpathai7-netizen / dnsts-cascade

4-tier asynchronous LLM cascade system achieving 120 tokens/sec on constrained hardware using llama.cpp, speculative decoding, and GPU+CPU parallel inference

windows multi-agent quantization cpu-inference large-language-models llama-cpp gpu-inference local-ai gguf speculative-decoding

Updated Apr 5, 2026
Python

gpu-inference

Here are 35 public repositories matching this topic...
