PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
A minimal C implementation of speculative decoding, based on llama2.c
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Official Implementation of EAGLE
Implementation of the paper "Fast Inference from Transformers via Speculative Decoding" (Leviathan et al., 2023); a toy sketch of this draft-and-verify loop appears after this list.
A scalable and robust tree-based speculative decoding algorithm; a toy tree-verification sketch also appears after the list.
Reproducibility Project for [NeurIPS'23] Speculative Decoding with Big Little Decoder
Dynasurge: Dynamic Tree Speculation for Prompt-Specific Decoding
Experiments verifying the effect of speculative decoding on Japanese text.
[NeurIPS'23] Speculative Decoding with Big Little Decoder
Code for our paper "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation" (EMNLP 2023 Findings)
Experiments aimed at increasing LLM throughput and efficiency via speculative decoding.
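Most of the repositories above build on the same draft-and-verify loop introduced by Leviathan et al. The following is a minimal, self-contained Python sketch of that loop under toy assumptions: `draft_probs`, `target_probs`, `VOCAB`, and `GAMMA` are illustrative stand-ins invented here, not code from any repository in this list.

```python
# Toy sketch of speculative decoding (Leviathan et al., 2023).
# draft_probs/target_probs, VOCAB, and GAMMA are illustrative stand-ins,
# not APIs from any repository above.
import numpy as np

VOCAB = 8    # toy vocabulary size (assumption)
GAMMA = 4    # draft tokens proposed per round (assumption)
rng = np.random.default_rng(0)

def toy_dist(ctx, seed):
    # Deterministic pseudo-random distribution per context, standing in for a model.
    g = np.random.default_rng(hash((tuple(ctx), seed)) % (2**32))
    return g.dirichlet(np.ones(VOCAB))

def draft_probs(ctx):
    return toy_dist(ctx, seed=1)   # the small, fast "draft" model

def target_probs(ctx):
    return toy_dist(ctx, seed=2)   # the large, slow "target" model

def speculative_step(ctx):
    # 1. Draft model proposes GAMMA tokens autoregressively.
    proposed, q = [], []
    for _ in range(GAMMA):
        dist = draft_probs(ctx + proposed)
        proposed.append(int(rng.choice(VOCAB, p=dist)))
        q.append(dist)
    # 2. Target model scores every prefix (one batched forward pass in practice).
    p = [target_probs(ctx + proposed[:i]) for i in range(GAMMA + 1)]
    # 3. Accept each draft token with probability min(1, p/q); on the first
    #    rejection, resample from the residual max(p - q, 0) and stop.
    out = []
    for i, tok in enumerate(proposed):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            resid = np.maximum(p[i] - q[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return out
    # All GAMMA tokens accepted: sample one bonus token from the target.
    out.append(int(rng.choice(VOCAB, p=p[GAMMA])))
    return out

ctx = [0]
for _ in range(5):
    ctx += speculative_step(ctx)
print(ctx)
```

The residual resampling on rejection is what makes the scheme lossless: accepted output is distributed exactly as if sampled from the target model alone, while a single target pass verifies up to GAMMA draft tokens.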
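Tree-based variants (such as the tree algorithm and Dynasurge entries above) generalize the single draft chain to a tree of candidates verified together. Below is a greedy-acceptance toy sketch, reusing the `draft_probs`/`target_probs` stand-ins from the previous block; `K` and `DEPTH` are likewise assumptions.

```python
# Toy tree speculation with greedy acceptance: branch on the draft's top-K
# tokens at each depth, then keep the longest root-to-leaf prefix that
# matches the target model's greedy choices. K and DEPTH are assumptions.
import numpy as np

K, DEPTH = 2, 3

def tree_speculate(ctx, draft_probs, target_probs):
    # Grow all K**DEPTH candidate paths from the draft model's top-K tokens.
    paths = [[]]
    for _ in range(DEPTH):
        paths = [
            path + [int(tok)]
            for path in paths
            for tok in np.argsort(draft_probs(ctx + path))[-K:]
        ]
    # Verify: a real system scores the whole tree in one batched target pass
    # with a tree attention mask; here each prefix is scored naively.
    best = []
    for path in paths:
        n = 0
        while n < len(path) and path[n] == int(np.argmax(target_probs(ctx + path[:n]))):
            n += 1
        best = max(best, path[:n], key=len)
    return best

# Uses the toy draft_probs/target_probs defined in the previous sketch.
print(tree_speculate([0], draft_probs, target_probs))
```

The trade-off: verifying K**DEPTH paths costs more compute per round but raises the chance that some long path survives verification, which is the knob the dynamic and prompt-specific tree methods above tune.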