🚀 Interactive web-based demonstration of speculative decoding in large language models (LLMs)
This project provides a comprehensive, educational demonstration of how speculative decoding accelerates text generation in AI models. Watch as a fast, small model (DistilGPT-2) drafts multiple tokens ahead while a larger, more accurate model (GPT-2) verifies them, delivering a significant speedup while maintaining output quality.
Experience speculative decoding in action with three different interfaces:
- Main Demo: Interactive visualization with color-coded tokens
- Speed Comparison: Side-by-side performance race with detailed analytics
- Simple Comparison: Clean, no-frills performance comparison
Highlights:
- Real-time token generation with streaming text
- Visual feedback showing accepted (green) vs rejected (red) tokens
- Performance metrics demonstrating 2-3x speedup
- Model call reduction showing computational savings
- Educational insights into modern AI optimization techniques
Main Demo highlights:
- Interactive UI: Clean, responsive interface with real-time token visualization
- Visual Feedback: Color-coded tokens showing acceptance/rejection status
- 🟢 Green: Tokens accepted by the large model
- 🔴 Red: Tokens rejected and corrected
- 🟡 Yellow: Drafted tokens awaiting verification
- Real-time Statistics: Track efficiency, model calls saved, and acceptance rates
- Mock Mode: Fast testing without loading heavy AI models
- Configurable Parameters: Adjust number of speculative tokens (k)
- Generation Log: Detailed step-by-step process visualization
Speed Comparison highlights:
- Side-by-Side Comparison: Run speculative vs sequential generation simultaneously
- Performance Metrics: Real-time speed, efficiency, and model call tracking
- Visual Race: Watch both methods generate text with progress bars
- Detailed Analysis: Comprehensive performance breakdown and speedup calculation
- Winner Declaration: Clear indication of which method performs better
Simple Comparison highlights:
- Token Streaming: Real-time token generation with visual feedback
- No Fancy Visualizations: Plain text output as it would appear normally
- Raw Performance Data: Clean comparison without animations or color coding
- Focus on Results: Emphasis on timing, efficiency, and model call statistics
- Straightforward Interface: Minimal UI focused on the core comparison
- Real Model Support: Full integration with actual AI models
- Streaming Support: Optional real-time token streaming as text is generated
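For reference, streaming in Transformers.js v2 is typically wired up through the `callback_function` generation option, which fires on every decoding step. A minimal sketch, assuming a text-generation pipeline named `generator` has already been loaded (the `output` element ID is illustrative, not from the actual demo code):

```javascript
// Stream partial output into the page as each token is generated.
// Assumes `generator` is an already-loaded text-generation pipeline;
// the 'output' element ID is a placeholder for this sketch.
await generator('Once upon a time', {
  max_new_tokens: 30,
  callback_function: (beams) => {
    // Decode everything generated so far and repaint the output area.
    const partial = generator.tokenizer.decode(beams[0].output_token_ids, {
      skip_special_tokens: true,
    });
    document.getElementById('output').textContent = partial;
  },
});
```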
How it works:
- Draft Phase: The small model (DistilGPT-2) generates k tokens ahead (k = 4 by default)
- Verification Phase: The large model (GPT-2) verifies the drafted tokens
- Acceptance/Rejection: Drafted tokens are accepted if they match the large model's own predictions and rejected if they don't
- Fallback: When a token is rejected, the large model generates the correct continuation
- Efficiency Gain: Every accepted token saves an expensive large-model call
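In code, the loop looks roughly like the sketch below. This is a simplified greedy variant, not the exact implementation in `script.js`; `draftTokens` and `verifyDraft` are hypothetical helpers standing in for the small- and large-model calls:

```javascript
// Simplified sketch of greedy speculative decoding.
// draftTokens(ctx, k) -> k tokens proposed by the small model (cheap calls).
// verifyDraft(ctx, d) -> the large model's own token choice at each of the
//                        k draft positions plus one bonus token (k + 1 total),
//                        all from a single (expensive) forward pass.
async function speculativeDecode(promptTokens, targetLength, k = 4) {
  const tokens = [...promptTokens];
  while (tokens.length < targetLength) {
    // Draft phase: the small model proposes k tokens ahead.
    const draft = await draftTokens(tokens, k);

    // Verification phase: one large-model pass checks all k positions.
    const expected = await verifyDraft(tokens, draft);

    // Accept drafted tokens while they match the large model's choice.
    let i = 0;
    while (i < draft.length && draft[i] === expected[i]) {
      tokens.push(draft[i]);
      i += 1;
    }
    // On the first mismatch this is the correction; if the whole draft
    // was accepted it is a free bonus token. Either way, each large-model
    // pass yields at least one token and up to k + 1.
    tokens.push(expected[i]);
  }
  return tokens.slice(0, targetLength);
}
```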
Tech stack:
- Frontend: HTML5, CSS3, Vanilla JavaScript
- AI Models: Transformers.js (@xenova/transformers) loaded via CDN
- Small Model: Xenova/distilgpt2
- Large Model: Xenova/gpt2
- Styling: Custom CSS with gradient backgrounds and animations
- Package Manager: pnpm
- Module Loading: Direct CDN imports to avoid browser compatibility issues
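Concretely, the CDN-based setup looks something like this (a minimal sketch using the public Transformers.js `pipeline` API; the version pin and generation options are illustrative and may differ from the actual scripts):

```javascript
// Load Transformers.js straight from a CDN -- no bundler or build step.
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.17.2';

// The small drafting model and the large verifying model, both run in-browser.
const draftModel  = await pipeline('text-generation', 'Xenova/distilgpt2');
const targetModel = await pipeline('text-generation', 'Xenova/gpt2');

// Both pipelines share the same call signature.
const out = await draftModel('The quick brown fox', { max_new_tokens: 4 });
console.log(out[0].generated_text);
```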
To run the demo locally you'll need:
- Node.js 16+
- pnpm - a fast, disk-space-efficient package manager:

```bash
npm install -g pnpm
```

1. Clone the repository

   ```bash
   git clone https://github.com/gourav221b/Speculative-decoding-WebAI-demo.git
   cd Speculative-decoding-WebAI-demo
   ```

2. Install dependencies

   ```bash
   pnpm install
   ```

3. Start the development server

   ```bash
   pnpm run dev
   ```

4. Open your browser
   - Navigate to http://localhost:3000
   - Start with the main demo or try the comparisons
- Start with Mock Mode - Instant results for immediate testing
- Try the Main Demo - See color-coded token visualization
- Run Speed Comparison - Watch the performance race
- Load Real Models - Experience authentic AI model performance (optional)
Main Demo walkthrough:
- Start with Mock Mode: For instant results, keep "Mock mode" checked
- Enter a Prompt: Type your text prompt (e.g., "The quick brown fox")
- Adjust Parameters: Set the number of speculative tokens (k) - default is 4
- Generate: Click the "Generate" button to start the speculative decoding process
- Watch the Process: Observe tokens being drafted (yellow), then accepted (green) or rejected (red)
- View Statistics: Monitor efficiency rates and model calls saved
Speed Comparison walkthrough:
- Navigate to Comparison: Click "Speed Comparison" in the footer
- Enter Prompt: Type the same prompt for both methods to compare
- Set Parameters: Adjust target length and speculative tokens (k)
- Start Race: Click "Start Comparison" to run both methods simultaneously
- Watch the Race: See real-time progress bars and token generation
- View Results: Analyze speedup, efficiency, and detailed performance metrics
Simple Comparison walkthrough:
- Navigate to Simple: Click "Simple Comparison" in the footer
- Enter Prompt: Type your prompt for both methods
- Configure Settings: Set target tokens and speculative parameters
- Load Models: Optionally load real AI models or use mock mode
- Start Comparison: Run both methods without fancy visualizations
- Review Results: See plain text output and performance statistics
To use actual AI models instead of mock generation:
- Uncheck "Mock mode"
- Wait for models to load (first time may take a few minutes)
- Generate text with real DistilGPT-2 and GPT-2 models
```
├── index.html              # Main interactive demo with visual tokens
├── comparison.html         # Fancy side-by-side speed comparison
├── simple-comparison.html  # Clean performance comparison
├── styles.css              # Professional styling and animations
├── script.js               # Main demo logic with speculative decoding
├── comparison.js           # Speed comparison functionality
├── simple-comparison.js    # Simple comparison logic
├── script-compatible.js    # Fallback compatible version
├── package.json            # Dependencies and scripts
├── pnpm-lock.yaml          # Package lock file
├── .gitignore              # Git ignore rules
└── README.md               # Project documentation
```
| Page | Description | Best For |
|---|---|---|
| `index.html` | Interactive visualization with color-coded tokens | Learning and presentations |
| `comparison.html` | Animated side-by-side race with rich visuals | Demonstrations and education |
| `simple-comparison.html` | Clean text-only performance comparison | Technical analysis and research |
The core logic covers:
- Model loading and management
- Token generation with both models
- The speculative decoding algorithm implementation
- UI interaction and visual updates

The visualization layer provides:
- Animated token streaming
- Real-time color coding
- Progress tracking
- A statistics dashboard
- Generation logging
Speculative decoding can provide significant speedup:
- Theoretical Speedup: Up to 2-3x faster generation
- Efficiency Tracking: Real-time monitoring of acceptance rates
- Model Call Reduction: Fewer expensive large model calls
Along the way, the demo illustrates:
- Speculative Decoding Mechanics: See exactly how small models draft and large models verify
- Performance Optimization: Understand why this technique provides 2-3x speedup
- Token-Level Processing: Visualize how AI models generate text token by token
- Model Efficiency: Learn about computational trade-offs in AI systems
- Real-World Applications: Understand how modern AI systems achieve faster inference
Typical performance characteristics:
- Acceptance Rates: Typically 60-80% of drafted tokens are accepted
- Model Call Reduction: 30-50% fewer expensive large model calls
- Latency Benefits: Significant reduction in time-to-first-token
- Quality Maintenance: Same output quality as sequential generation
- Scalability: Benefits increase with larger model size differences
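These numbers hang together arithmetically. Under the standard simplification from the speculative decoding literature that each drafted token is accepted independently with probability α, one expensive large-model pass yields on average (1 − α^(k+1)) / (1 − α) tokens. A quick check in JavaScript:

```javascript
// Expected tokens per large-model pass with draft length k, assuming
// each drafted token is accepted independently with probability alpha.
// (A simplification -- real acceptances are correlated across positions.)
function expectedTokensPerPass(alpha, k) {
  // Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
  return (1 - Math.pow(alpha, k + 1)) / (1 - alpha);
}

console.log(expectedTokensPerPass(0.6, 4).toFixed(2)); // ~2.31
console.log(expectedTokensPerPass(0.8, 4).toFixed(2)); // ~3.36
```

With a cheap draft model, roughly 2.3-3.4 tokens per expensive pass at 60-80% acceptance is exactly the 2-3x regime reported above.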
This demo is built for:
- Students learning about AI optimization techniques
- Developers implementing speculative decoding
- Researchers studying LLM inference optimization
- Educators teaching modern AI concepts
- Engineers optimizing AI application performance
Browser support:
- Modern browsers with ES6+ support
- Chrome, Firefox, Safari, Edge (latest versions)
- Mobile responsive design
- Requires internet connection for CDN-based module loading
If you see "Failed to resolve module specifier @xenova/transformers":
- Check Browser Support: Ensure you're using a modern browser
- Internet Connection: The app loads transformers.js from CDN
- HTTPS: Some browsers require HTTPS for ES6 modules
- View Troubleshooting Page: Open `troubleshooting.html` for detailed solutions
Other common issues:
- Slow Loading: First-time model loading can take 2-3 minutes
- Memory Usage: AI models require significant RAM (2GB+ recommended)
- CORS Errors: Use the provided development server, not file:// protocol
Available scripts:
- `pnpm run dev` - Start development server
- `pnpm run start` - Start production server
- `pnpm run build` - No build step (static files)
Open `test.html` in your browser to run the test suite.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
MIT License - Feel free to use and modify for educational and commercial purposes.
- Google Research and DeepMind for pioneering speculative decoding research
- Hugging Face for Transformers.js and model hosting
- Xenova for the excellent browser-compatible AI models
- The AI Community for advancing LLM optimization techniques
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Contact: Create an issue for questions or feedback
⭐ Star this repository if you found it helpful!
Built with ❤️ for the AI education community