feat: RAG API optimizations with intelligent ha/backup embedding #197

dirkpetersen · 2025-08-24T04:39:46Z

This comprehensive update adds intelligent backup embedding providers, performance optimizations, and comprehensive error handling:

After trying to upload more than 200 PDF docs in one go, I ran into some limitations, for example rate limits on AWS and Azure for their embedding models (#187). If LibreChat is installed on-premises, the cloud based embedders can add significant latency. The solutions was for us to install an Nvidia NIM with one of their embedders on a local GPU and if there is any error or if that server is not reachable, fall back to any of the cloud providers (AWS Bedrock example provided.)

🚀 New Features

Intelligent Backup Embedding System

Ultra-fast failover: Socket check detects dead ports in 0.5 seconds
Immediate failover: Primary failure triggers instant backup attempt (no retries)
Smart cooldown: 1-minute cooldown after primary provider failure
Seamless switching: LibreChat receives 200 status when backup succeeds
Fast recovery: Optimized retry logic prevents cascading failures
Clear logging: Prominent failure messages and accurate provider tracking

Custom NVIDIA Embeddings Provider

Full NVIDIA API compatibility for LLaMA embedding models
Fast port detection: Socket check fails immediately if nothing listening
Optimized timeouts: 0.5s socket check, 2s connection, 3s read timeout
Configurable parameters: batch size, retries, timeout, input types
Fast failover mode: Reduced retries when backup provider configured
Proper error handling for NVIDIA-specific API responses

Enhanced AWS Bedrock Support

Titan V2 embeddings with configurable dimensions (256/512/1024)
Reactive rate limiting - only activates when AWS throttles requests
Graceful error handling with user-friendly configuration messages
Backward compatibility with Titan V1 models

Database & Performance Optimizations

Graceful PostgreSQL error handling - 503 responses for connection issues
Optimized chunking strategy - adaptive batch sizes based on chunk size
Request throttling middleware - prevents LibreChat overload (configurable)
Improved UTF-8 file processing with proper cleanup and null checks
Enhanced connection pooling with optimized timeout settings

📋 Configuration

Backup Provider Setup

# Primary Provider
EMBEDDINGS_PROVIDER=nvidia
EMBEDDINGS_MODEL=nvidia/llama-3.2-nemoretriever-300m-embed-v1
NVIDIA_TIMEOUT=3  # Fast failover - 3 second read timeout

# Backup Provider
EMBEDDINGS_PROVIDER_BACKUP=bedrock
EMBEDDINGS_MODEL_BACKUP=amazon.titan-embed-text-v2:0
PRIMARY_FAILOVER_COOLDOWN_MINUTES=1

# Performance Tuning
EMBED_CONCURRENCY_LIMIT=3

Bedrock Titan V2 Configuration

BEDROCK_EMBEDDING_DIMENSIONS=512  # 256, 512, or 1024
BEDROCK_EMBEDDING_NORMALIZE=true
BEDROCK_MAX_BATCH_SIZE=15

🧪 Testing

31 comprehensive tests covering V1/V2 compatibility
Error simulation and recovery testing
Integration tests for backup failover scenarios

🛠️ Technical Improvements

Ultra-fast port detection - Socket check with 0.5s timeout before connection
Immediate failover logic - no retry delays when backup is available
Triple-layer timeout strategy - socket (0.5s), connection (2s), read (3s)
Conditional AWS credential loading - only when Bedrock is configured
Thread-safe state management with proper locking
Pydantic v2 compatibility with proper field declarations
Comprehensive error categorization and user-friendly messages

📚 Documentation

Complete environment variable documentation in README.md
High availability configuration examples with NVIDIA + Bedrock setup
Detailed provider configuration guides for all supported embedding services

This update ensures robust, production-ready embedding operations with lightning-fast failover (0.5-3 seconds), optimal performance, and excellent user experience.

🤖 Generated with Claude Code

danny-avila · 2025-08-25T03:42:40Z

is it possible to add each of the changes piecemeal to focus on what they do individually?

🤖 Generated with Claude Code

Also I'm hesitant to add something generated by Claude without some validation on what it's doing, some examples of manual testing, etc.

…eddings This comprehensive update adds intelligent backup embedding providers, performance optimizations, and comprehensive error handling: ## 🚀 New Features ### Intelligent Backup Embedding System - **Ultra-fast failover**: Socket check detects dead ports in 0.5 seconds - **Immediate failover**: Primary failure triggers instant backup attempt (no retries) - **Smart cooldown**: 1-minute cooldown after primary provider failure - **Automatic recovery detection**: Tests primary recovery when both providers fail - **Seamless switching**: LibreChat receives 200 status when backup succeeds - **Fast recovery**: Optimized retry logic prevents cascading failures - **Clear logging**: Prominent failure messages and accurate provider tracking ### Custom NVIDIA Embeddings Provider - **Full NVIDIA API compatibility** for LLaMA embedding models - **Fast port detection**: Socket check fails immediately if nothing listening - **Optimized timeouts**: 0.5s socket check, 2s connection, 3s read timeout - **Configurable parameters**: batch size, retries, timeout, input types - **Fast failover mode**: Reduced retries when backup provider configured - **Proper error handling** for NVIDIA-specific API responses ### Enhanced AWS Bedrock Support - **Titan V2 embeddings** with configurable dimensions (256/512/1024) - **Optimized timeouts**: 5s connection, 30s read (reduced from 60s default) - **Reactive rate limiting** - only activates when AWS throttles requests - **Graceful error handling** with user-friendly configuration messages - **Backward compatibility** with Titan V1 models ### Database & Performance Optimizations - **Graceful PostgreSQL error handling** - 503 responses for connection issues - **Optimized chunking strategy** - adaptive batch sizes based on chunk size - **Request throttling middleware** - prevents LibreChat overload (configurable) - **Improved UTF-8 file processing** with proper cleanup and null checks - **Enhanced connection pooling** with optimized timeout settings ## 🧪 Comprehensive Testing Suite - **59 passing unit tests** covering all functionality - **Automated failover testing** with service interruption simulation - **JWT authentication integration** matching LibreChat's auth flow - **Automatic document cleanup** after testing - **Configurable test environments** via environment variables ## 📋 Configuration ### Backup Provider Setup ```env # Primary Provider EMBEDDINGS_PROVIDER=nvidia EMBEDDINGS_MODEL=nvidia/llama-3.2-nemoretriever-300m-embed-v1 NVIDIA_TIMEOUT=3 # Fast failover - 3 second read timeout # Backup Provider EMBEDDINGS_PROVIDER_BACKUP=bedrock EMBEDDINGS_MODEL_BACKUP=amazon.titan-embed-text-v2:0 PRIMARY_FAILOVER_COOLDOWN_MINUTES=1 # Performance Tuning EMBED_CONCURRENCY_LIMIT=3 ``` ### Bedrock Titan V2 Configuration ```env BEDROCK_EMBEDDING_DIMENSIONS=512 # 256, 512, or 1024 BEDROCK_EMBEDDING_NORMALIZE=true BEDROCK_MAX_BATCH_SIZE=15 ``` ## 🛠️ Technical Improvements - **Ultra-fast port detection** - Socket check with 0.5s timeout before connection - **Immediate failover logic** - no retry delays when backup is available - **Triple-layer timeout strategy** - socket (0.5s), connection (2s), read (3s) - **Automatic recovery detection** - checks primary when both providers fail - **Optimized Bedrock timeouts** - 5s connection, 30s read for faster failover - **Conditional AWS credential loading** - only when Bedrock is configured - **Thread-safe state management** with proper locking - **Pydantic v2 compatibility** with proper field declarations - **Comprehensive error categorization** and user-friendly messages ## 📚 Documentation - **Complete environment variable documentation** in README.md - **High availability configuration examples** with NVIDIA + Bedrock setup - **Detailed provider configuration guides** for all supported embedding services - **Timeout optimization documentation** for production deployments - **Comprehensive testing documentation** with automated and manual testing procedures This update ensures robust, production-ready embedding operations with lightning-fast failover (0.5-35 seconds), optimal performance, and excellent user experience. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

dirkpetersen · 2025-08-27T15:05:03Z

Yes, that caution is understandable. I changed that one test case that made github actions fail and there is now a new script that allows me to automatically upload a bunch of pdfs and tests the failover using iptables which then shows in the logs.

(venv) ochat@ochat:~/rag_api$ sudo -v && python3 test_failover_automation.py
🚀 RAG API Failover Automation Test
This script will upload PDFs while toggling NVIDIA service availability
Press Ctrl+C to stop the test at any time

✅ Sudo access confirmed for iptables commands
[07:09:20] 🎯 Starting failover test with 28 PDFs
[07:09:20] 📁 PDF directory: /home/ochat/uploads_rag/pdfs
[07:09:20] 🌐 RAG API URL: http://localhost:8001
[07:09:20] 🧹 Cleaning up iptables rules...
[07:09:20]    ✅ NVIDIA port 8003 is now unblocked
[07:09:20] 🚫 Blocking NVIDIA port 8003...
[07:09:20]
📋 Progress: 1/28
[07:09:20] 📄 Uploading 2301.00013.pdf (ID: test-230...)
[07:09:20]    ⏳ Port blocked for 12 seconds
[07:09:22]    ✅ SUCCESS: 2301.00013.pdf uploaded in 1.2s
[07:09:22]    ⏳ Waiting 5s before next upload...
[07:09:27]
📋 Progress: 2/28
[07:09:27] 📄 Uploading 2301.00028.pdf (ID: test-230...)
[07:09:32]    ✅ SUCCESS: 2301.00028.pdf uploaded in 5.3s
[07:09:32]    ⏳ Waiting 5s before next upload...
[07:09:32] ✅ Unblocking NVIDIA port 8003...
[07:09:32]    ⚠️ iptables command failed: Command '['sudo', 'iptables', '-D', 'OUTPUT', '-p', 'tcp', '--dport', '8003', '-j', 'REJECT']' returned non-zero exit status 1.
[07:09:37]
📋 Progress: 3/28
[07:09:37] 📄 Uploading 2301.00003.pdf (ID: test-230...)
[07:09:37] 🚫 Blocking NVIDIA port 8003...
[07:09:37]    ⏳ Port blocked for 11 seconds
[07:09:38]    ✅ SUCCESS: 2301.00003.pdf uploaded in 0.9s
[07:09:38]    ⏳ Waiting 3s before next upload...
[07:09:41]
📋 Progress: 4/28
[07:09:41] 📄 Uploading 2301.00002.pdf (ID: test-230...)
[07:09:47]    ✅ SUCCESS: 2301.00002.pdf uploaded in 6.7s
[07:09:47]    ⏳ Waiting 8s before next upload...
[07:09:48] ✅ Unblocking NVIDIA port 8003...
[07:09:48]    ⏳ Port unblocked for 17 seconds
[07:09:55]
📋 Progress: 5/28
[07:09:55] 📄 Uploading 2301.00007.pdf (ID: test-230...)
[07:10:05]    ✅ SUCCESS: 2301.00007.pdf uploaded in 9.6s
[07:10:05]    ⏳ Waiting 3s before next upload...
[07:10:05] 🚫 Blocking NVIDIA port 8003...
[07:10:05]    ⏳ Port blocked for 13 seconds

log output:

Aug 27 07:09:47 ochat uvicorn[93456]: 2025-08-27 07:09:47,969 - root - INFO - Processing embed request for /embed
Aug 27 07:09:47 ochat uvicorn[93456]: 2025-08-27 07:09:47,981 - root - INFO - Processing embed request for file_id=test-2301.00020-1756303787, filename=2301.00020.pdf, user_id=test-failover-user
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,243 - root - INFO - Processing file test-2301.00020-1756303787: split into 23 chunks with size 1500
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,245 - root - INFO - Starting embedding for file test-2301.00020-1756303787: 23 document chunks
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,245 - root - INFO - Starting NVIDIA embedding for 23 texts
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,247 - root - WARNING - NVIDIA service not responding on dgx01.arcs.oregonstate.edu:8003
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,247 - root - ERROR - Unexpected error in NVIDIA embedding: NVIDIA service not available on http://dgx01.arcs.oregonstate.edu:8003/v1/embeddings (port not listening)
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,247 - root - WARNING - ❌ PRIMARY PROVIDER FAILED: nvidia:nvidia/llama-3.2-nemoretriever-300m-embed-v1 - NVIDIA embedding failed after 1 attempts: NVIDIA service not available on http://dgx01.arcs.oregonstate.edu:8003/v1/embeddings (port not listening)
Aug 27 07:09:48 ochat uvicorn[93456]: 2025-08-27 07:09:48,247 - root - INFO - 🔄 Immediately trying backup provider bedrock:amazon.titan-embed-text-v2:0 after primary failure
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,005 - root - INFO - Successfully embedded 23 documents
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,005 - root - INFO - ✅ Backup provider bedrock:amazon.titan-embed-text-v2:0 succeeded after primary failure
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,006 - root - INFO - Successfully embedded 23 texts (bedrock:amazon.titan-embed-text-v2:0 - backup active) in 1.76s
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,035 - root - INFO - Completed embedding for file test-2301.00020-1756303787 in 1.79s
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,036 - root - INFO - Successfully completed embed request for file_id=test-2301.00020-1756303787, filename=2301.00020.pdf
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,037 - root - INFO - Request POST http://localhost:8001/embed - 200
Aug 27 07:09:50 ochat uvicorn[93456]: 2025-08-27 07:09:50,038 - root - INFO - Completed embed request for /embed
Aug 27 07:09:55 ochat uvicorn[93456]: 2025-08-27 07:09:55,906 - root - INFO - Acquiring embed semaphore for /embed (limit: 3)

dirkpetersen · 2025-08-27T15:25:42Z

@danny-avila, these are the individual changes by file, they are a lot of small changes that improve robustness / performance. I installed this service as a venv under user level systemd and let Claude parse the output and code along all day while I was testing a number of failure scenarios. This is meant to work with on-premises services on an AI/HPC cluster that are only running when nobody is using the hardware for other purposes, here is a breakdown by file. Should I add more extensive code comments ?

  🔧 Core Application Files:
  
  1. .gitignore

  What: Added .env.beta and CLAUDE.md
  Why: Prevent sensitive config files and Claude Code instructions from being committed

  2. README.md

  What: Added comprehensive documentation for new features
  Why: Document all new environment variables, backup provider configuration, NVIDIA settings, testing procedures, and arXiv PDF download script
  for users

  3. app/config.py

  What: Added NVIDIA provider, backup provider initialization, conditional AWS credential loading, optimized Bedrock timeouts
  Why: Support new embedding providers, intelligent failover system, and faster timeout handling for production use

  4. app/middleware.py

  What: Changed default EMBED_CONCURRENCY_LIMIT from 2 to 3
  Why: Optimize performance based on testing - handle more concurrent requests without overwhelming the system

  5. app/routes/document_routes.py

  What: Added graceful PostgreSQL connection error handling with 503 responses instead of 500s
  Why: Provide user-friendly error messages for temporary database issues and reduce log noise

  🚀 New Embedding Provider Files:

  6. app/services/embeddings/__init__.py

  What: Empty init file
  Why: Make the embeddings directory a proper Python package for imports

  7. app/services/embeddings/backup_embeddings.py

  What: Complete intelligent backup provider with cooldown, socket checks, immediate failover
  Why: Core functionality for seamless primary→backup→recovery cycle with 0.5s detection

  8. app/services/embeddings/bedrock_rate_limited.py

  What: Bedrock provider with Titan V2 support, reactive rate limiting, configurable dimensions
  Why: Enhanced AWS Bedrock integration with modern V2 features and intelligent rate limiting only when needed

  9. app/services/embeddings/nvidia_embeddings.py

  What: Custom NVIDIA provider with socket checks, optimized timeouts, proper API format handling
  Why: Support on-premises LLaMA embeddings with ultra-fast failover detection and NVIDIA-specific API requirements

  📊 Database & Infrastructure Files:

  10. app/services/database.py

  What: Enhanced connection pooling and optimization (from embed branch)
  Why: Improved database performance for high-throughput document processing

  11. app/services/vector_store/async_pg_vector.py

  What: Async vector store optimizations (from embed branch)Why: Better performance for concurrent embedding operations

  12. app/utils/document_loader.py

  What: Fixed UTF-8 file cleanup with null checks to prevent "NoneType" warnings
  Why: Eliminate annoying warnings during document processing

  13. main.py

  What: Application startup optimizations (from embed branch)
  Why: Improved initialization and lifecycle management

  🧪 Testing Files:

  14. test_failover_automation.py

  What: Complete automated testing script with JWT auth, service toggling, document cleanup
  Why: Validate failover system performance with real-world interruption scenarios (proven 100% success rate)

  15. tests/services/test_bedrock_embeddings.py

  What: 22 comprehensive tests for Bedrock Titan V1/V2 functionalityWhy: Ensure backward compatibility and validate V2 features (dimensions,
  normalization, error handling)

  16. tests/test_failover_mock.py

  What: 8 mock tests for backup provider and socket check functionality
  Why: CI/CD compatible tests that verify failover logic without requiring real services

  17. tests/test_titan_v2_integration.py

  What: Integration tests for performance optimizations and middlewareWhy: Validate system performance under load and configuration changes

  18. tests/test_main.py (modified)

  What: Updated mock functions to handle new batch_size parameter and fixed concurrency limit test
  Why: Maintain test compatibility with new vector store batch processing functionality

dirkpetersen · 2025-09-13T13:41:41Z

@danny-avila did you want multiple commits or multiple pull requests for this change ? It's been running well in production for about 3 weeks now

dirkpetersen force-pushed the embed-clean branch 2 times, most recently from 68c4554 to 19c5058 Compare August 27, 2025 05:00

dirkpetersen force-pushed the embed-clean branch from 19c5058 to 02f0a83 Compare August 27, 2025 14:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: RAG API optimizations with intelligent ha/backup embedding #197

feat: RAG API optimizations with intelligent ha/backup embedding #197

Uh oh!

dirkpetersen commented Aug 24, 2025

Uh oh!

danny-avila commented Aug 25, 2025

Uh oh!

dirkpetersen commented Aug 27, 2025 •

edited

Loading

Uh oh!

dirkpetersen commented Aug 27, 2025 •

edited

Loading

Uh oh!

dirkpetersen commented Sep 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: RAG API optimizations with intelligent ha/backup embedding #197

Are you sure you want to change the base?

feat: RAG API optimizations with intelligent ha/backup embedding #197

Uh oh!

Conversation

dirkpetersen commented Aug 24, 2025

🚀 New Features

Intelligent Backup Embedding System

Custom NVIDIA Embeddings Provider

Enhanced AWS Bedrock Support

Database & Performance Optimizations

📋 Configuration

Backup Provider Setup

Bedrock Titan V2 Configuration

🧪 Testing

🛠️ Technical Improvements

📚 Documentation

Uh oh!

danny-avila commented Aug 25, 2025

Uh oh!

dirkpetersen commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dirkpetersen commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dirkpetersen commented Sep 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dirkpetersen commented Aug 27, 2025 •

edited

Loading

dirkpetersen commented Aug 27, 2025 •

edited

Loading