RAGVenture

RAGVenture is an intelligent startup idea generator powered by Retrieval-Augmented Generation (RAG). It helps entrepreneurs generate innovative startup ideas by learning from successful companies, combining the power of large language models with real-world startup data.

Why RAGVenture?

Traditional startup ideation tools either rely on expensive API calls or generate ideas without real-world context. RAGVenture addresses both problems:

Completely FREE: Runs entirely on your machine with no API costs - zero API keys required!
Smart Model Management: Automatically handles model deprecation and failures with intelligent fallback
Data-Driven: Learns from real startup data to ground suggestions in reality
Context-Aware: Understands patterns from successful startups
Intelligent: Uses RAG to combine LLM capabilities with precise information retrieval
Resilient: Works offline with local models when external APIs are unavailable
Production-Ready: 177 tests with comprehensive coverage, Docker runtime fixes, and monitoring

System Requirements

Python 3.11 or higher
8GB RAM minimum (16GB recommended)
2GB disk space for models and data
Operating Systems:
- Linux (recommended)
- macOS
- Windows (with WSL for best performance)

Quick Start

Installation:

# Clone the repository
git clone https://github.com/valginer0/RAGVenture.git
cd RAGVenture

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Windows:
.venv\Scripts\activate
# On Unix or macOS:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install spaCy language model for market analysis
python -m spacy download en_core_web_sm

Environment Setup (Optional - system works completely FREE without any setup!):

# Optional: Hugging Face token for enhanced remote models
export HUGGINGFACE_TOKEN="your-token-here"  # Get from huggingface.co

# Smart model management (enabled by default)
export RAG_SMART_MODELS=true
export RAG_MODEL_CHECK_INTERVAL=3600
export RAG_MODEL_TIMEOUT=60

# Optional: LangChain tracing (debugging)
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY="your-langsmith-api-key"
export LANGCHAIN_PROJECT="your-project-name"
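
For readers who want to see how such settings might be consumed, here is a minimal sketch that reads the variables above into a typed object. The `ModelSettings` class and `_env_bool` helper are hypothetical names, not part of RAGVenture, and the fallback values (3600 and 60) are assumptions taken from the example values shown above rather than confirmed defaults.

```python
import os
from dataclasses import dataclass
from typing import Optional


def _env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; use the default when the variable is unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ModelSettings:
    """Hypothetical container for the model-related environment variables."""

    smart_models: bool
    check_interval_s: int
    timeout_s: int
    hf_token: Optional[str]

    @classmethod
    def from_env(cls) -> "ModelSettings":
        return cls(
            # Smart model management is documented as enabled by default.
            smart_models=_env_bool("RAG_SMART_MODELS", True),
            # Assumed fallbacks: the example values from the setup snippet.
            check_interval_s=int(os.environ.get("RAG_MODEL_CHECK_INTERVAL", "3600")),
            timeout_s=int(os.environ.get("RAG_MODEL_TIMEOUT", "60")),
            # The token is optional; an empty string is treated as "not set".
            hf_token=os.environ.get("HUGGINGFACE_TOKEN") or None,
        )
```

Collecting the values in one frozen object keeps parsing errors (for example a non-numeric timeout) at startup instead of deep inside a model call.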

Generate Ideas:

# Generate 3 startup ideas in the AI domain
python -m rag_startups.cli generate-all "AI" --num-ideas 3

# Generate ideas without market analysis
python -m rag_startups.cli generate-all "fintech" --num-ideas 2 --no-market

# Check model health and status
python -m rag_startups.cli models status

# Use custom startup data file
python -m rag_startups.cli generate-all "education" --file custom_startups.json

Features & Capabilities

Core Features

Intelligent Idea Generation:
- Uses RAG to combine LLM knowledge with real startup data
- Generates contextually relevant and grounded ideas
- Provides structured output with problem, solution, and market analysis

Command-Line Interface

Commands:

generate-all: Generate startup ideas with market analysis
- Required argument: Topic or domain (e.g., "AI", "fintech")
- Options:
  - --num-ideas: Number of ideas (1-5, default: 1)
  - --file: Custom startup data file (default: yc_startups.json)
  - --market/--no-market: Include/exclude market analysis
  - --temperature: Model creativity (0.0-1.0)
  - --print-examples: Show relevant examples
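
To make the documented ranges concrete, the option list above can be mirrored in a standalone `argparse` parser. This is only a sketch: the real CLI may be built with a different library, the default for `--market` (enabled) is an assumption, and `--temperature` defaults to `None` here because the README does not state a default.

```python
import argparse


def _bounded(cast, lo, hi):
    """Build an argparse type that casts the value and enforces lo <= value <= hi."""

    def parse(value: str):
        number = cast(value)
        if not lo <= number <= hi:
            raise argparse.ArgumentTypeError(f"must be between {lo} and {hi}")
        return number

    return parse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented generate-all options."""
    parser = argparse.ArgumentParser(prog="generate-all")
    parser.add_argument("topic", help='Topic or domain, e.g. "AI" or "fintech"')
    parser.add_argument("--num-ideas", type=_bounded(int, 1, 5), default=1)
    parser.add_argument("--file", default="yc_startups.json")
    # BooleanOptionalAction (Python 3.9+) creates both --market and --no-market.
    parser.add_argument("--market", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--temperature", type=_bounded(float, 0.0, 1.0), default=None)
    parser.add_argument("--print-examples", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["fintech", "--num-ideas", "2", "--no-market"])` yields a namespace with `num_ideas=2` and `market=False`; an out-of-range value such as `--num-ideas 9` makes the parser exit with a usage error.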

Smart Model Management

Automatic Fallback: Falls back to local models when external APIs fail
Model Migration Intelligence: Handles model deprecation (e.g., Mistral v0.2→v0.3) automatically
Health Monitoring: Continuous model health checks and status reporting
Local Resilience: Works completely offline with local models
CLI Management: models command for status, testing, and diagnostics

Technical Features

Smart Analysis:
- Semantic search for relevant examples
- Automatic metadata extraction
- Pattern recognition from successful startups
Performance Optimized:
- One-time embedding generation (~22s)
- Fast idea generation (~0.5s per idea)
- Efficient data processing (~0.1s load time)
Production Quality:
- Comprehensive unit test suite
- Automated code formatting
- Extensive error handling

Performance

Typical processing times on a standard machine:

Initial Setup: ~22s (one-time embedding generation)
Data Loading: ~0.1s
Idea Generation: ~0.5s per idea
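
The one-time cost of embedding generation suggests a cache keyed on the data file. The sketch below shows one way such a cache could work; it is an assumption, not the project's code. `cached_embeddings` is a hypothetical name, the data file is assumed to be a JSON array of strings, and `embed` is any function that maps texts to vectors.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_embeddings(
    data_file: Path,
    cache_dir: Path,
    embed: Callable[[list[str]], list[list[float]]],
) -> list[list[float]]:
    """Embed the texts in data_file once; reuse the cached vectors afterwards."""
    raw = data_file.read_bytes()
    # Key the cache on the file contents so edited data triggers a rebuild.
    key = hashlib.sha256(raw).hexdigest()[:16]
    cache_path = cache_dir / f"embeddings-{key}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    texts = [str(item) for item in json.loads(raw)]
    vectors = embed(texts)  # the slow step, executed only on a cache miss
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(vectors), encoding="utf-8")
    return vectors
```

Hashing the file contents rather than its name means a modified dataset is re-embedded automatically, while repeated runs on unchanged data skip the expensive step.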

Docker Support

For containerized deployment, we provide both CPU and GPU support.

Prerequisites

Docker and Docker Compose
For GPU support:
- NVIDIA GPU with CUDA
- NVIDIA Container Toolkit (supersedes the legacy nvidia-docker2 package)

Quick Start with Docker

# CPU Version (recommended - fully tested)
docker-compose up app-cpu

# GPU Version (with NVIDIA support)
docker-compose up app-gpu

# Run with custom data file
docker-compose run --rm app-cpu python -m rag_startups.cli generate-all fintech --num-ideas 1 --file /app/yc_startups.json

Docker Status: ✅ Production Ready - All runtime issues resolved, works end-to-end with real data.

Run from GitHub Container Registry (GHCR)

If you prefer pulling a prebuilt image from GHCR:

Log in to GHCR (requires a GitHub token with the read:packages scope). Replace <USERNAME> and <TOKEN> with your own values.

echo "$GITHUB_TOKEN" | docker login ghcr.io -u <USERNAME> --password-stdin
# or
echo "<TOKEN>" | docker login ghcr.io -u <USERNAME> --password-stdin

Pull the image:

docker pull ghcr.io/valginer0/rag_startups:0.9.2
# or latest if available
docker pull ghcr.io/valginer0/rag_startups:latest

Run the CLI (using your local .env for tokens/settings):

docker run --rm -it \
  --env-file .env \
  ghcr.io/valginer0/rag_startups:0.9.2 \
  python -m rag_startups.cli generate-all "AI" --num-ideas 2

Tip: For offline/deterministic runs, set HUGGINGFACE_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 in your .env.

Maintainers – Publish to GHCR:

# After building locally (e.g., docker build -t ghcr.io/valginer0/rag_startups:dev .)
docker tag ghcr.io/valginer0/rag_startups:dev ghcr.io/valginer0/rag_startups:0.9.2
docker push ghcr.io/valginer0/rag_startups:0.9.2
docker tag ghcr.io/valginer0/rag_startups:0.9.2 ghcr.io/valginer0/rag_startups:latest
docker push ghcr.io/valginer0/rag_startups:latest

Development Setup

Clone and setup:

git clone https://github.com/valginer0/RAGVenture.git
cd RAGVenture
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install development dependencies:

pip install -r requirements.txt
pre-commit install  # Sets up automatic code formatting

Run tests:

pytest tests/  # All tests should pass

Testing & Offline Policy

This project enforces fully offline, deterministic tests:

Tests block outbound HTTP(S) by default via an autouse fixture in tests/conftest.py that patches requests.sessions.Session.request.
Autouse fixtures also mock model-loading/network paths:
- huggingface_hub.model_info in rag_startups/cli.py preflight
- transformers.pipeline at all call sites (e.g., rag_startups.embed_master, rag_startups.core.rag_chain, CLI)
- huggingface_hub.InferenceClient and the bound imports used by rag_startups/idea_generator/generator.py
- rag_startups.embed_master.calculate_result is replaced with a deterministic helper during tests
Offline env vars are forced: HUGGINGFACE_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1.
To explicitly allow network in a specific test, add the marker: @pytest.mark.allow_network.

Runtime (non-test) CLI runs are allowed to use the network and will honor your .env.

Data Requirements

RAGVenture works with startup data in JSON format. Two options:

Use YC Data (Recommended):
- Download from Y Combinator
- Convert CSV to JSON:
```
python -m rag_startups.data.convert_yc_data input.csv -o startups.json
```
Use Custom Data:
- Prepare JSON file with required fields
- See docs/data_format.md for schema
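
For orientation, the CSV-to-JSON step mentioned above can be approximated with the standard library alone. This sketch is not the project's `convert_yc_data` module: `csv_to_json` is a hypothetical helper that copies every column as-is, whereas the real converter may rename or filter fields to match the schema in docs/data_format.md.

```python
import csv
import json
from pathlib import Path


def csv_to_json(csv_path: Path, json_path: Path) -> int:
    """Write every CSV row as a JSON object keyed by column name; return the row count."""
    with csv_path.open(newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))
    json_path.write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(rows)
```

The output is a JSON array with one object per row, which is a convenient starting point for mapping onto the required fields.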

Troubleshooting

Embedding Generation Time:
- First run takes ~22s to generate embeddings
- Subsequent runs use cached embeddings
- GPU can significantly speed up this process
Common Issues:
- Missing HUGGINGFACE_TOKEN: The token is optional. To use remote models, create one at huggingface.co
- Memory errors: Reduce batch size with --max-lines
- GPU errors: Ensure CUDA toolkit is properly installed

Documentation

docs/api.md: API documentation
docs/examples.md: Usage examples
docs/data_format.md: Data schema
CONTRIBUTING.md: Development guidelines

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

This project is licensed under the MIT License - see LICENSE for details.

Startup Names and Legal Considerations

Name Generation

Each generated startup name includes a unique identifier (e.g., "TechStartup-x7y9z")
This identifier ensures technical uniqueness within the tool
The unique identifier is NOT a substitute for legal name verification

Important Notes for Users

Generated names are suggestions only
The uniqueness of a name at generation time does not guarantee its availability
Users must perform their own due diligence before using any name

Name Verification Resources

USPTO Trademark Database: https://www.uspto.gov/trademarks
State Business Registries
Domain Name Availability Tools
Professional Legal Counsel

Future Features

Name availability checking tool (planned)
Integration with business registry APIs
