A FastAPI application that generates beautiful t-SNE visualizations of text embeddings for articles and movies using OpenAI's embedding model.
- Text Embeddings: Generate high-quality embeddings using OpenAI's text-embedding-3-small model
- t-SNE Visualization: Create 2D visualizations of embedding clusters
- Multiple Data Sources:
  - Local dummy data for testing
  - Hugging Face datasets integration
- RESTful API: Clean, documented endpoints with FastAPI
- Static File Serving: Automatically generated and served visualization images
- Error Handling: Comprehensive error handling with informative responses
embeddings/
├── main.py               # Main FastAPI application
├── data/                 # JSON data files
│   ├── articles.json     # Sample articles for testing
│   └── movies.json       # Sample movies for testing
├── static/               # Generated visualization images
├── test_setup.py         # Complete setup validation
├── test_data_only.py     # Quick data loading test
├── api_test.py           # API endpoint testing
├── start.sh              # Quick start script
├── .env                  # Environment variables (create this)
├── .env.example          # Environment template
├── .gitignore            # Git ignore rules
├── requirements.txt      # Python dependencies
└── README.md             # This file
- Python 3.8+
- OpenAI API key
- Git
git clone https://github.com/ahmadfreijeh/embeddings-visualisation-.git
cd embeddings-visualisation-
python -m venv venv
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
pip install -r requirements.txt
Create a .env file in the root directory:
OPEN_API_KEY=your_openai_api_key_here
Never commit the .env file to version control!
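A missing or placeholder key is easiest to catch at startup rather than on the first embedding request. The sketch below shows one way to do that with only the standard library. `require_api_key` is a hypothetical helper, not something taken from main.py, and it assumes the variables from .env have already been loaded into the environment (for example by python-dotenv, which is listed in the dependencies).

```python
import os


def require_api_key(env=None, name="OPEN_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict when testing.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    # Treat the template placeholder from the setup step as "not configured".
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Calling `require_api_key()` once when the application starts turns a confusing downstream API failure into an immediate, readable error.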
# Generate requirements.txt if it doesn't exist
pip freeze > requirements.txt
Expected dependencies:
fastapi>=0.100.0
uvicorn>=0.23.0
openai>=1.0.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
numpy>=1.24.0
pandas>=2.0.0
python-dotenv>=1.0.0
python-multipart>=0.0.6
# With auto-reload (development)
uvicorn main:app --reload --host 0.0.0.0 --port 8000
# Without auto-reload
uvicorn main:app --host 0.0.0.0 --port 8000
The API will be available at:
- Application: http://localhost:8000
- Interactive Docs: http://localhost:8000/docs
- Alternative Docs: http://localhost:8000/redoc
curl http://localhost:8000/
curl "http://localhost:8000/process?type=articles"
Expected output:
- Generates embeddings for 8 sample articles
- Creates the tsne_articles_dummy.png visualization
- Shows clustering of ML, cooking, and science topics
curl "http://localhost:8000/process?type=movies"
Expected output:
- Processes 6 classic movies (Pulp Fiction, The Matrix, etc.)
- Creates the tsne_movies_dummy.png visualization
- Shows genre-based clustering
curl "http://localhost:8000/process?type=movies&source=huggingface"
Expected output:
- Processes 100+ real movies from Hugging Face
- Creates the tsne_movies_huggingface.png visualization
- Shows professional movie data clustering
curl "http://localhost:8000/data-info"
What you'll get:
- Complete list of available fields for each dataset
- Sample data structure for each source
- Field auto-detection priorities
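To make the field auto-detection concrete, here is a minimal sketch of how a priority-based lookup could work. The two priority lists are the ones reported by the /data-info endpoint. The `detect_field` function and its exact behaviour are assumptions made for illustration and may differ from what main.py actually does.

```python
# Priority lists as reported by the /data-info endpoint.
TEXT_FIELD_PRIORITY = ["content", "plot", "description", "text", "body"]
TITLE_FIELD_PRIORITY = ["title", "name", "heading", "subject"]


def detect_field(record, priority, requested=None):
    """Pick the field to use from a single data record (hypothetical helper).

    An explicitly requested field wins when it exists in the record;
    otherwise the first name from the priority list that is present is used.
    """
    if requested is not None and requested in record:
        return requested
    for name in priority:
        if name in record:
            return name
    raise ValueError(
        f"No suitable field found. Available fields: {sorted(record)}"
    )


article = {"id": 1, "title": "Machine Learning Basics", "content": "..."}
movie = {"id": 1, "title": "Pulp Fiction", "plot": "...", "genre": "Crime"}

print(detect_field(article, TEXT_FIELD_PRIORITY))  # content
print(detect_field(movie, TEXT_FIELD_PRIORITY))    # plot
print(detect_field(movie, TITLE_FIELD_PRIORITY))   # title
```

With this scheme the articles resolve to `content` and the movies to `plot` without any extra query parameters, which matches the defaults shown in the example responses.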
http://localhost:8000
Description: API information and documentation
Response:
{
"message": "Welcome to the Embeddings Visualization API!",
"description": "Generate t-SNE visualizations of text embeddings for articles and movies",
"endpoints": {
"/process": "Generate embeddings and visualizations",
"/process?type=articles": "Process articles (dummy data)",
"/process?type=movies": "Process movies (dummy data)",
"/process?type=movies&source=huggingface": "Process movies from Hugging Face dataset"
},
"parameters": {
"type": "Data type: 'articles' or 'movies' (default: 'articles')",
"source": "Data source: 'dummy' or 'huggingface' (default: 'dummy', only for movies)"
}
}
Description: Process data and generate t-SNE visualization
Query Parameters:
- type (string, optional): Data type to process
  - articles (default): Process article data
  - movies: Process movie data
- source (string, optional): Data source (only for movies)
  - dummy (default): Use local JSON data
  - huggingface: Use Hugging Face dataset
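When calling the endpoint from Python rather than curl, it can help to build the query string programmatically so that parameters are validated and encoded consistently. The sketch below uses only the standard library. `build_process_url` is a hypothetical helper, and the optional `text_field` and `title_field` arguments mirror the parameters used in the examples further down in this README.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"


def build_process_url(data_type="articles", source="dummy",
                      text_field=None, title_field=None, base_url=BASE_URL):
    """Build a /process URL from the documented query parameters."""
    if data_type not in ("articles", "movies"):
        raise ValueError("type must be 'articles' or 'movies'")
    if source not in ("dummy", "huggingface"):
        raise ValueError("source must be 'dummy' or 'huggingface'")

    params = {"type": data_type}
    if source != "dummy":          # 'dummy' is the server-side default
        params["source"] = source
    if text_field:
        params["text_field"] = text_field
    if title_field:
        params["title_field"] = title_field
    return f"{base_url}/process?{urlencode(params)}"


print(build_process_url("movies", source="huggingface"))
```

The resulting string can be passed directly to an HTTP client such as `requests.get`.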
Examples:
# Process articles (default)
curl "http://localhost:8000/process"
# Process dummy movies
curl "http://localhost:8000/process?type=movies"
# Process Hugging Face movies
curl "http://localhost:8000/process?type=movies&source=huggingface"
Success Response:
{
"success": true,
"data": {
"type": "articles",
"source": "dummy",
"count": 8,
"texts": ["Machine learning is...", "..."],
"chart_url": "http://localhost:8000/static/tsne_articles_dummy.png"
}
}
Error Response:
{
"success": false,
"error": "Error message details",
"message": "Failed to process data and generate visualization"
}
Here's what the generated charts look like:
Example t-SNE visualization showing clustering of articles by topic: ML/AI (blue cluster), Cooking (green cluster), and Science/Astronomy (red cluster)
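Client code usually needs to tell the success and error response shapes apart before it can fetch the chart. The following sketch shows one way to unwrap the response. `chart_url_from_response` is a hypothetical helper that relies only on the `success`, `data.chart_url`, `error`, and `message` keys shown in the /process responses above.

```python
def chart_url_from_response(payload):
    """Return the chart URL from a /process response, or raise on failure."""
    if payload.get("success") is True:
        return payload["data"]["chart_url"]
    # Error responses carry a human-readable message plus error details.
    error = payload.get("error", "unknown error")
    message = payload.get("message", "")
    raise RuntimeError(f"{message}: {error}" if message else error)


ok = {
    "success": True,
    "data": {"chart_url": "http://localhost:8000/static/tsne_articles_dummy.png"},
}
print(chart_url_from_response(ok))
```

Raising an exception for the error shape keeps the calling code free of repeated `if data["success"]` checks.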
curl "http://localhost:8000/process?type=articles"
Response:
{
"success": true,
"data": {
"type": "articles",
"source": "dummy",
"count": 8,
"fields_used": {
"text_field": "content",
"title_field": "title"
},
"available_fields": ["id", "title", "content"],
"texts": [
"Machine learning is a subset of artificial intelligence...",
"Italian cuisine is known for its regional diversity...",
"The universe is estimated to be 13.8 billion years old..."
],
"chart_url": "http://localhost:8000/static/tsne_articles_dummy.png"
}
}
curl "http://localhost:8000/process?type=articles&text_field=content&title_field=title"
curl "http://localhost:8000/process?type=movies"
Response:
{
"success": true,
"data": {
"type": "movies",
"source": "dummy",
"count": 6,
"fields_used": {
"text_field": "plot",
"title_field": "title"
},
"available_fields": ["id", "title", "plot", "runtime", "genre", "released"],
"texts": [
"The lives of two mob hitmen, a boxer, a gangster and his wife...",
"A computer programmer is rescued from the Matrix...",
"A thief who enters people's dreams and steals their secrets..."
],
"chart_url": "http://localhost:8000/static/tsne_movies_dummy.png"
}
}
curl "http://localhost:8000/process?type=movies&source=huggingface"
Response:
{
"success": true,
"data": {
"type": "movies",
"source": "huggingface",
"count": 100,
"fields_used": {
"text_field": "plot",
"title_field": "title"
},
"available_fields": ["id", "title", "plot", "genre", "year", "director"],
"texts": [
"Real movie plot summaries from the Hugging Face dataset...",
"Professional movie descriptions with rich metadata...",
"Diverse collection spanning multiple decades and genres..."
],
"chart_url": "http://localhost:8000/static/tsne_movies_huggingface.png"
}
}
curl "http://localhost:8000/process?type=movies&text_field=plot&title_field=title&source=huggingface"
curl "http://localhost:8000/process?type=movies&text_field=invalid_field"
Error Response:
{
"success": false,
"error": "Text field 'invalid_field' not found in data and no suitable alternative detected. Available fields: ['id', 'title', 'plot', 'runtime', 'genre', 'released']. Please specify a valid text_field parameter from the available fields.",
"error_type": "field_validation_error",
"message": "Field validation failed. Please check available fields using /data-info endpoint.",
"suggestion": "Use /data-info to see available fields, then specify text_field and/or title_field parameters."
}
curl "http://localhost:8000/data-info"
Response:
{
"available_data_sources": {
"articles": {
"source": "static/dummy",
"count": 8,
"fields": ["id", "title", "content"],
"sample": {
"id": 1,
"title": "Machine Learning Basics",
"content": "Machine learning is a subset of artificial intelligence..."
}
},
"movies_static": {
"source": "static/dummy",
"count": 6,
"fields": ["id", "title", "plot", "runtime", "genre", "released"],
"sample": {
"id": 1,
"title": "Pulp Fiction",
"plot": "The lives of two mob hitmen, a boxer, a gangster and his wife...",
"runtime": 154,
"genre": "Crime",
"released": 1994
}
},
"movies_huggingface": {
"source": "huggingface",
"count": 100,
"fields": ["id", "title", "plot", "genre", "year", "director"],
"sample": {
"id": 1,
"title": "Sample Movie",
"plot": "Professional movie description from dataset...",
"genre": "Drama",
"year": 2020,
"director": "Sample Director"
}
}
},
"field_auto_detection": {
"text_field_priority": ["content", "plot", "description", "text", "body"],
"title_field_priority": ["title", "name", "heading", "subject"]
},
"usage_examples": {
"default_articles": "/process?type=articles",
"default_movies": "/process?type=movies",
"custom_fields": "/process?type=movies&text_field=plot&title_field=title",
"huggingface_movies": "/process?type=movies&source=huggingface"
}
}
Sample data covering various topics:
- Machine Learning & AI
- Cooking & Cuisine
- Space & Astronomy
Schema:
{
"id": "number",
"title": "string",
"content": "string"
}
Classic movie data including:
- Plot summaries
- Key scenes
- Runtime and genre information
Schema:
{
"id": "number",
"title": "string",
"plot": "string",
"runtime": "number",
"keyScene": "string",
"genre": "string",
"released": "number"
}
The API integrates with the Hugging Face leemthompo/small-movies dataset for additional movie data.
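The troubleshooting notes in this README say the API falls back to dummy data when the Hugging Face dataset cannot be loaded. A minimal sketch of that fallback pattern is shown below. `load_movies` and the two loader callables are hypothetical names used for illustration, so the actual code in main.py may be organised differently.

```python
def load_movies(source, load_remote, load_local):
    """Return (records, source_used), falling back to local data on failure.

    `load_remote` and `load_local` are caller-supplied functions that return
    a list of movie records; both names are illustrative assumptions.
    """
    if source == "huggingface":
        try:
            return load_remote(), "huggingface"
        except Exception as exc:
            # Network or dataset problems should not take the endpoint down.
            print(f"Hugging Face load failed ({exc}); using dummy data instead")
    return load_local(), "dummy"


def failing_remote():
    raise ConnectionError("no internet connection")


def local_movies():
    return [{"id": 1, "title": "Pulp Fiction", "plot": "..."}]


records, used = load_movies("huggingface", failing_remote, local_movies)
print(used, len(records))
```

Returning the source that was actually used lets the response report `dummy` honestly when the fallback was taken.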
| Variable | Description | Required | Default |
|---|---|---|---|
| OPEN_API_KEY | OpenAI API key | Yes | None |
Located in main.py:
EMBEDDING_MODEL = "text-embedding-3-small" # OpenAI model
TSNE_COMPONENTS = 2 # t-SNE dimensions
MIN_PERPLEXITY = 5 # Minimum perplexity for t-SNE
FIGURE_SIZE = (10, 8) # Plot dimensions
FONT_SIZE = 8 # Annotation font size
STATIC_DIR = "static" # Static files directory
Test just the data loading without installing packages:
python test_data_only.py
This validates:
- JSON file syntax and structure
- Data loading functionality
- Required fields in data
Test environment, data, and dependencies:
python test_setup.py
This checks:
- Data files and structure
- Environment configuration
- Package availability
- Code syntax
Test all API endpoints (requires running server):
python api_test.py
This validates:
- Server connectivity
- All endpoint functionality
- Response formats
- Static file serving
- Health Check:
  curl http://localhost:8000/
- Process Articles:
  curl "http://localhost:8000/process?type=articles"
- Process Movies:
  curl "http://localhost:8000/process?type=movies"
Create a test script (test_api.py):
import requests

BASE_URL = "http://localhost:8000"


def test_health():
    # The root endpoint should respond with the welcome payload.
    response = requests.get(f"{BASE_URL}/")
    assert response.status_code == 200
    data = response.json()
    assert "message" in data


def test_process_articles():
    # Processing the dummy articles should succeed and return a chart URL.
    response = requests.get(f"{BASE_URL}/process?type=articles")
    assert response.status_code == 200
    data = response.json()
    assert data["success"] is True
    assert "chart_url" in data["data"]


if __name__ == "__main__":
    test_health()
    test_process_articles()
    print("All tests passed!")
- OpenAI API Key Error:
  Error: Invalid API key
  Solution: Check your .env file and ensure OPEN_API_KEY is correct.
- Module Import Error:
  ModuleNotFoundError: No module named 'fastapi'
  Solution: Activate the virtual environment and install the dependencies.
- Hugging Face Dataset Error:
  Failed to load Hugging Face dataset
  Solution: Check your internet connection. The API will fall back to dummy data.
- Port Already in Use:
  OSError: [Errno 48] Address already in use
  Solution: Use a different port or kill the existing process.
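For the port conflict in particular, a quick check before starting the server can save a restart cycle. The sketch below uses only the standard library. `port_in_use` is a hypothetical helper, and it only detects listeners reachable on the given host (127.0.0.1 by default).

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) == 0


if port_in_use(8000):
    print("Port 8000 is busy; try: uvicorn main:app --port 8001")
else:
    print("Port 8000 is free")
```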
Run with debug logging:
uvicorn main:app --reload --log-level debug
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add type hints to all functions
- Include docstrings for public functions
- Add error handling for external API calls
- Update tests when adding new features
- Embedding Generation: ~1-2 seconds per request (depends on text length)
- t-SNE Computation: ~2-5 seconds (depends on data size)
- Memory Usage: ~50-100MB (depends on dataset size)
- Rate Limits: Subject to OpenAI API rate limits
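Dataset size also constrains the t-SNE settings themselves: scikit-learn requires the perplexity to be strictly smaller than the number of samples, so its default of 30 cannot be used with the 8 sample articles or 6 sample movies. The sketch below shows one possible way to pick a safe value. `choose_perplexity` and its selection rule are assumptions made for illustration, and only the `MIN_PERPLEXITY = 5` constant comes from the configuration described in this README.

```python
MIN_PERPLEXITY = 5        # from the configuration constants in main.py
DEFAULT_PERPLEXITY = 30   # scikit-learn's default for TSNE


def choose_perplexity(n_samples, minimum=MIN_PERPLEXITY,
                      default=DEFAULT_PERPLEXITY):
    """Pick a perplexity that is always strictly below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    # Use the library default only when the dataset is large enough for it.
    perplexity = default if n_samples > default else minimum
    # Clamp so the value stays strictly smaller than the sample count.
    return min(perplexity, n_samples - 1)


for n in (4, 6, 8, 100):
    print(n, choose_perplexity(n))
```

The returned value could then be passed as the `perplexity` argument when constructing `sklearn.manifold.TSNE`.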
- Caching: Redis integration for embedding caching
- Authentication: API key-based access control
- Batch Processing: Process multiple datasets simultaneously
- 3D Visualizations: Optional 3D t-SNE plots
- Custom Datasets: Upload and process custom JSON files
- Export Options: PDF, SVG export formats
- Interactive Plots: Plotly integration for interactive visualizations
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing excellent embedding models
- FastAPI for the robust web framework
- Scikit-learn for t-SNE implementation
- Hugging Face for dataset integration
- Matplotlib for visualization capabilities
Happy coding!
