🎬📊 Embeddings Visualization API

A FastAPI application that generates beautiful t-SNE visualizations of text embeddings for articles and movies using OpenAI's embedding model.

🚀 Features

Text Embeddings: Generate high-quality embeddings using OpenAI's text-embedding-3-small model
t-SNE Visualization: Create 2D visualizations of embedding clusters, saved as static PNG images
Multiple Data Sources:
- Local dummy data for testing
- Hugging Face datasets integration
RESTful API: Clean, documented endpoints with FastAPI
Static File Serving: Automatically generated and served visualization images
Error Handling: Comprehensive error handling with informative responses

📁 Project Structure

embeddings/
├── main.py              # Main FastAPI application
├── data/                # JSON data files
│   ├── articles.json    # Sample articles for testing
│   └── movies.json      # Sample movies for testing
├── static/              # Generated visualization images
├── test_setup.py        # Complete setup validation
├── test_data_only.py    # Quick data loading test
├── api_test.py          # API endpoint testing
├── start.sh             # Quick start script
├── .env                 # Environment variables (create this)
├── .env.example         # Environment template
├── .gitignore           # Git ignore rules
├── requirements.txt     # Python dependencies
└── README.md            # This file

🛠️ Installation & Setup

Prerequisites

Python 3.8+
OpenAI API key
Git

1. Clone the Repository

git clone https://github.com/ahmadfreijeh/embeddings-visualisation-.git
cd embeddings-visualisation-

2. Create Virtual Environment

python -m venv venv

# On macOS/Linux:
source venv/bin/activate

# On Windows:
venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Environment Setup

Create a .env file in the root directory:

OPEN_API_KEY=your_openai_api_key_here

⚠️ Important: Never commit your .env file to version control!
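As a rough illustration of how the application might read this variable at startup, the sketch below fails early with a readable message when the key is missing. The helper name `get_openai_api_key` is hypothetical and not taken from main.py; it also assumes the .env file has already been loaded into the process environment (for example by python-dotenv, which is listed in the dependencies).

```python
import os

ENV_VAR_NAME = "OPEN_API_KEY"  # variable name used in this README's .env example


def get_openai_api_key(environ=None):
    """Return the API key from the environment, or raise a clear error.

    `environ` is injectable for testing; it defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    key = env.get(ENV_VAR_NAME, "").strip()
    if not key:
        raise RuntimeError(
            f"{ENV_VAR_NAME} is not set; add it to your .env file "
            "or export it in your shell"
        )
    return key
```

Checking for the key once at startup tends to give a clearer failure than an authentication error on the first /process request.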

5. Create Requirements File

# Only needed if requirements.txt is missing; this overwrites any existing file
pip freeze > requirements.txt

Expected dependencies:

fastapi>=0.100.0
uvicorn>=0.23.0
openai>=1.0.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
numpy>=1.24.0
pandas>=2.0.0
python-dotenv>=1.0.0
python-multipart>=0.0.6

🚀 Running the Application

Development Server

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Production Server

uvicorn main:app --host 0.0.0.0 --port 8000

The API will be available at:

Application: http://localhost:8000
Interactive Docs: http://localhost:8000/docs
Alternative Docs: http://localhost:8000/redoc

⚡ Quick Start Examples

1. Test the API Health

curl http://localhost:8000/

2. Process Sample Articles

curl "http://localhost:8000/process?type=articles"

Expected output:

✅ Generates embeddings for 8 sample articles
🎨 Creates tsne_articles_dummy.png visualization
📊 Shows clustering of ML, cooking, and science topics

3. Process Sample Movies

curl "http://localhost:8000/process?type=movies"

Expected output:

✅ Processes 6 classic movies (Pulp Fiction, The Matrix, etc.)
🎨 Creates tsne_movies_dummy.png visualization
📊 Shows genre-based clustering

4. Try Real Movie Dataset

curl "http://localhost:8000/process?type=movies&source=huggingface"

Expected output:

✅ Processes 100+ real movies from Hugging Face
🎨 Creates tsne_movies_huggingface.png visualization
📊 Shows professional movie data clustering

5. Explore Available Data

curl "http://localhost:8000/data-info"

What you'll get:

📋 Complete list of available fields for each dataset
🔍 Sample data structure for each source
💡 Field auto-detection priorities

📚 API Documentation

Base URL

http://localhost:8000

Endpoints

`GET /`

Description: API information and documentation

Response:

{
  "message": "Welcome to the Embeddings Visualization API!",
  "description": "Generate t-SNE visualizations of text embeddings for articles and movies",
  "endpoints": {
    "/process": "Generate embeddings and visualizations",
    "/process?type=articles": "Process articles (dummy data)",
    "/process?type=movies": "Process movies (dummy data)",
    "/process?type=movies&source=huggingface": "Process movies from Hugging Face dataset"
  },
  "parameters": {
    "type": "Data type: 'articles' or 'movies' (default: 'articles')",
    "source": "Data source: 'dummy' or 'huggingface' (default: 'dummy', only for movies)"
  }
}

`GET /process`

Description: Process data and generate t-SNE visualization

Query Parameters:

type (string, optional): Data type to process
- articles (default): Process article data
- movies: Process movie data
source (string, optional): Data source (only for movies)
- dummy (default): Use local JSON data
- huggingface: Use Hugging Face dataset
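The parameter rules above, together with the chart file names shown in this README (tsne_articles_dummy.png, tsne_movies_huggingface.png), could be captured roughly as follows. This is a sketch only: `chart_filename` is a hypothetical helper, and how the real API responds to `source=huggingface` combined with `type=articles` is an assumption here.

```python
VALID_TYPES = ("articles", "movies")
VALID_SOURCES = ("dummy", "huggingface")


def chart_filename(data_type="articles", source="dummy"):
    """Validate the query parameters and build the expected chart file name."""
    if data_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {data_type!r}")
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}, got {source!r}")
    # Assumption: the Hugging Face source is rejected for anything but movies.
    if source == "huggingface" and data_type != "movies":
        raise ValueError("source='huggingface' is only available for type='movies'")
    return f"tsne_{data_type}_{source}.png"


print(chart_filename())                         # tsne_articles_dummy.png
print(chart_filename("movies", "huggingface"))  # tsne_movies_huggingface.png
```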

Examples:

# Process articles (default)
curl "http://localhost:8000/process"

# Process dummy movies
curl "http://localhost:8000/process?type=movies"

# Process Hugging Face movies
curl "http://localhost:8000/process?type=movies&source=huggingface"

Success Response:

{
  "success": true,
  "data": {
    "type": "articles",
    "source": "dummy",
    "count": 8,
    "texts": ["Machine learning is...", "..."],
    "chart_url": "http://localhost:8000/static/tsne_articles_dummy.png"
  }
}

Error Response:

{
  "success": false,
  "error": "Error message details",
  "message": "Failed to process data and generate visualization"
}
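A client can branch on the `success` flag to tell the two response shapes apart. The following is a minimal sketch that works on an already-decoded JSON payload; the function name `extract_chart_url` is illustrative and not part of the API.

```python
def extract_chart_url(payload):
    """Return the chart URL from a /process response, or raise on failure."""
    if payload.get("success") is True:
        return payload["data"]["chart_url"]
    # Error responses carry "error" and "message" keys (see above).
    detail = payload.get("error") or payload.get("message") or "unknown error"
    raise RuntimeError(f"/process failed: {detail}")


ok = {
    "success": True,
    "data": {"chart_url": "http://localhost:8000/static/tsne_articles_dummy.png"},
}
print(extract_chart_url(ok))
```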

🎯 Visual Examples

Example t-SNE Visualization

Here's what the generated charts look like:

[Image: example t-SNE visualization showing articles clustered by topic: ML/AI (blue cluster), Cooking (green cluster), and Science/Astronomy (red cluster)]

Articles Processing

Basic Articles Request

curl "http://localhost:8000/process?type=articles"

Response:

{
  "success": true,
  "data": {
    "type": "articles",
    "source": "dummy",
    "count": 8,
    "fields_used": {
      "text_field": "content",
      "title_field": "title"
    },
    "available_fields": ["id", "title", "content"],
    "texts": [
      "Machine learning is a subset of artificial intelligence...",
      "Italian cuisine is known for its regional diversity...",
      "The universe is estimated to be 13.8 billion years old..."
    ],
    "chart_url": "http://localhost:8000/static/tsne_articles_dummy.png"
  }
}

Custom Field Mapping for Articles

curl "http://localhost:8000/process?type=articles&text_field=content&title_field=title"

Movies Processing

Basic Movies Request (Dummy Data)

curl "http://localhost:8000/process?type=movies"

Response:

{
  "success": true,
  "data": {
    "type": "movies",
    "source": "dummy",
    "count": 6,
    "fields_used": {
      "text_field": "plot",
      "title_field": "title"
    },
    "available_fields": ["id", "title", "plot", "runtime", "genre", "released"],
    "texts": [
      "The lives of two mob hitmen, a boxer, a gangster and his wife...",
      "A computer programmer is rescued from the Matrix...",
      "A thief who enters people's dreams and steals their secrets..."
    ],
    "chart_url": "http://localhost:8000/static/tsne_movies_dummy.png"
  }
}

Hugging Face Movies Dataset

curl "http://localhost:8000/process?type=movies&source=huggingface"

Response:

{
  "success": true,
  "data": {
    "type": "movies",
    "source": "huggingface",
    "count": 100,
    "fields_used": {
      "text_field": "plot",
      "title_field": "title"
    },
    "available_fields": ["id", "title", "plot", "genre", "year", "director"],
    "texts": [
      "Real movie plot summaries from the Hugging Face dataset...",
      "Professional movie descriptions with rich metadata...",
      "Diverse collection spanning multiple decades and genres..."
    ],
    "chart_url": "http://localhost:8000/static/tsne_movies_huggingface.png"
  }
}

Custom Field Mapping for Movies

curl "http://localhost:8000/process?type=movies&text_field=plot&title_field=title&source=huggingface"

Error Handling Examples

Invalid Field Names

curl "http://localhost:8000/process?type=movies&text_field=invalid_field"

Error Response:

{
  "success": false,
  "error": "Text field 'invalid_field' not found in data and no suitable alternative detected. Available fields: ['id', 'title', 'plot', 'runtime', 'genre', 'released']. Please specify a valid text_field parameter from the available fields.",
  "error_type": "field_validation_error",
  "message": "Field validation failed. Please check available fields using /data-info endpoint.",
  "suggestion": "Use /data-info to see available fields, then specify text_field and/or title_field parameters."
}

Data Information Endpoint

curl "http://localhost:8000/data-info"

Response:

{
  "available_data_sources": {
    "articles": {
      "source": "static/dummy",
      "count": 8,
      "fields": ["id", "title", "content"],
      "sample": {
        "id": 1,
        "title": "Machine Learning Basics",
        "content": "Machine learning is a subset of artificial intelligence..."
      }
    },
    "movies_static": {
      "source": "static/dummy",
      "count": 6,
      "fields": ["id", "title", "plot", "runtime", "genre", "released"],
      "sample": {
        "id": 1,
        "title": "Pulp Fiction",
        "plot": "The lives of two mob hitmen, a boxer, a gangster and his wife...",
        "runtime": 154,
        "genre": "Crime",
        "released": 1994
      }
    },
    "movies_huggingface": {
      "source": "huggingface",
      "count": 100,
      "fields": ["id", "title", "plot", "genre", "year", "director"],
      "sample": {
        "id": 1,
        "title": "Sample Movie",
        "plot": "Professional movie description from dataset...",
        "genre": "Drama",
        "year": 2020,
        "director": "Sample Director"
      }
    }
  },
  "field_auto_detection": {
    "text_field_priority": ["content", "plot", "description", "text", "body"],
    "title_field_priority": ["title", "name", "heading", "subject"]
  },
  "usage_examples": {
    "default_articles": "/process?type=articles",
    "default_movies": "/process?type=movies",
    "custom_fields": "/process?type=movies&text_field=plot&title_field=title",
    "huggingface_movies": "/process?type=movies&source=huggingface"
  }
}
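The `field_auto_detection` priorities in this response suggest a simple first-match lookup. The sketch below shows one way that could work; `resolve_field` is a hypothetical name, and the exact fallback behaviour in main.py when a requested field is missing is not documented here, so this version simply raises.

```python
TEXT_FIELD_PRIORITY = ["content", "plot", "description", "text", "body"]
TITLE_FIELD_PRIORITY = ["title", "name", "heading", "subject"]


def resolve_field(record, priority, requested=None):
    """Pick the field to use: an explicit request wins, else first priority match."""
    if requested is not None:
        if requested in record:
            return requested
        raise KeyError(
            f"Field {requested!r} not found. Available fields: {sorted(record)}"
        )
    for candidate in priority:
        if candidate in record:
            return candidate
    raise KeyError(f"No suitable field detected. Available fields: {sorted(record)}")


movie = {"id": 1, "title": "Pulp Fiction", "plot": "The lives of two mob hitmen..."}
print(resolve_field(movie, TEXT_FIELD_PRIORITY))   # plot
print(resolve_field(movie, TITLE_FIELD_PRIORITY))  # title
```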

📊 Data Sources

1. Local JSON Files

Articles (`data/articles.json`)

Sample data covering various topics:

Machine Learning & AI
Cooking & Cuisine
Space & Astronomy

Schema:

{
  "id": "number",
  "title": "string",
  "content": "string"
}

Movies (`data/movies.json`)

Classic movie data including:

Plot summaries
Key scenes
Runtime and genre information

Schema:

{
  "id": "number",
  "title": "string",
  "plot": "string",
  "runtime": "number",
  "keyScene": "string",
  "genre": "string",
  "released": "number"
}

2. Hugging Face Integration

The API integrates with the Hugging Face leemthompo/small-movies dataset for additional movie data.

🔧 Configuration

Environment Variables

| Variable | Description | Required | Default |
|---|---|---|---|
| `OPEN_API_KEY` | OpenAI API key | ✅ Yes | None |

Application Constants

Located in main.py:

EMBEDDING_MODEL = "text-embedding-3-small"  # OpenAI model
TSNE_COMPONENTS = 2                         # t-SNE dimensions
MIN_PERPLEXITY = 5                          # Minimum perplexity for t-SNE
FIGURE_SIZE = (10, 8)                       # Plot dimensions
FONT_SIZE = 8                               # Annotation font size
STATIC_DIR = "static"                       # Static files directory
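scikit-learn's t-SNE requires the perplexity to be smaller than the number of samples, which matters for the small sample datasets (6 movies, 8 articles). How main.py actually uses `MIN_PERPLEXITY` is not shown in this README; the sketch below is one plausible approach, with `choose_perplexity` as a hypothetical helper.

```python
MIN_PERPLEXITY = 5  # same value as the constant listed above


def choose_perplexity(n_samples, preferred=MIN_PERPLEXITY):
    """Return a perplexity that stays strictly below the sample count."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return float(min(preferred, n_samples - 1))


print(choose_perplexity(8))  # 5.0 for the 8 sample articles
print(choose_perplexity(3))  # 2.0 for a very small dataset
```

The result would then be passed as the `perplexity` argument of `sklearn.manifold.TSNE`, alongside `n_components=TSNE_COMPONENTS`.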

🧪 Testing

Quick Data Test (No Dependencies)

Test just the data loading without installing packages:

python test_data_only.py

This validates:

JSON file syntax and structure
Data loading functionality
Required fields in data
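For readers curious what a required-field check can look like with only the standard library, here is a minimal sketch. It is not the contents of test_data_only.py; the `REQUIRED_FIELDS` mapping and `find_missing_fields` helper are assumptions based on the schemas documented in this README.

```python
import json

# Assumed minimum fields per dataset, based on the schemas in this README.
REQUIRED_FIELDS = {
    "articles": ("id", "title", "content"),
    "movies": ("id", "title", "plot"),
}


def find_missing_fields(records, required):
    """Return (index, missing_fields) pairs for records lacking required keys."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in required if field not in record]
        if missing:
            problems.append((index, missing))
    return problems


# Inline sample standing in for the contents of data/articles.json.
raw = '[{"id": 1, "title": "ML Basics", "content": "..."}, {"id": 2, "title": "Pasta"}]'
articles = json.loads(raw)  # raises json.JSONDecodeError on a syntax error
print(find_missing_fields(articles, REQUIRED_FIELDS["articles"]))
```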

Setup Validation

Test environment, data, and dependencies:

python test_setup.py

This checks:

Data files and structure
Environment configuration
Package availability
Code syntax

API Testing

Test all API endpoints (requires running server):

python api_test.py

This validates:

Server connectivity
All endpoint functionality
Response formats
Static file serving

Manual Testing

Health Check:

curl http://localhost:8000/

Process Articles:

curl "http://localhost:8000/process?type=articles"

Process Movies:

curl "http://localhost:8000/process?type=movies"

Custom Test Scripts

Create a test script (test_api.py):

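import requests

BASE_URL = "http://localhost:8000"
TIMEOUT = 60  # seconds; /process calls OpenAI and runs t-SNE, so allow some slack

def test_health():
    # The root endpoint returns API information, including a "message" field.
    response = requests.get(f"{BASE_URL}/", timeout=TIMEOUT)
    assert response.status_code == 200
    data = response.json()
    assert "message" in data

def test_process_articles():
    # Pass query parameters via `params` so requests handles URL encoding.
    response = requests.get(
        f"{BASE_URL}/process", params={"type": "articles"}, timeout=TIMEOUT
    )
    assert response.status_code == 200
    data = response.json()
    assert data["success"] is True
    assert "chart_url" in data["data"]

if __name__ == "__main__":
    test_health()
    test_process_articles()
    print("All tests passed! ✅")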

🐛 Troubleshooting

Common Issues

OpenAI API Key Error:
```
Error: Invalid API key
```
Solution: Check your .env file and ensure OPEN_API_KEY is correct.

Module Import Error:
```
ModuleNotFoundError: No module named 'fastapi'
```
Solution: Activate the virtual environment and install the dependencies.

Hugging Face Dataset Error:
```
Failed to load Hugging Face dataset
```
Solution: Check your internet connection. The API will fall back to dummy data.

Port Already in Use:
```
OSError: [Errno 48] Address already in use
```
Solution: Use a different port or kill the existing process.

Debug Mode

Run with debug logging:

uvicorn main:app --reload --log-level debug

🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Development Guidelines

Follow PEP 8 style guidelines
Add type hints to all functions
Include docstrings for public functions
Add error handling for external API calls
Update tests when adding new features

📈 Performance Considerations

Embedding Generation: ~1-2 seconds per request (depends on text length)
t-SNE Computation: ~2-5 seconds (depends on data size)
Memory Usage: ~50-100MB (depends on dataset size)
Rate Limits: Subject to OpenAI API rate limits
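One way to keep embedding latency and rate-limit pressure down is to send all texts in a single embeddings request, since the OpenAI embeddings endpoint accepts a list of inputs. The sketch below assumes an OpenAI v1-style client object is passed in (for example `OpenAI(api_key=...)` from the `openai` package); `embed_texts` is a hypothetical helper, and whether main.py batches this way is not stated in this README.

```python
EMBEDDING_MODEL = "text-embedding-3-small"


def embed_texts(client, texts, model=EMBEDDING_MODEL):
    """Embed all texts in one request and return vectors in input order."""
    if not texts:
        return []
    response = client.embeddings.create(model=model, input=list(texts))
    # Each result carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Passing the client in as an argument also makes the function easy to exercise in tests with a stub, without spending API quota.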

🔮 Future Enhancements

Caching: Redis integration for embedding caching
Authentication: API key-based access control
Batch Processing: Process multiple datasets simultaneously
3D Visualizations: Optional 3D t-SNE plots
Custom Datasets: Upload and process custom JSON files
Export Options: PDF, SVG export formats
Interactive Plots: Plotly integration for interactive visualizations

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

OpenAI for providing excellent embedding models
FastAPI for the robust web framework
Scikit-learn for t-SNE implementation
Hugging Face for dataset integration
Matplotlib for visualization capabilities

Happy coding! 🚀✨
