aegatlin/fuzzy_dictionary

Fuzzy Dictionary

A semantic search-powered dictionary that lets you find words by describing their meaning in natural language.

Features

  • Semantic Search: Type natural language descriptions and get matching words
  • Examples:
    • "sad and happy at the same time" → "bittersweet", "melancholy", "wistful"
    • "fear of missing out" → "FOMO", "anxiety", "envy"
    • "when you're nostalgic for something that hasn't happened yet" → "anemoia"
  • Fully Offline: All data and models run locally
  • Fast: Vector similarity search with FAISS (millisecond query times)

Architecture

  • Data Source: Wiktionary (via kaikki.org) - ~1.7M English word senses
  • Embeddings: sentence-transformers (all-MiniLM-L6-v2) - about 22M parameters (roughly 80-90MB on disk), 384-dim vectors
  • Vector Index: FAISS - Fast approximate nearest neighbor search
  • Storage: SQLite - Word definitions and metadata

Setup

1. Install dependencies

This project uses uv for Python package management:

# If you don't have uv, install it first
# curl -LsSf https://astral.sh/uv/install.sh | sh

# uv resolves and installs the dependencies automatically the first time you use `uv run`

2. Download and build dictionary

# Download Wiktionary data (~2.3GB compressed, ~20GB uncompressed)
# This will take 10-30 minutes depending on your connection
uv run python scripts/download_data.py

# Build the semantic search index
# This will take 15-45 minutes depending on your CPU
uv run python scripts/build_index.py

The build process will:

  1. Create a SQLite database with 500k+ English word definitions
  2. Generate embeddings for all definitions using a neural network
  3. Build a FAISS index for fast similarity search
  4. Save everything to the data/ directory
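
Step 1 of the build can be sketched as follows. This is a minimal illustration, not the project's actual build_index.py: the table name `definitions`, its columns, and the helper `load_jsonl_into_sqlite` are hypothetical, and the record layout (`word`, `pos`, `senses[].glosses`) is an assumption about the JSONL published by kaikki.org.

```python
import json
import sqlite3


def load_jsonl_into_sqlite(lines, conn):
    """Insert one row per (word, gloss) pair and return the number of rows added.

    Assumes each line is a JSON object shaped like
    {"word": ..., "pos": ..., "senses": [{"glosses": [...]}]}.
    """
    # Hypothetical schema: one row per definition (gloss).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS definitions ("
        "id INTEGER PRIMARY KEY, word TEXT, pos TEXT, gloss TEXT)"
    )
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue
        # Entries without any gloss contribute no rows.
        for sense in entry.get("senses", []):
            for gloss in sense.get("glosses", []):
                rows.append((word, entry.get("pos"), gloss))
    conn.executemany(
        "INSERT INTO definitions (word, pos, gloss) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    sample = [
        json.dumps({"word": "bittersweet", "pos": "adj",
                    "senses": [{"glosses": ["Both bitter and sweet."]}]}),
        json.dumps({"word": "nogloss", "pos": "noun", "senses": [{}]}),
    ]
    conn = sqlite3.connect(":memory:")
    print(load_jsonl_into_sqlite(sample, conn))
```

Keeping one row per definition means the row `id` can double as the key that ties a vector in the index back to its word, though the project may organize this differently.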

3. Search!

# Search for words by describing what you mean
uv run python search.py "sad and happy at the same time"

# More examples
uv run python search.py "fear of missing out"
uv run python search.py "feeling lonely in a crowd"
uv run python search.py "when you're nostalgic for something that hasn't happened yet"

Project Structure

fuzzy_dictionary/
├── data/                      # Generated data (not in git)
│   ├── wiktionary_en.jsonl   # English word entries
│   ├── dictionary.db          # SQLite database
│   ├── faiss_index.bin        # FAISS vector index
│   └── word_mappings.pkl      # Index-to-word mappings
├── scripts/
│   ├── download_data.py       # Download Wiktionary data
│   └── build_index.py         # Build search index
├── search.py                  # Main search interface
├── pyproject.toml             # Project configuration
└── README.md                  # This file

How It Works

  1. Data Preparation:

    • Downloads Wiktionary data from kaikki.org (pre-processed JSON)
    • Filters for English entries with definitions
    • Stores in SQLite for easy access
  2. Embedding Generation:

    • Uses sentence-transformers to convert each definition into a 384-dimensional vector
    • These vectors capture semantic meaning, not just keywords
    • Definitions with similar meanings have similar vectors
  3. Semantic Search:

    • Your query is converted to a vector
    • FAISS finds the nearest vectors in the index (cosine similarity, which FAISS typically computes as an inner product over L2-normalized vectors)
    • Returns the corresponding words, ranked by similarity
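
The search step above can be sketched without any third-party libraries. The example below swaps FAISS for a brute-force scan in pure Python and uses toy 3-dimensional vectors in place of real 384-dimensional embeddings. The names `normalize`, `top_k`, and `id_to_word` are hypothetical and only illustrate the idea of ranking by cosine similarity and mapping index positions back to words.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, index_vecs, id_to_word, k=3):
    """Return the k (word, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for i, vec in enumerate(index_vecs):
        # Dot product of two unit vectors is their cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((id_to_word[i], score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors standing in for definition embeddings.
    index_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
    id_to_word = ["alpha", "beta", "gamma"]
    for word, score in top_k([1.0, 0.2, 0.0], index_vecs, id_to_word, k=2):
        print(f"{word}\t{score:.3f}")
```

A brute-force scan like this is linear in the number of definitions. A dedicated vector index exists to make the same ranking fast at the scale of hundreds of thousands of vectors.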

Performance

  • Index Size: ~500MB for embeddings + ~200MB for database
  • Query Time: <100ms for top-10 results
  • Build Time: 15-45 minutes (one-time)
  • Memory Usage: ~2GB during search
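
As a rough cross-check on the index size, the raw storage of an uncompressed vector index is the number of vectors times the dimension times the bytes per value. The sketch below assumes float32 values and a flat (uncompressed) index; neither is confirmed by this README, and `flat_index_bytes` is a hypothetical helper. The result need not match the figure quoted above, which depends on how many definitions are actually indexed and on the index type.

```python
def flat_index_bytes(num_vectors, dim=384, bytes_per_value=4):
    """Raw storage in bytes for an uncompressed index of fixed-width vectors."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    # 500_000 is the README's lower bound on definitions, used here as an example.
    size = flat_index_bytes(500_000)
    print(f"{size / 1e6:.0f} MB")
```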

Examples

# Find emotional states
uv run python search.py "feeling happy and sad at the same time"

# Find descriptive words
uv run python search.py "shiny and smooth like glass"

# Find actions
uv run python search.py "walking slowly without purpose"

# Find specific concepts
uv run python search.py "fear of long words"

Future Enhancements

  • Web interface for easier searching
  • Support for multiple languages
  • Fuzzy word matching for typos in word lookups
  • Pronunciation and audio
  • Etymology and word relationships
  • Mobile app version

Technical Details

Why sentence-transformers?

  • Specifically designed for semantic similarity
  • Pre-trained on 1B+ sentence pairs
  • Compact model (about 22M parameters)
  • Fast inference (thousands of sentences/second)
  • Runs on CPU

Why FAISS?

  • Developed by Meta AI for billion-scale similarity search
  • Extremely fast approximate nearest neighbor search
  • Memory efficient
  • Battle-tested in production systems

Why SQLite?

  • Zero configuration
  • Embedded database (no server needed)
  • Perfect for local-first applications
  • Cross-platform
  • Built-in full-text search (for future features)

License

Data: Wiktionary data is licensed under Creative Commons Attribution-ShareAlike (CC BY-SA)

Code: MIT (to be added)
