Fuzzy Dictionary

A semantic search-powered dictionary that lets you find words by describing their meaning in natural language.

Features

Semantic Search: Type natural language descriptions and get matching words
Examples:
- "sad and happy at the same time" → "bittersweet", "melancholy", "wistful"
- "fear of missing out" → "FOMO", "anxiety", "envy"
- "when you're nostalgic for something that hasn't happened yet" → "anemoia"
Fully Offline: After the one-time setup (data download and index build), all data and models run locally
Fast: Vector similarity search with FAISS (millisecond query times)

Architecture

Data Source: Wiktionary (via kaikki.org) - ~1.7M English word senses
Embeddings: sentence-transformers (all-MiniLM-L6-v2) - ~22M-parameter model (roughly 80-90MB on disk), 384-dim vectors
Vector Index: FAISS - Fast approximate nearest neighbor search
Storage: SQLite - Word definitions and metadata

Setup

1. Install dependencies

This project uses uv for Python package management:

# If you don't have uv, install it first
# curl -LsSf https://astral.sh/uv/install.sh | sh

# No separate install step: uv resolves and installs dependencies on the first `uv run`

2. Download and build dictionary

# Download Wiktionary data (~2.3GB compressed, ~20GB uncompressed)
# This will take 10-30 minutes depending on your connection
uv run python scripts/download_data.py

# Build the semantic search index
# This will take 15-45 minutes depending on your CPU
uv run python scripts/build_index.py

The build process will:

Create a SQLite database with 500k+ English word definitions
Generate embeddings for all definitions using a neural network
Build a FAISS index for fast similarity search
Save everything to the data/ directory
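
As a rough illustration of the first build step, the sketch below reads JSONL entries, keeps English senses that have a gloss, and writes them to SQLite. It is not the project's build_index.py: the field names (word, pos, lang_code, senses, glosses) are assumed from kaikki.org's JSONL export, and the definitions table and both function names are hypothetical.

```python
import json
import sqlite3


def iter_definitions(lines):
    """Yield (word, pos, definition) for each English sense that has a gloss."""
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        # Assumed field names, modelled on kaikki.org's JSONL export.
        if entry.get("lang_code") != "en":
            continue
        word = entry.get("word")
        if not word:
            continue
        for sense in entry.get("senses", []):
            glosses = sense.get("glosses") or []
            if glosses:
                yield word, entry.get("pos", ""), glosses[0]


def build_database(lines, db_path=":memory:"):
    """Create a (hypothetical) definitions table and fill it from JSONL lines."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS definitions ("
        "id INTEGER PRIMARY KEY, word TEXT NOT NULL, "
        "pos TEXT, definition TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO definitions (word, pos, definition) VALUES (?, ?, ?)",
        iter_definitions(lines),
    )
    conn.commit()
    return conn


sample = [
    '{"word": "bittersweet", "pos": "adj", "lang_code": "en", '
    '"senses": [{"glosses": ["Both bitter and sweet."]}]}',
    '{"word": "stub", "pos": "noun", "lang_code": "en", "senses": [{"tags": ["no-gloss"]}]}',
    '{"word": "chat", "pos": "noun", "lang_code": "fr", "senses": [{"glosses": ["cat"]}]}',
]
conn = build_database(sample)
rows = conn.execute("SELECT word, pos, definition FROM definitions").fetchall()
print(rows)
```

Passing a generator to executemany keeps memory use flat, which matters when the input file is tens of gigabytes.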

3. Search!

# Search for words by describing what you mean
uv run python search.py "sad and happy at the same time"

# More examples
uv run python search.py "fear of missing out"
uv run python search.py "feeling lonely in a crowd"
uv run python search.py "when you're nostalgic for something that hasn't happened yet"

Project Structure

fuzzy_dictionary/
├── data/                      # Generated data (not in git)
│   ├── wiktionary_en.jsonl   # English word entries
│   ├── dictionary.db          # SQLite database
│   ├── faiss_index.bin        # FAISS vector index
│   └── word_mappings.pkl      # Index-to-word mappings
├── scripts/
│   ├── download_data.py       # Download Wiktionary data
│   └── build_index.py         # Build search index
├── search.py                  # Main search interface
├── pyproject.toml             # Project configuration
└── README.md                  # This file
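
A vector index returns row positions rather than words, which is why a mapping file sits next to the index. The sketch below shows one plausible shape for word_mappings.pkl, assuming it is a pickled list whose position i holds the word for index row i. The actual layout is not documented here, and the helper names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_mappings(path, words):
    """Persist a list whose position i holds the word for index row i."""
    with open(path, "wb") as f:
        pickle.dump(words, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_mappings(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def rows_to_words(mappings, row_ids):
    """Translate index row ids into words.

    FAISS pads its results with -1 when fewer than k neighbours exist,
    so out-of-range ids are skipped.
    """
    return [mappings[i] for i in row_ids if 0 <= i < len(mappings)]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "word_mappings.pkl"
    save_mappings(path, ["bittersweet", "wistful", "melancholy"])
    mappings = load_mappings(path)

words = rows_to_words(mappings, [2, 0, -1])
print(words)
```

Pickle files can execute arbitrary code when loaded, so only load mapping files you built yourself.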

How It Works

Data Preparation:
- Downloads Wiktionary data from kaikki.org (pre-processed JSON)
- Filters for English entries with definitions
- Stores in SQLite for easy access
Embedding Generation:
- Uses sentence-transformers to convert each definition into a 384-dimensional vector
- These vectors capture semantic meaning, not just keywords
- Definitions with similar meanings have similar vectors
Semantic Search:
- Your query is converted to a vector
- FAISS finds the nearest vectors in the index (cosine similarity)
- Returns the corresponding words, ranked by similarity
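
To make the ranking step concrete, here is a pure-Python stand-in for the similarity search. It is a brute-force sketch, not FAISS: it normalizes the vectors and ranks by inner product, which is equivalent to cosine similarity. The toy 3-dimensional vectors and the word list are made up for illustration; real embeddings would have 384 dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, k=10):
    """Return the k (score, row_id) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = []
    for row_id, vec in enumerate(vectors):
        v = normalize(vec)
        # Inner product of unit vectors equals their cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, v)), row_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: made-up vectors standing in for definition embeddings.
words = ["bittersweet", "wistful", "glassy"]
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

hits = top_k([1.0, 0.05, 0.0], vectors, k=2)
results = [words[row_id] for _, row_id in hits]
print(results)
```

A library index performs the same ranking far faster and, for approximate index types, without scanning every vector.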

Performance

Index Size: ~500MB for embeddings + ~200MB for database
Query Time: <100ms for top-10 results
Build Time: 15-45 minutes (one-time)
Memory Usage: ~2GB during search
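
For a rough feel of how index size scales, the sketch below estimates the raw storage for a given number of vectors. It assumes an uncompressed flat index storing each 384-dimensional vector as float32; whether this project's index is of that type is not stated here, and the vector count in the example is purely illustrative.

```python
def flat_index_bytes(num_vectors, dim=384, bytes_per_value=4):
    """Raw storage for num_vectors float32 vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value


# Illustrative count only, not a measurement from this project.
size_mb = flat_index_bytes(100_000) / 1_000_000
print(f"{size_mb:.1f} MB per 100,000 vectors")
```

The actual on-disk size depends on how many definitions are embedded and on the index type chosen.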

Examples

# Find emotional states
uv run python search.py "feeling happy and sad at the same time"

# Find descriptive words
uv run python search.py "shiny and smooth like glass"

# Find actions
uv run python search.py "walking slowly without purpose"

# Find specific concepts
uv run python search.py "fear of long words"

Future Enhancements

Web interface for easier searching
Support for multiple languages
Fuzzy word matching for typos in word lookups
Pronunciation and audio
Etymology and word relationships
Mobile app version
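
The typo-tolerant lookup listed above is not implemented. As one possible starting point, Python's standard library can already suggest close spellings. The function name, the cutoff, and the tiny vocabulary below are illustrative assumptions.

```python
import difflib


def suggest_words(typed, vocabulary, limit=3, cutoff=0.8):
    """Return up to `limit` vocabulary words that closely resemble `typed`."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {word.lower(): word for word in vocabulary}
    matches = difflib.get_close_matches(
        typed.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


vocabulary = ["bittersweet", "wistful", "melancholy", "FOMO"]
suggestions = suggest_words("bitersweet", vocabulary)
print(suggestions)
```

difflib compares the input against every candidate, so a full dictionary would likely need a pre-filter (for example, by first letter or word length) to stay responsive.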

Technical Details

Why sentence-transformers?

Specifically designed for semantic similarity
Pre-trained on 1B+ sentence pairs
Compact model (~22M parameters for all-MiniLM-L6-v2)
Fast inference (thousands of sentences/second)
Runs on CPU

Why FAISS?

Developed by Meta AI for billion-scale similarity search
Extremely fast approximate nearest neighbor search
Memory efficient
Battle-tested in production systems

Why SQLite?

Zero configuration
Embedded database (no server needed)
Perfect for local-first applications
Cross-platform
Built-in full-text search (for future features)
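
To show how the embedded database fits the search flow, the sketch below looks up definitions for a ranked list of ids and returns them in that same order. The definitions table and its columns are a hypothetical schema, not necessarily the one in dictionary.db.

```python
import sqlite3


def fetch_in_rank_order(conn, ranked_ids):
    """Fetch rows for the given ids and return them in the order supplied."""
    if not ranked_ids:
        return []
    placeholders = ",".join("?" for _ in ranked_ids)
    rows = conn.execute(
        f"SELECT id, word, definition FROM definitions WHERE id IN ({placeholders})",
        list(ranked_ids),
    ).fetchall()
    # SQL does not guarantee result order for IN (...), so reorder explicitly.
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ranked_ids if i in by_id]


# In-memory database: no server and no configuration needed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE definitions (id INTEGER PRIMARY KEY, word TEXT, definition TEXT)"
)
conn.executemany(
    "INSERT INTO definitions (id, word, definition) VALUES (?, ?, ?)",
    [
        (1, "bittersweet", "Both bitter and sweet."),
        (2, "wistful", "Full of longing or yearning."),
        (3, "glassy", "Resembling glass."),
    ],
)

results = fetch_in_rank_order(conn, [3, 1, 99])
print(results)
```

Ids missing from the table (99 in this example) are dropped rather than raising an error.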

License

Data: Wiktionary data is licensed under Creative Commons Attribution-ShareAlike (CC BY-SA)

Code: MIT (to be added)
