A semantic search-powered dictionary that lets you find words by describing their meaning in natural language.
- Semantic Search: Type natural language descriptions and get matching words
- Examples:
- "sad and happy at the same time" → "bittersweet", "melancholy", "wistful"
- "fear of missing out" → "FOMO", "anxiety", "envy"
- "when you're nostalgic for something that hasn't happened yet" → "anemoia"
- Fully Offline: All data and models run locally
- Fast: Vector similarity search with FAISS (millisecond query times)
- Data Source: Wiktionary (via kaikki.org) - ~1.7M English word senses
- Embeddings: sentence-transformers (all-MiniLM-L6-v2) - 22MB model, 384-dim vectors
- Vector Index: FAISS - Fast approximate nearest neighbor search
- Storage: SQLite - Word definitions and metadata
This project uses uv for Python package management:
# If you don't have uv, install it first
# curl -LsSf https://astral.sh/uv/install.sh | sh
# Dependencies will be automatically managed by uv

# Download Wiktionary data (~2.3GB compressed, ~20GB uncompressed)
# This will take 10-30 minutes depending on your connection
uv run python scripts/download_data.py
# Build the semantic search index
# This will take 15-45 minutes depending on your CPU
uv run python scripts/build_index.py

The build process will:
- Create a SQLite database with 500k+ English word definitions
- Generate embeddings for all definitions using a neural network
- Build a FAISS index for fast similarity search
- Save everything to the `data/` directory
# Search for words by describing what you mean
uv run python search.py "sad and happy at the same time"
# More examples
uv run python search.py "fear of missing out"
uv run python search.py "feeling lonely in a crowd"
uv run python search.py "when you're nostalgic for something that hasn't happened yet"

fuzzy_dictionary/
├── data/ # Generated data (not in git)
│ ├── wiktionary_en.jsonl # English word entries
│ ├── dictionary.db # SQLite database
│ ├── faiss_index.bin # FAISS vector index
│ └── word_mappings.pkl # Index-to-word mappings
├── scripts/
│ ├── download_data.py # Download Wiktionary data
│ └── build_index.py # Build search index
├── search.py # Main search interface
├── pyproject.toml # Project configuration
└── README.md # This file
- Data Preparation:
- Downloads Wiktionary data from kaikki.org (pre-processed JSON)
- Filters for English entries with definitions
- Stores in SQLite for easy access
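A minimal sketch of this filtering step, using only the standard library and two made-up sample lines. The field names (`word`, `lang`, `pos`, `senses[].glosses`) follow the kaikki.org export format; the table layout is illustrative, not necessarily the project's actual schema:

```python
import json
import sqlite3

# Two sample lines in the kaikki.org JSONL format (one JSON object per line);
# the real file holds millions of entries with the same shape.
SAMPLE_JSONL = """\
{"word": "bittersweet", "lang": "English", "pos": "adj", "senses": [{"glosses": ["Both bitter and sweet; sad and happy at the same time."]}]}
{"word": "doux-amer", "lang": "French", "pos": "adj", "senses": [{"glosses": ["bittersweet"]}]}
"""

db = sqlite3.connect(":memory:")  # the real script would write data/dictionary.db
db.execute("CREATE TABLE definitions (id INTEGER PRIMARY KEY, word TEXT, pos TEXT, gloss TEXT)")

for line in SAMPLE_JSONL.splitlines():
    entry = json.loads(line)
    if entry.get("lang") != "English":  # keep English entries only
        continue
    for sense in entry.get("senses", []):
        for gloss in sense.get("glosses", []):
            db.execute(
                "INSERT INTO definitions (word, pos, gloss) VALUES (?, ?, ?)",
                (entry["word"], entry.get("pos"), gloss),
            )
db.commit()
```

The French entry is dropped by the language filter, so only the `bittersweet` gloss lands in the database.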
- Embedding Generation:
- Uses sentence-transformers to convert each definition into a 384-dimensional vector
- These vectors capture semantic meaning, not just keywords
- Definitions with similar meanings have similar vectors
- Semantic Search:
- Your query is converted to a vector
- FAISS finds the nearest vectors in the index (cosine similarity)
- Returns the corresponding words, ranked by similarity
- Index Size: ~500MB for embeddings + ~200MB for database
- Query Time: <100ms for top-10 results
- Build Time: 15-45 minutes (one-time)
- Memory Usage: ~2GB during search
# Find emotional states
uv run python search.py "feeling happy and sad at the same time"
# Find descriptive words
uv run python search.py "shiny and smooth like glass"
# Find actions
uv run python search.py "walking slowly without purpose"
# Find specific concepts
uv run python search.py "fear of long words"- Web interface for easier searching
uv run python search.py "fear of long words"

- Web interface for easier searching
- Support for multiple languages
- Fuzzy word matching for typos in word lookups
- Pronunciation and audio
- Etymology and word relationships
- Mobile app version
- Specifically designed for semantic similarity
- Pre-trained on 1B+ sentence pairs
- Compact model size (22MB)
- Fast inference (thousands of sentences/second)
- Runs on CPU
- Developed by Meta AI for billion-scale similarity search
- Extremely fast approximate nearest neighbor search
- Memory efficient
- Battle-tested in production systems
- Zero configuration
- Embedded database (no server needed)
- Perfect for local-first applications
- Cross-platform
- Built-in full-text search (for future features)
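A quick sketch of that built-in full-text search, assuming the Python build ships SQLite's FTS5 extension (most do). The table and data here are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table: SQLite's built-in full-text index, useful for
# classic keyword lookups alongside the vector search.
db.execute("CREATE VIRTUAL TABLE defs USING fts5(word, gloss)")
db.executemany("INSERT INTO defs VALUES (?, ?)", [
    ("bittersweet", "both bitter and sweet; pleasant but tinged with sadness"),
    ("wistful", "full of longing or yearning, often with sadness"),
])

# MATCH runs a tokenized full-text query against the indexed columns.
hits = db.execute("SELECT word FROM defs WHERE defs MATCH 'sadness'").fetchall()
print(hits)
```

Unlike the embedding search, this only matches literal tokens, so it complements rather than replaces the semantic index.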
Data: Wiktionary content is licensed under Creative Commons (CC BY-SA)
Code: MIT (to be added)