NameGraph

Surf more than 21 million name ideas across more than 400,000 name collections,
or generate infinite related name suggestions.

Project Status

NameGraph is currently in beta. We are excited to share our work with you and continue to build the greatest web of names in history!

Overview

NameGraph is a web service that generates name suggestions for a given input label. It is implemented using FastAPI and provides a variety of endpoints to generate suggestions in different modes and with different parameters.

Label Analysis

The input label is analyzed to determine the most relevant name suggestions. The analysis includes:

Defining all possible interpretations of the input label along with their probabilities (whether it is a sequence of common words or a person name, which language it is in, etc.)
For each interpretation, determining the most probable tokenizations (e.g. armstrong -> ["armstrong"], armstrong -> ["arm", "strong"])

Suggestions are then generated from these interpretations. Tokenizations are especially important because many generators rely heavily on them, which is why the endpoints also accept pretokenized input.
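To make the tokenization step concrete, here is a minimal sketch that enumerates every way to split a label into known words. It is not NameGraph's implementation: the function name tokenizations and the toy vocabulary are assumptions for illustration, and the real analysis also assigns probabilities, which this sketch omits.

```python
from typing import List, Set


def tokenizations(label: str, vocabulary: Set[str]) -> List[List[str]]:
    """Return every way to split `label` into words found in `vocabulary`."""
    if not label:
        return [[]]  # the empty string has exactly one tokenization: no tokens
    results: List[List[str]] = []
    for end in range(1, len(label) + 1):
        head = label[:end]
        if head in vocabulary:
            # Keep this prefix as a token and tokenize the remainder recursively.
            for tail in tokenizations(label[end:], vocabulary):
                results.append([head] + tail)
    return results


# Hypothetical toy vocabulary; the real service relies on downloaded dictionaries.
VOCAB = {"arm", "strong", "armstrong"}

if __name__ == "__main__":
    for tokens in tokenizations("armstrong", VOCAB):
        print(tokens)
```

With this toy vocabulary the label armstrong yields the two tokenizations used as examples above. A label with no valid split yields an empty list.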

Collections

Collections are curated sets of names that serve as a core component of NameGraph's name suggestion system. The system maintains a vast database of over 400,000 name collections containing more than 21 million unique names. Each collection is stored in Elasticsearch and contains:

A unique collection ID
Collection title and description
Collection rank and metadata
Member names with their normalized and tokenized forms
Collection types and categories
Related collections

Collections are used in several key ways:

Direct Name Generation:
- Searches collections based on input tokens
- Uses learning-to-rank models to find relevant collections
Related Collections:
- Finds collections with similar themes and content
- Ensures diverse suggestions across different categories
Membership Lookup:
- Discovers collections containing specific names
- Enables finding thematically related names

The collections are maintained and updated through our NameGraph Collections project, ensuring the suggestion database stays current and comprehensive.
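The membership lookup idea can be sketched in memory, without Elasticsearch. The Collection dataclass, its field names, and the sample data below are hypothetical and do not reflect the actual index mapping; the sketch only shows how an inverted index from names to collections leads to thematically related names.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Collection:
    # Illustrative fields only, not the real Elasticsearch document schema.
    collection_id: str
    title: str
    rank: int
    members: List[str] = field(default_factory=list)


def build_membership_index(collections: List[Collection]) -> Dict[str, List[str]]:
    """Map each member name to the IDs of the collections that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for collection in collections:
        for name in collection.members:
            index[name].append(collection.collection_id)
    return dict(index)


def related_names(name: str, collections: List[Collection]) -> List[str]:
    """Names sharing at least one collection with `name`, in first-seen order."""
    index = build_membership_index(collections)
    by_id = {c.collection_id: c for c in collections}
    seen = {name}
    related: List[str] = []
    for collection_id in index.get(name, []):
        for member in by_id[collection_id].members:
            if member not in seen:
                seen.add(member)
                related.append(member)
    return related


# Made-up sample data for the demonstration.
SAMPLE = [
    Collection("c1", "Apollo 11 crew", 1, ["armstrong", "aldrin", "collins"]),
    Collection("c2", "Famous cyclists", 2, ["armstrong", "merckx"]),
]

if __name__ == "__main__":
    print(related_names("armstrong", SAMPLE))
```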

Generators

Generators are core components that create name suggestions through different methods. Each generator inherits from the base NameGenerator class and implements a specific name generation strategy. They can be grouped into the categories shown in the generators diagram:

[Diagram: generator categories and the modes in which each generator is available]

Modes

NameGraph supports three modes for processing requests:

Instant Mode (instant):
- Fastest response time
- More basic name generations
- Some advanced generators like W2VGenerator are disabled (weight multiplier = 0)
- Often used for real-time suggestions
Domain Detail Mode (domain_detail):
- Intermediate between instant and full
- More comprehensive than instant, but still optimized for performance
- Some generators have reduced weights compared to full mode
- Expanded search window for collection ranking and sampling
Full Mode (full):
- Most comprehensive name generation
- Includes all enabled generators
- Uses full weights for most generators
- Accesses advanced generators like Wikipedia2VGenerator and W2VGenerator
- Takes longer to process, but provides the most diverse results

Different generators are enabled/disabled for each mode. Take a look at the generators diagram to see which generators are available in each mode.

Mode	Description
Instant	Fastest response, basic generators only
Domain Detail	Balanced speed/quality, expanded search
Full	Comprehensive generation with all generators
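One way to picture per-mode weight multipliers is a lookup table applied to each generator's base weight. The sketch below is an assumption-laden illustration: only the fact that W2VGenerator has a multiplier of 0 in instant mode comes from the description above. All other numbers, ExampleBasicGenerator, and the function names are invented for the example; the real values live in the pipeline configuration.

```python
from typing import Dict, List

MODES = ("instant", "domain_detail", "full")

# Hypothetical base weights per generator.
BASE_WEIGHTS: Dict[str, float] = {
    "W2VGenerator": 2.0,
    "Wikipedia2VGenerator": 1.5,
    "ExampleBasicGenerator": 1.0,  # made-up generator name
}

# Hypothetical multipliers; a multiplier of 0 disables a generator in that mode.
MODE_MULTIPLIERS: Dict[str, Dict[str, float]] = {
    "W2VGenerator": {"instant": 0.0, "domain_detail": 0.5, "full": 1.0},
    "Wikipedia2VGenerator": {"instant": 0.0, "domain_detail": 0.5, "full": 1.0},
    "ExampleBasicGenerator": {"instant": 1.0, "domain_detail": 1.0, "full": 1.0},
}


def effective_weights(mode: str) -> Dict[str, float]:
    """Base weight times the mode multiplier, for every generator."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        name: BASE_WEIGHTS[name] * MODE_MULTIPLIERS[name][mode]
        for name in BASE_WEIGHTS
    }


def active_generators(mode: str) -> List[str]:
    """Generators whose effective weight is positive in the given mode."""
    return sorted(name for name, w in effective_weights(mode).items() if w > 0)


if __name__ == "__main__":
    for mode in MODES:
        print(mode, active_generators(mode))
```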

Sampler

The sampler manages the selection and generation of name suggestions. It implements a probabilistic sampling algorithm that balances diversity, relevance, and efficiency while respecting the constraints set by the request parameters.

Key Components

Request Parameters:
- mode: Determines which generators are active (instant/domain_detail/full)
- min_suggestions: Minimum number of suggestions to return
- max_suggestions: Maximum number of suggestions to return
- min_available_fraction: Minimum fraction of suggestions that must be available
Interpretations: Each input name can have multiple interpretations, characterized by:
- Type (ngram, person, other)
- Language
- Probability score
- Possible tokenizations
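The request parameters and interpretations listed above could be modeled roughly as follows. This is a sketch using standard-library dataclasses, not the service's actual request schema (see the API docs for that); the class names, default values, and validation rules are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

VALID_MODES = {"instant", "domain_detail", "full"}


@dataclass(frozen=True)
class SamplerParams:
    mode: str = "full"
    min_suggestions: int = 10            # hypothetical default
    max_suggestions: int = 100           # hypothetical default
    min_available_fraction: float = 0.1  # hypothetical default

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real service may validate differently.
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not 0 <= self.min_suggestions <= self.max_suggestions:
            raise ValueError("need 0 <= min_suggestions <= max_suggestions")
        if not 0.0 <= self.min_available_fraction <= 1.0:
            raise ValueError("min_available_fraction must be within [0, 1]")


@dataclass
class Interpretation:
    type: str          # "ngram", "person", or "other"
    language: str
    probability: float
    tokenizations: List[List[str]] = field(default_factory=list)


if __name__ == "__main__":
    params = SamplerParams(mode="instant", min_suggestions=5, max_suggestions=20)
    guess = Interpretation("ngram", "en", 0.7, [["arm", "strong"], ["armstrong"]])
    print(params)
    print(guess)
```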

Sampling Algorithm

The sampler uses a probabilistic approach to generate diverse and relevant name suggestions:

flowchart TD
    A[Start] --> B{Enough suggestions?}
    B -->|Yes| Z[End]
    B -->|No| C{All probabilities = 0?}
    C -->|Yes| Z
    C -->|No| D[Sample type & language]
    D --> E["Sample tokenization"]
    E --> F[Sample pipeline]
    F --> G{Pipeline exceeds limit?}
    G -->|Yes| F
    G -->|No| H[Get suggestion from pipeline]
    H --> I{Any suggestions left?}
    I -->|Yes| J{Already sampled?}
    I -->|No| F
    J -->|Yes| H
    J -->|No| K{Available if required?}
    K -->|No| H
    K -->|Yes| L{Normalized?}
    L -->|No| H
    L -->|Yes| B

The algorithm works as follows:

Initialization: For each type-language pair, pipeline probabilities are computed.
Main Loop: The sampler iterates until either:
- Enough suggestions are generated (max_suggestions met)
- All pipeline probabilities become zero
Sampling Process:
- First samples a type and language pair
- Then samples a specific tokenization within that pair
- Selects a pipeline using probability-based sampling
- First pass uses sampling without replacement for diversity
Validation Checks:
- Verifies pipeline hasn't exceeded its global limit
- Ensures suggestions aren't duplicates
- Checks availability status if required
- Confirms normalization status
Pipeline Management:
- Exhausted pipelines are removed from the sampling pool
- When a pipeline can't generate more suggestions, falls back to other pipelines

This approach ensures a balanced mix of suggestions while maintaining efficiency and respecting all configured constraints.
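The main loop can be sketched in simplified form. The code below is an illustration under several assumptions, not the actual sampler: it samples pipelines directly by weight and leaves out the type/language and tokenization sampling, the first pass without replacement, and the availability and normalization checks. The Pipeline class and all names in it are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Optional


class Pipeline:
    """Hypothetical stand-in for a generator pipeline."""

    def __init__(self, name: str, weight: float, candidates: List[str], limit: int):
        self.name = name
        self.weight = weight
        self.limit = limit    # global cap on accepted suggestions
        self.accepted = 0
        self._candidates: Iterator[str] = iter(candidates)

    def next_suggestion(self) -> Optional[str]:
        return next(self._candidates, None)


def sample_suggestions(pipelines: List[Pipeline], max_suggestions: int,
                       rng: random.Random) -> List[str]:
    weights: Dict[str, float] = {p.name: p.weight for p in pipelines}
    by_name = {p.name: p for p in pipelines}
    chosen: List[str] = []
    seen = set()
    # Stop when enough suggestions exist or every pipeline weight is zero.
    while len(chosen) < max_suggestions and any(w > 0 for w in weights.values()):
        names = list(weights)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        pipeline = by_name[name]
        if pipeline.accepted >= pipeline.limit:
            weights[name] = 0.0  # limit reached: drop from the sampling pool
            continue
        suggestion = pipeline.next_suggestion()
        if suggestion is None:
            weights[name] = 0.0  # exhausted: fall back to other pipelines
            continue
        if suggestion in seen:
            continue             # duplicate: draw again
        seen.add(suggestion)
        chosen.append(suggestion)
        pipeline.accepted += 1
    return chosen


def make_demo_pipelines() -> List[Pipeline]:
    # Made-up pipelines and candidates for the demonstration.
    return [
        Pipeline("a", 0.7, ["a1", "a2", "shared"], limit=3),
        Pipeline("b", 0.3, ["shared", "b1"], limit=5),
    ]


if __name__ == "__main__":
    print(sample_suggestions(make_demo_pipelines(), 10, random.Random(0)))
```

Every iteration either consumes a candidate or zeroes a pipeline's weight, so the loop always terminates. The order of the returned names depends on the random draws, but duplicates never appear.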

Usage

NameGraph uses Poetry for dependency management and packaging. Before getting started, make sure you have Poetry installed on your system.

Prerequisites

Install Poetry if you haven't already:

curl -sSL https://install.python-poetry.org | python3 -

See the Poetry installation guide for more details.

Install

Clone the repository and install dependencies:

git clone https://github.com/namehash/namegraph.git
cd namegraph
poetry install

Download resources

Additional resources need to be downloaded. Run these commands within the Poetry environment:

poetry run python download.py  # dictionaries, embeddings
poetry run python download_names.py

Configuration

NameGraph uses Hydra, a framework for configuring complex applications. The configuration is stored in the conf/ directory and includes:

Main configuration files (prod_config_new.yaml, test_config_new.yaml) with core settings like connections, filters, limits, and paths
Pipeline configurations in conf/pipelines/ defining generators, modes, categories, and language settings

The configuration is highly modular and can be easily modified to adjust the behavior of name generation, filtering, and ranking systems.

REST API

Start server using Poetry:

poetry run uvicorn web_api:app --reload

Query with POST:

curl -d '{"label":"armstrong"}' -H "Content-Type: application/json" -X POST http://localhost:8000

Query with POST (pretokenized input):

curl -d '{"label":"\"arm strong\""}' -H "Content-Type: application/json" -X POST http://localhost:8000

Note: pretokenized input should be wrapped in double quotes.
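The same two queries can be sent from Python using only the standard library. This is a small sketch that mirrors the curl commands above; the helper names build_request and query are made up for the example, and the response is assumed to be JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local development server from the curl examples


def build_request(label: str, pretokenized: bool = False,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request; pretokenized labels are wrapped in double quotes."""
    if pretokenized:
        label = f'"{label}"'
    body = json.dumps({"label": label}).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(label: str, pretokenized: bool = False, timeout: float = 30.0):
    """Send the request to a running server and decode the JSON response."""
    request = build_request(label, pretokenized)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the server started with uvicorn as shown above.
    print(query("armstrong"))
    print(query("arm strong", pretokenized=True))
```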

Documentation

The API documentation is available at /docs or /redoc when the server is running. These are auto-generated Swagger/OpenAPI docs provided by FastAPI that allow you to:

View all available endpoints
See request/response schemas
See descriptions of each parameter and response field
Test API calls directly from the browser

Public API documentation is available at api.namegraph.dev/docs.

Tests

Run tests using Poetry:

poetry run pytest

Tests that interact with external services (Elasticsearch) are marked with the integration_test marker and are disabled by default. To run them, define the environment variables needed to access Elasticsearch, then use:

poetry run pytest -m "integration_test"

Learning-To-Rank

To use the Learning-To-Rank (LTR) features, you need to configure them in the Elasticsearch instance (see here for more details).
