A Rust-based tool that creates a comprehensive database of French first names and last names by processing death records from INSEE (French National Institute of Statistics and Economic Studies).
If you appreciate my work, please consider giving it a star! 🤩
The tool was created to extract and normalize first names and last names from INSEE death records. The data was extracted to build a dataset of realistic names found in France, intended for machine learning and deep learning.
A simple demonstration of a use case for the extracted data is available in the username_generator directory. It generates random usernames from the extracted first names and last names, weighting each name by its number of occurrences in the database. The Vue 3 application is deployed on GitHub Pages: https://sctg-development.github.io/french-names-extractor/
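The occurrence-weighted selection described above can be sketched in Rust as follows. This is a minimal illustration and not the code of the Vue 3 application: the names `WeightedName` and `pick_weighted` are hypothetical, and the random value `roll` is supplied by the caller because the Rust standard library has no random number generator.

```rust
/// A name together with how many times it occurs in the extracted data.
/// (Hypothetical type, introduced only for this sketch.)
struct WeightedName<'a> {
    name: &'a str,
    occurrences: u64,
}

/// Picks a name with probability proportional to its occurrence count.
/// `roll` is a caller-supplied random value in [0, 1). Returns `None` when
/// the list is empty or every count is zero.
fn pick_weighted<'a>(names: &[WeightedName<'a>], roll: f64) -> Option<&'a str> {
    let total: u64 = names.iter().map(|n| n.occurrences).sum();
    if total == 0 {
        return None;
    }
    // Map the roll onto [0, total) and walk the cumulative counts.
    let mut target = ((roll.clamp(0.0, 1.0) * total as f64) as u64).min(total - 1);
    for n in names {
        if target < n.occurrences {
            return Some(n.name);
        }
        target -= n.occurrences;
    }
    None
}
```

With this scheme a name that occurs three times as often as another is drawn three times as often on average. For long lists, precomputing the cumulative counts and using a binary search would avoid the linear walk on every draw.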
The tool ignores the following cases:
- Names with one character
- Names made up of a single repeated character
- Names with only one occurrence (default behaviour, controlled by the -m option)
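The rules above can be expressed as a small predicate. The sketch below is an illustration under stated assumptions, not the tool's actual implementation: `is_ignored` and `is_stored` are hypothetical names, names are assumed to be already normalized (the comparison is case-sensitive), and treating an empty string as ignored is a choice made here.

```rust
/// Returns true when a name should be skipped: it has exactly one
/// character, or all of its characters are identical.
fn is_ignored(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        // ASSUMPTION: an empty name is also ignored.
        None => true,
        // True for a single character (nothing left to compare) and for
        // names whose remaining characters all equal the first one.
        Some(first) => chars.all(|c| c == first),
    }
}

/// Decides whether a counted name is kept in the output.
/// `multiple` mirrors the -m option: when true, names seen only once
/// are dropped.
fn is_stored(name: &str, occurrences: u64, multiple: bool) -> bool {
    !is_ignored(name) && (!multiple || occurrences > 1)
}
```

The occurrence threshold can only be applied after all files have been counted, whereas the character checks can run on each record as it is read.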
Features:
- Processes multiple CSV files from INSEE death records
- Extracts and normalizes first names and last names
- Records gender information for first names
- Counts occurrences of each name
- Generates structured JSON output files
- Handles special cases and data cleanup
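The extraction and counting steps listed above might look like the following sketch. It rests on assumptions that are not stated in this document: the name field is taken to have the form `DUPONT*JEAN PIERRE/` (last name, `*`, space-separated first names, trailing `/`), normalization is reduced to lowercasing, every given name is counted, and first names are keyed by (name, sex code). The function names are hypothetical and the real tool may differ on each of these points.

```rust
use std::collections::HashMap;

/// Splits a raw name field into a lowercase last name and lowercase
/// first names.
/// ASSUMPTION: the field looks like "DUPONT*JEAN PIERRE/".
fn split_name_field(field: &str) -> Option<(String, Vec<String>)> {
    let (last, firsts) = field.trim().trim_end_matches('/').split_once('*')?;
    let last = last.trim().to_lowercase();
    if last.is_empty() {
        return None;
    }
    let firsts = firsts.split_whitespace().map(str::to_lowercase).collect();
    Some((last, firsts))
}

/// Counts last names and (first name, sex code) pairs.
/// Each record is (raw name field, sex code as found in the input).
fn count_names(
    records: &[(&str, u8)],
) -> (HashMap<String, u64>, HashMap<(String, u8), u64>) {
    let mut lastnames = HashMap::new();
    let mut firstnames = HashMap::new();
    for &(field, sexe) in records {
        // Records that do not match the assumed layout are skipped.
        let Some((last, firsts)) = split_name_field(field) else {
            continue;
        };
        *lastnames.entry(last).or_insert(0) += 1;
        for first in firsts {
            *firstnames.entry((first, sexe)).or_insert(0) += 1;
        }
    }
    (lastnames, firstnames)
}
```

Keeping the counts in hash maps lets several input files be processed in sequence while accumulating into the same totals.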
Requirements:
- Rust 1.70 or higher
- Cargo package manager
# Clone the repository
git clone https://github.com/sctg-development/french-names-extractor
cd french-names-extractor
# Build the project
cargo build --release
Command line options:
- -p, --path: directory containing the INSEE CSV files (required)
- -m, --multiple true/false: store only names with more than one occurrence (default: true)
- -c, --csv true/false: also create CSV files (default: false)
- -h, --help: show help information
- -V, --version: display version information
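To make the defaults concrete, here is a minimal standard-library sketch of how the three value-taking options could be parsed. It is not the tool's own parser, which may well rely on a crate such as clap. `Options`, `parse_options` and `parse_bool` are hypothetical names, and -h and -V are deliberately left out.

```rust
/// Options from the list above, with their documented defaults.
#[derive(Debug, PartialEq)]
struct Options {
    path: String,
    multiple: bool,
    csv: bool,
}

/// Parses arguments of the form `-p DIR -m true -c false`.
fn parse_options(args: &[&str]) -> Result<Options, String> {
    let mut path = None;
    let mut multiple = true; // documented default
    let mut csv = false; // documented default
    let mut it = args.iter();
    while let Some(&flag) = it.next() {
        // Every option handled here expects exactly one value.
        let value = *it
            .next()
            .ok_or_else(|| format!("missing value for {flag}"))?;
        match flag {
            "-p" | "--path" => path = Some(value.to_string()),
            "-m" | "--multiple" => multiple = parse_bool(flag, value)?,
            "-c" | "--csv" => csv = parse_bool(flag, value)?,
            other => return Err(format!("unknown option {other}")),
        }
    }
    let path = path.ok_or_else(|| "missing required option --path".to_string())?;
    Ok(Options { path, multiple, csv })
}

/// Accepts only the literal strings "true" and "false".
fn parse_bool(flag: &str, value: &str) -> Result<bool, String> {
    value
        .parse::<bool>()
        .map_err(|_| format!("{flag} expects true or false, got {value}"))
}
```

Passing only `-p` with a directory therefore yields `multiple: true` and `csv: false`, matching the defaults listed above.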
The tool generates two JSON files. In firstnames.json, the sexe field carries the INSEE sex code (1 for male, 2 for female).
firstnames.json
{
  "firstnames": [
    {
      "firstname": "jean",
      "sexe": 1,
      "occurrences": 1822998
    }
  ]
}
lastnames.json
{
  "lastnames": [
    {
      "lastname": "dupont",
      "occurrences": 26339
    }
  ]
}
The firstnames.json, lastnames.json, firstnames.csv and lastnames.csv files in the repository were generated from INSEE death records covering 1970 to September 2024 (inclusive), with the parameter -c true.
The death records data is sourced from INSEE's public database: https://www.insee.fr/fr/information/4769950
The extracted data can be used to build a machine learning dataset for training models that generate realistic French names. Two datasets are published on the Hugging Face Hub and can be loaded with the datasets library:
from datasets import load_dataset

# French first names dataset
first_names = load_dataset("eltorio/french_first_names_insee_2024")
# French last names dataset
last_names = load_dataset("eltorio/french_last_names_insee_2024")
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE.md file for details.
Copyright © 2024 Ronan LE MEILLAT