A system that identifies which Farcaster user (FID) was the first to use specific words on the network, inspired by the NYT "First Said" feature.
This project processes a 157M-row dataset of Farcaster casts to:
- Extract and tokenize words from each cast
- Identify which FID said each word first
- Provide a simple query interface to look up the history of first usages
The system filters out stop words and very common words to focus on meaningful "first said" instances.
- Python 3.8+
- 32GB RAM recommended
- ~200GB disk space for processing
- Dependencies: duckdb, pandas, pyarrow, nltk, tqdm
# Clone the repository
git clone <repository-url>
cd farcaster-first-said
# Install dependencies
pip install -r requirements.txt
# Download NLTK resources
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('omw-1.4')"
The system consists of several scripts that handle different phases of the process:
python process_casts.py /path/to/farcaster_casts.parquet
This script:
- Loads the parquet file of Farcaster casts
- Processes the casts in chunks to conserve memory
- Tokenizes each cast, applying lemmatization
- Filters out stop words and common words
- Stores the processed tokens in a DuckDB database
Options:
- `--chunk-size`: Number of rows to process at once (default: 100000)
- `--db-path`: Path to the DuckDB database (default: farcaster.db)
- `--reset`: Ignore the checkpoint and start from the beginning
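The core chunked loop might look roughly like the sketch below. It assumes the casts parquet exposes `text`, `fid`, and `timestamp` columns and uses a hypothetical `tokens` staging table; the actual script may differ:

```python
import re

import duckdb
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(text: str):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"[#@]", "", text)          # strip # from hashtags, @ from mentions
    text = re.sub(r"[^\w\s]", " ", text)      # remove standard punctuation
    words = (lemmatizer.lemmatize(w) for w in text.split())
    return [w for w in words if w not in STOP_WORDS]

SRC = "farcaster_casts.parquet"  # hypothetical input path
CHUNK_SIZE = 100_000

con = duckdb.connect("farcaster.db")
con.execute("CREATE TABLE IF NOT EXISTS tokens (word TEXT, fid BIGINT, ts TIMESTAMP)")

offset = 0
while True:
    # Simplified pagination: a real run would checkpoint and use a stable order.
    chunk = con.execute(
        f"SELECT text, fid, timestamp FROM read_parquet('{SRC}') "
        f"LIMIT {CHUNK_SIZE} OFFSET {offset}"
    ).fetch_df()
    if chunk.empty:
        break
    rows = [
        (word, row.fid, row.timestamp)
        for row in chunk.itertuples(index=False)
        for word in set(tokenize(row.text or ""))  # dedupe words within one cast
    ]
    if rows:
        con.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)
    offset += CHUNK_SIZE
```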
python identify_first_said.py
This script:
- Identifies the first usage of each word
- Handles tie-breaking (lowest FID wins if same timestamp)
- Creates optimized tables for both lookup directions
- Exports the data to parquet files for querying
Options:
- `--db-path`: Path to the DuckDB database (default: farcaster.db)
- `--output-dir`: Directory to save output files (default: ./output)
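The heavy lifting can be expressed in a couple of SQL statements. A sketch, assuming a `tokens` table like the one built in the processing step (table and column names are illustrative):

```python
import os

import duckdb

os.makedirs("output", exist_ok=True)
con = duckdb.connect("farcaster.db")

# Earliest usage per word; ties on timestamp are broken by the lowest FID.
con.execute("""
    CREATE OR REPLACE TABLE first_said AS
    SELECT word, fid, ts AS "timestamp"
    FROM (
        SELECT word, fid, ts,
               ROW_NUMBER() OVER (
                   PARTITION BY word ORDER BY ts ASC, fid ASC
               ) AS rn
        FROM tokens
    )
    WHERE rn = 1
""")

# Reverse mapping: each FID to the list of words it said first.
con.execute("""
    CREATE OR REPLACE TABLE fid_to_words AS
    SELECT fid,
           LIST(word) AS words,
           LIST("timestamp") AS timestamps
    FROM first_said
    GROUP BY fid
""")

# Export both lookup directions to parquet.
con.execute("COPY first_said TO 'output/first_said.parquet' (FORMAT PARQUET)")
con.execute("COPY fid_to_words TO 'output/fid_to_words.parquet' (FORMAT PARQUET)")
```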
python first_said_finder.py [command] [arguments]
Commands:
- `word [word]`: Find which FID first said a given word
- `fid [fid]`: Find all words first said by a given FID
- `search [pattern]`: Search for words matching a pattern
- `stats`: Show statistics about the dataset
Examples:
# Find who first said "ethereum"
python first_said_finder.py word ethereum
# Find all words first said by FID 1
python first_said_finder.py fid 1
# Search for words containing "meta"
python first_said_finder.py search meta
# Show dataset statistics
python first_said_finder.py stats
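Under the hood, a lookup can stay lazy by querying the exported parquet file directly rather than loading it into memory. A minimal sketch of the `word` command (paths and output format are assumptions, not the script's actual behavior):

```python
import sys

import duckdb

def lookup_word(word: str, path: str = "output/first_said.parquet"):
    """Return (fid, timestamp) for the first use of `word`, or None."""
    # DuckDB scans the parquet lazily, so only the matching row is materialized.
    return duckdb.execute(
        f"""SELECT fid, "timestamp" FROM read_parquet('{path}') WHERE word = ?""",
        [word],
    ).fetchone()

if __name__ == "__main__":
    hit = lookup_word(sys.argv[1])
    print(f"first said by FID {hit[0]} at {hit[1]}" if hit else "word not found")
```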
- Text Normalization:
  - Convert to lowercase
  - Remove URLs
  - Strip # from hashtags and @ from mentions
  - Remove standard punctuation
- Tokenization & Filtering:
  - Split the text into words
  - Apply lemmatization using NLTK's WordNetLemmatizer
  - Filter out stop words and other very common words
- First-Said Identification:
  - For each word, find the earliest timestamp at which it was used
  - If multiple FIDs used a word at the exact same timestamp, the lowest FID is considered the "first"
  - Create bidirectional mappings:
    - word → FID (who said it first)
    - FID → list of words (what words they said first)
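As a standalone illustration of the normalization rules above (not the project's exact code):

```python
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"[#@]", "", text)          # strip hashtag/mention markers
    return re.sub(r"[^\w\s]", " ", text)      # remove standard punctuation

print(normalize("Check out #Ethereum via @dwr: https://farcaster.xyz!").split())
# ['check', 'out', 'ethereum', 'via', 'dwr']
```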
After processing, two main parquet files are created:
- `first_said.parquet`: Maps each word to the FID that said it first
  - Columns: word, fid, timestamp
- `fid_to_words.parquet`: Maps each FID to the list of words they said first
  - Columns: fid, words (array), timestamps (array)
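The exports can be inspected directly with pandas (paths assume the default `./output` directory):

```python
import pandas as pd

first = pd.read_parquet("output/first_said.parquet")     # word, fid, timestamp
by_fid = pd.read_parquet("output/fid_to_words.parquet")  # fid, words, timestamps

print(first.head())
row = by_fid[by_fid["fid"] == 1]
if not row.empty:
    print(row["words"].iloc[0][:10])  # first ten words first said by FID 1
```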
- Processing 157M casts requires significant RAM and CPU resources
- The chunking strategy allows processing on machines with limited memory
- Checkpointing allows resuming processing if interrupted
- The query interface uses lazy loading to minimize memory usage
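Checkpointing can be as simple as persisting the last completed chunk offset. A sketch of one possible approach (not necessarily what `process_casts.py` does):

```python
from pathlib import Path

CHECKPOINT = Path("checkpoint.txt")  # hypothetical checkpoint file

def load_offset() -> int:
    # Resume from the last recorded offset, or start from zero.
    return int(CHECKPOINT.read_text()) if CHECKPOINT.exists() else 0

def save_offset(offset: int) -> None:
    # Record progress after each committed chunk.
    CHECKPOINT.write_text(str(offset))
```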
Contributions are welcome! Please feel free to submit a Pull Request.
- NYT First Said project for the inspiration
- Farcaster team for building an amazing protocol