A system that identifies which Farcaster user (FID) was the first to use specific words on the network, inspired by the NYT "First Said" feature.
This project processes a 157M-row dataset of Farcaster casts to:
- Extract and tokenize words from each cast
- Identify which FID said each word first
- Provide a simple query interface to look up the history of first usages
The system filters out stop words and very common words to focus on meaningful "first said" instances.
- Python 3.8+
- 32GB RAM recommended
- ~200GB disk space for processing
- Dependencies: duckdb, pandas, pyarrow, nltk, tqdm
# Clone the repository
git clone <repository-url>
cd farcaster-first-said
# Install dependencies
pip install -r requirements.txt
# Download NLTK resources
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('omw-1.4')"
The system consists of several scripts that handle different phases of the process:
python process_casts.py /path/to/farcaster_casts.parquet
This script:
- Loads the parquet file of Farcaster casts
- Processes the casts in chunks to conserve memory
- Tokenizes each cast, applying lemmatization
- Filters out stop words and common words
- Stores the processed tokens in a DuckDB database
Options:
- `--chunk-size`: Number of rows to process at once (default: 100000)
- `--db-path`: Path to the DuckDB database (default: farcaster.db)
- `--reset`: Ignore the checkpoint and start from the beginning
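The core chunked loop might look roughly like the sketch below. It assumes the casts parquet exposes `text`, `fid`, and `timestamp` columns and uses a hypothetical `tokens` staging table; the actual script may differ:

```python
import re

import duckdb
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(text: str):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"[#@]", "", text)          # strip # from hashtags, @ from mentions
    text = re.sub(r"[^\w\s]", " ", text)      # remove standard punctuation
    words = (lemmatizer.lemmatize(w) for w in text.split())
    return [w for w in words if w not in STOP_WORDS]

SRC = "farcaster_casts.parquet"  # hypothetical input path
CHUNK_SIZE = 100_000

con = duckdb.connect("farcaster.db")
con.execute("CREATE TABLE IF NOT EXISTS tokens (word TEXT, fid BIGINT, ts TIMESTAMP)")

offset = 0
while True:
    # Simplified pagination: a real run would checkpoint and use a stable order.
    chunk = con.execute(
        f"SELECT text, fid, timestamp FROM read_parquet('{SRC}') "
        f"LIMIT {CHUNK_SIZE} OFFSET {offset}"
    ).fetch_df()
    if chunk.empty:
        break
    rows = [
        (word, row.fid, row.timestamp)
        for row in chunk.itertuples(index=False)
        for word in set(tokenize(row.text or ""))  # dedupe words within one cast
    ]
    if rows:
        con.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)
    offset += CHUNK_SIZE
```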
python identify_first_said.py
This script:
- Identifies the first usage of each word
- Handles tie-breaking (lowest FID wins if same timestamp)
- Creates optimized tables for both lookup directions
- Exports the data to parquet files for querying
Options:
- `--db-path`: Path to the DuckDB database (default: farcaster.db)
- `--output-dir`: Directory to save output files (default: ./output)
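The heavy lifting can be expressed in a couple of SQL statements. A sketch, assuming a `tokens` table like the one built in the processing step (table and column names are illustrative):

```python
import os

import duckdb

os.makedirs("output", exist_ok=True)
con = duckdb.connect("farcaster.db")

# Earliest usage per word; ties on timestamp are broken by the lowest FID.
con.execute("""
    CREATE OR REPLACE TABLE first_said AS
    SELECT word, fid, ts AS "timestamp"
    FROM (
        SELECT word, fid, ts,
               ROW_NUMBER() OVER (
                   PARTITION BY word ORDER BY ts ASC, fid ASC
               ) AS rn
        FROM tokens
    )
    WHERE rn = 1
""")

# Reverse mapping: each FID to the list of words it said first.
con.execute("""
    CREATE OR REPLACE TABLE fid_to_words AS
    SELECT fid,
           LIST(word) AS words,
           LIST("timestamp") AS timestamps
    FROM first_said
    GROUP BY fid
""")

# Export both lookup directions to parquet.
con.execute("COPY first_said TO 'output/first_said.parquet' (FORMAT PARQUET)")
con.execute("COPY fid_to_words TO 'output/fid_to_words.parquet' (FORMAT PARQUET)")
```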
python first_said_finder.py [command] [arguments]
Commands:
- `word [word]`: Find which FID first said a given word
- `fid [fid]`: Find all words first said by a given FID
- `search [pattern]`: Search for words matching a pattern
- `stats`: Show statistics about the dataset
Examples:
# Find who first said "ethereum"
python first_said_finder.py word ethereum
# Find all words first said by FID 1
python first_said_finder.py fid 1
# Search for words containing "meta"
python first_said_finder.py search meta
# Show dataset statistics
python first_said_finder.py stats
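Under the hood, a lookup can stay lazy by querying the exported parquet file directly rather than loading it into memory. A minimal sketch of the `word` command (paths and output format are assumptions, not the script's actual behavior):

```python
import sys

import duckdb

def lookup_word(word: str, path: str = "output/first_said.parquet"):
    """Return (fid, timestamp) for the first use of `word`, or None."""
    # DuckDB scans the parquet lazily, so only the matching row is materialized.
    return duckdb.execute(
        f"""SELECT fid, "timestamp" FROM read_parquet('{path}') WHERE word = ?""",
        [word],
    ).fetchone()

if __name__ == "__main__":
    hit = lookup_word(sys.argv[1])
    print(f"first said by FID {hit[0]} at {hit[1]}" if hit else "word not found")
```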
- Text Normalization:
  - Convert to lowercase
  - Remove URLs
  - Strip # from hashtags and @ from mentions
  - Remove standard punctuation
- Tokenization & Filtering:
  - Split the text into words
  - Apply lemmatization using NLTK's WordNetLemmatizer
  - Filter out stop words and other very common words
- First-Said Identification:
  - For each word, find the earliest timestamp at which it was used
  - If multiple FIDs used a word at the exact same timestamp, the lowest FID is considered the "first"
  - Create bidirectional mappings:
    - word → FID (who said it first)
    - FID → list of words (what words they said first)
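As a standalone illustration of the normalization rules above (not the project's exact code):

```python
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"[#@]", "", text)          # strip hashtag/mention markers
    return re.sub(r"[^\w\s]", " ", text)      # remove standard punctuation

print(normalize("Check out #Ethereum via @dwr: https://farcaster.xyz!").split())
# ['check', 'out', 'ethereum', 'via', 'dwr']
```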
After processing, two main parquet files are created:
- `first_said.parquet`: Maps each word to the FID that said it first
  - Columns: word, fid, timestamp
- `fid_to_words.parquet`: Maps each FID to the list of words they said first
  - Columns: fid, words (array), timestamps (array)
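The exports can be inspected directly with pandas (paths assume the default `./output` directory):

```python
import pandas as pd

first = pd.read_parquet("output/first_said.parquet")     # word, fid, timestamp
by_fid = pd.read_parquet("output/fid_to_words.parquet")  # fid, words, timestamps

print(first.head())
row = by_fid[by_fid["fid"] == 1]
if not row.empty:
    print(row["words"].iloc[0][:10])  # first ten words first said by FID 1
```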
- Processing 157M casts requires significant RAM and CPU resources
- The chunking strategy allows processing on machines with limited memory
- Checkpointing allows resuming processing if interrupted
- The query interface uses lazy loading to minimize memory usage
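Checkpointing can be as simple as persisting the last completed chunk offset. A sketch of one possible approach (not necessarily what `process_casts.py` does):

```python
from pathlib import Path

CHECKPOINT = Path("checkpoint.txt")  # hypothetical checkpoint file

def load_offset() -> int:
    # Resume from the last recorded offset, or start from zero.
    return int(CHECKPOINT.read_text()) if CHECKPOINT.exists() else 0

def save_offset(offset: int) -> None:
    # Record progress after each committed chunk.
    CHECKPOINT.write_text(str(offset))
```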
Contributions are welcome! Please feel free to submit a Pull Request.
- NYT First Said project for the inspiration
- Farcaster team for building an amazing protocol