jc4p/farcaster-first-said

Farcaster "First Said"

A system that identifies which Farcaster user (FID) was the first to use specific words on the network, inspired by the NYT "First Said" feature.

Overview

This project processes a 157M-row dataset of Farcaster casts to:

  1. Extract and tokenize words from each cast
  2. Identify which FID said each word first
  3. Provide a simple query interface to look up the history of first usages

The system filters out stop words and very common words so that the results focus on meaningful "first said" instances.

Requirements

  • Python 3.8+
  • 32GB RAM recommended
  • ~200GB disk space for processing
  • Dependencies: duckdb, pandas, pyarrow, nltk, tqdm

Installation

# Clone the repository
git clone <repository-url>
cd farcaster-first-said

# Install dependencies
pip install -r requirements.txt

# Download NLTK resources
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('omw-1.4')"

Usage

The system consists of several scripts that handle different phases of the process:

1. Preprocessing and Tokenization

python process_casts.py /path/to/farcaster_casts.parquet

This script:

  • Loads the parquet file of Farcaster casts
  • Processes the casts in chunks to conserve memory
  • Tokenizes each cast, applying lemmatization
  • Filters out stop words and common words
  • Stores the processed tokens in a DuckDB database

Options:

  • --chunk-size: Number of rows to process at once (default: 100000)
  • --db-path: Path to the DuckDB database (default: farcaster.db)
  • --reset: Ignore any existing checkpoint and start from the beginning

2. First-Usage Identification

python identify_first_said.py

This script:

  • Identifies the first usage of each word
  • Handles tie-breaking (lowest FID wins if same timestamp)
  • Creates optimized tables for both lookup directions
  • Exports the data to parquet files for querying

Options:

  • --db-path: Path to the DuckDB database (default: farcaster.db)
  • --output-dir: Directory to save output files (default: ./output)

3. Querying the Data

python first_said_finder.py [command] [arguments]

Commands:

  • word [word]: Find which FID first said a given word
  • fid [fid]: Find all words first said by a given FID
  • search [pattern]: Search for words matching a pattern
  • stats: Show statistics about the dataset

Examples:

# Find who first said "ethereum"
python first_said_finder.py word ethereum

# Find all words first said by FID 1
python first_said_finder.py fid 1

# Search for words containing "meta"
python first_said_finder.py search meta

# Show dataset statistics
python first_said_finder.py stats
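
To make the four commands concrete, here is a minimal sketch of how such a command-line interface could be wired up with argparse. It is not the code in first_said_finder.py: the FIRST_SAID dictionary is a hypothetical in-memory stand-in for the exported data, and treating "pattern" as a plain substring match is an assumption based on the "containing" wording in the example above.

```python
import argparse

# Hypothetical in-memory stand-in for the exported data: word -> (fid, timestamp).
FIRST_SAID = {
    "ethereum": (3, 1000),
    "metaverse": (7, 1500),
    "metadata": (3, 1600),
}


def build_parser():
    """Build a parser with one subcommand per documented command."""
    parser = argparse.ArgumentParser(prog="first_said_finder.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("word", help="who first said a word").add_argument("word")
    sub.add_parser("fid", help="words first said by a FID").add_argument("fid", type=int)
    sub.add_parser("search", help="words matching a pattern").add_argument("pattern")
    sub.add_parser("stats", help="dataset statistics")
    return parser


def run(argv, first_said=FIRST_SAID):
    """Parse argv and return the result for the chosen command."""
    args = build_parser().parse_args(argv)
    if args.command == "word":
        return first_said.get(args.word.lower())
    if args.command == "fid":
        return sorted(w for w, (fid, _) in first_said.items() if fid == args.fid)
    if args.command == "search":
        # Assumption: a pattern is a plain substring, not a regex or glob.
        return sorted(w for w in first_said if args.pattern.lower() in w)
    # "stats": count distinct words and distinct first-sayer FIDs.
    return {"words": len(first_said), "fids": len({f for f, _ in first_said.values()})}


print(run(["word", "ethereum"]))
print(run(["search", "meta"]))
```

Returning values from run instead of printing inside each branch keeps the dispatch easy to test without capturing output.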

Processing Details

Text Processing Pipeline

  1. Text Normalization:

    • Convert to lowercase
    • Remove URLs
    • Strip # from hashtags and @ from mentions
    • Remove standard punctuation
  2. Tokenization & Filtering:

    • Split text into words
    • Apply lemmatization using NLTK's WordNetLemmatizer
    • Filter out stop words and very common words
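
The pipeline above can be sketched in a few lines of standard-library Python. This is an illustration, not the code in process_casts.py: the STOP_WORDS set and both regular expressions are assumptions, and lemmatization is passed in as an optional callable (for example, the lemmatize method of NLTK's WordNetLemmatizer) so that the sketch runs without NLTK data.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the project uses
# NLTK's stop-word list plus its own list of common words.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "on", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # assumed URL pattern
PUNCT_RE = re.compile(r"[^\w\s]")              # anything that is not a word char or space


def normalize(text):
    """Lowercase, drop URLs, and remove punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # '#' and '@' are punctuation, so hashtags and mentions lose their prefix here.
    return PUNCT_RE.sub(" ", text)


def tokenize(text, lemmatize=None, stop_words=STOP_WORDS):
    """Split normalized text into words, lemmatize if asked, then filter."""
    tokens = normalize(text).split()
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("GM @alice! The #Ethereum merge is at https://example.com/x?y=1 today.")
print(tokens)  # ['gm', 'alice', 'ethereum', 'merge', 'today']
```

URLs are removed before punctuation so that their slashes and dots do not leave stray fragments behind as tokens.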

Identification Logic

  • For each word, find the earliest timestamp at which it was used
  • If multiple FIDs used the word at exactly the same timestamp, the lowest FID is considered the "first"
  • Create bidirectional mappings:
    • word → FID (who said it first)
    • FID → list of words (what words they said first)
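
The rules above amount to taking the minimum of (timestamp, FID) per word. A minimal in-memory sketch, assuming usages arrive as (word, fid, timestamp) tuples with integer timestamps (the function names and sample data are hypothetical, and the project itself does this work inside DuckDB):

```python
def first_said(usages):
    """Map each word to the (fid, timestamp) of its first usage."""
    first = {}
    for word, fid, ts in usages:
        best = first.get(word)
        # Compare (timestamp, fid) tuples: the earlier timestamp wins,
        # and on a tie the lower FID wins.
        if best is None or (ts, fid) < (best[1], best[0]):
            first[word] = (fid, ts)
    return first


def words_by_fid(first):
    """Invert word -> (fid, timestamp) into fid -> words, oldest first."""
    by_fid = {}
    for word, (fid, ts) in sorted(first.items(), key=lambda kv: (kv[1][1], kv[0])):
        by_fid.setdefault(fid, []).append(word)
    return by_fid


usages = [
    ("ethereum", 7, 1000),
    ("ethereum", 3, 1000),  # same timestamp as FID 7, so the lower FID wins
    ("ethereum", 1, 2000),  # lowest FID overall, but later, so it loses
    ("frames", 7, 1500),
    ("degen", 3, 1200),
]
first = first_said(usages)
by_fid = words_by_fid(first)
print(first["ethereum"])  # (3, 1000)
print(by_fid)             # {3: ['ethereum', 'degen'], 7: ['frames']}
```

Comparing tuples keeps the tie-break in one expression, so the timestamp ordering and the FID ordering cannot drift apart.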

Data Files

After processing, two main parquet files are created:

  1. first_said.parquet: Maps each word to the FID that said it first

    • Columns: word, fid, timestamp
  2. fid_to_words.parquet: Maps each FID to the list of words they said first

    • Columns: fid, words (array), timestamps (array)
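
As an illustration of how the two layouts relate, the sketch below reshapes first_said-style rows into fid_to_words-style rows using plain dictionaries. It assumes that words and timestamps are index-aligned (words[i] was first said at timestamps[i]) and ordered by timestamp; the README states only the column names, so treat both as assumptions. The function name to_fid_rows is hypothetical.

```python
from collections import defaultdict


def to_fid_rows(first_said_rows):
    """Group {word, fid, timestamp} rows into {fid, words, timestamps} rows."""
    grouped = defaultdict(list)
    for row in first_said_rows:
        grouped[row["fid"]].append((row["timestamp"], row["word"]))

    out = []
    for fid in sorted(grouped):
        pairs = sorted(grouped[fid])  # oldest first; ties broken alphabetically
        out.append({
            "fid": fid,
            "words": [word for _, word in pairs],
            "timestamps": [ts for ts, _ in pairs],
        })
    return out


first_said_rows = [
    {"word": "frames", "fid": 7, "timestamp": 1500},
    {"word": "degen", "fid": 3, "timestamp": 1200},
    {"word": "ethereum", "fid": 3, "timestamp": 1000},
]
fid_rows = to_fid_rows(first_said_rows)
for row in fid_rows:
    # zip() pairs each word with its timestamp under the alignment assumption.
    print(row["fid"], list(zip(row["words"], row["timestamps"])))
```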

Performance Considerations

  • Processing 157M casts is RAM- and CPU-intensive (see Requirements: 32GB RAM recommended, ~200GB disk space)
  • The chunking strategy allows processing on machines with limited memory
  • Checkpointing allows resuming processing if interrupted
  • The query interface uses lazy loading to minimize memory usage
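
Chunking and checkpointing work together: only one chunk is held in memory at a time, and progress is recorded after each chunk so that a restart can skip completed work. The sketch below shows the idea with an in-memory list standing in for the parquet file. The checkpoint format (a JSON file with a next_row field) and all function names are assumptions, not the format process_casts.py uses.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next row to process (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_row"]


def save_checkpoint(path, next_row):
    """Record progress; write to a temp file and rename so that an
    interrupted write cannot leave a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_row": next_row}, f)
    os.replace(tmp, path)


def process_in_chunks(rows, handle_chunk, checkpoint_path, chunk_size=100_000, reset=False):
    """Process rows chunk by chunk, resuming from the checkpoint unless reset."""
    start = 0 if reset else load_checkpoint(checkpoint_path)
    for offset in range(start, len(rows), chunk_size):
        chunk = rows[offset:offset + chunk_size]
        handle_chunk(chunk)
        # Save only after the chunk is fully handled.
        save_checkpoint(checkpoint_path, offset + len(chunk))


rows = list(range(10))
seen = []
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

process_in_chunks(rows[:8], seen.extend, checkpoint, chunk_size=4)  # run that stops after 8 rows
process_in_chunks(rows, seen.extend, checkpoint, chunk_size=4)      # resumes at row 8
print(seen)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because the checkpoint is saved after each chunk, a crash in the middle of a chunk means that chunk is processed again on restart, so the per-chunk work should be safe to repeat.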

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Acknowledgments

  • NYT First Said project for the inspiration
  • Farcaster team for building an amazing protocol
