A comprehensive token analysis system for CelestiaOrg repositories, designed to help AI agents understand codebase size, complexity, and structure for better context management and development assistance.
This system automatically analyzes CelestiaOrg repositories and provides detailed token counts using the GPT-2 tokenizer. The results are published as a JSON API that AI agents can consume to make informed decisions about context window management, repository prioritization, and code analysis strategies.
Endpoint: https://celestiaorg.github.io/tokenmetry/index.json
The JSON includes comprehensive metadata, usage instructions, and detailed token analysis for all configured repositories.
The system analyzes repositories configured in `repos.txt`. Current analysis results:
| Repository | Total Files | Total Tokens | Go Files | Go Tokens | Markdown Files | Markdown Tokens | Rust Files | Rust Tokens | Solidity Files | Solidity Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| celestia-core | 1183 | 3,237,727 | 712 | 2,529,860 | 471 | 707,867 | 0 | 0 | 0 | 0 |
| celestia-app | 451 | 1,012,483 | 367 | 731,017 | 84 | 281,466 | 0 | 0 | 0 | 0 |
| celestia-node | 483 | 752,147 | 462 | 708,167 | 21 | 43,980 | 0 | 0 | 0 | 0 |
| rsmt2d | 14 | 40,074 | 13 | 38,942 | 1 | 1,132 | 0 | 0 | 0 | 0 |
| optimism | 3161 | 8,403,518 | 2596 | 6,570,577 | 106 | 144,136 | 0 | 0 | 459 | 1,688,805 |
| nitro | 742 | 2,110,710 | 581 | 1,598,154 | 13 | 7,132 | 148 | 505,424 | 0 | 0 |
| nitro-contracts | 169 | 397,834 | 0 | 0 | 2 | 2,125 | 0 | 0 | 167 | 395,709 |
| nitro-das-celestia | 39 | 627,586 | 17 | 599,290 | 2 | 3,864 | 0 | 0 | 20 | 24,432 |
| docs | 87 | 220,354 | 1 | 2,297 | 86 | 218,057 | 0 | 0 | 0 | 0 |
| eq-service | 24 | 51,596 | 0 | 0 | 7 | 10,237 | 17 | 41,359 | 0 | 0 |
| CIPs | 54 | 129,101 | 0 | 0 | 54 | 129,101 | 0 | 0 | 0 | 0 |
| pda-proxy | 18 | 45,003 | 0 | 0 | 3 | 5,860 | 15 | 39,143 | 0 | 0 |
| hana | 29 | 25,388 | 0 | 0 | 7 | 315 | 22 | 25,073 | 0 | 0 |
| localestia | 12 | 26,735 | 0 | 0 | 2 | 1,605 | 10 | 25,130 | 0 | 0 |
| rollkit | 328 | 825,569 | 221 | 593,746 | 92 | 156,590 | 15 | 75,233 | 0 | 0 |
| zksync-era | 2383 | 7,029,109 | 0 | 0 | 281 | 704,248 | 2025 | 6,228,643 | 77 | 96,218 |
| dojo | 213 | 741,717 | 0 | 0 | 14 | 7,312 | 199 | 734,405 | 0 | 0 |
| weave | 127 | 351,554 | 123 | 350,094 | 4 | 1,460 | 0 | 0 | 0 | 0 |
| TOTAL | 9517 | 26,028,205 | 5093 | 13,722,144 | 1250 | 2,426,487 | 2451 | 7,674,410 | 723 | 2,205,164 |
Last Updated: 2025-08-16 06:01:49 UTC
File Types: `.go` (Go source code), `.md` (Markdown documentation), `.rs` (Rust source code), and `.sol` (Solidity source code)
Repository contents:

- `tokenizer.py` - Main Python script for token analysis
- `repos.txt` - Configuration file listing repositories to analyze
- `.github/workflows/tokenizer.yml` - GitHub Actions workflow for automation
- `pyproject.toml` - Poetry dependency management

Key features:
- GPT-2 Tokenizer - Consistent token counting across all content
- Repository Cloning - Shallow clones for efficient analysis
- File-level Analysis - Detailed breakdown of individual files
- AI-Optimized Output - Metadata and usage instructions for AI agents
- Automated Deployment - Daily updates and on-demand execution
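For reference, counting a file's tokens with the GPT-2 tokenizer can look like the minimal sketch below. It assumes the Hugging Face `transformers` package and a hypothetical `count_file_tokens` helper; `tokenizer.py` may differ in its details.

```python
# Minimal sketch of GPT-2 token counting (assumes the `transformers` package is installed).
from transformers import GPT2TokenizerFast

gpt2 = GPT2TokenizerFast.from_pretrained("gpt2")

def count_file_tokens(path: str) -> int:
    """Read a file as UTF-8 (ignoring undecodable bytes) and return its GPT-2 token count."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return len(gpt2.encode(text))

print(count_file_tokens("example.go"))
```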
AI agents can fetch and explore the published data directly:

```python
import requests

# Fetch token telemetry data
response = requests.get('https://celestiaorg.github.io/tokenmetry/index.json')
data = response.json()

# Get overall statistics
total_tokens = data['summary']['total_tokens']
total_repos = data['summary']['total_repositories']

# Discover available repositories
for repo in data['repositories']:
    name = repo['repository']['name']
    url = repo['repository']['url']
    tokens = repo['total_tokens']
    print(f"{name}: {tokens:,} tokens")
```
Example use cases:

- Repository Prioritization

  ```python
  # Sort repositories by size for context planning
  repos_by_size = sorted(data['repositories'], key=lambda x: x['total_tokens'])
  ```

- File Analysis

  ```python
  # Find largest files that might need chunking
  for repo in data['repositories']:
      large_files = [f for f in repo['files'] if f['tokens'] > 5000]
  ```

- Language Breakdown

  ```python
  # Understand code vs documentation ratio
  go_tokens = data['summary']['by_extension']['.go']['tokens']
  md_tokens = data['summary']['by_extension']['.md']['tokens']
  ```
Prerequisites:

- Python 3.9+
- Poetry
- Git
```bash
# Clone the repository
git clone https://github.com/celestiaorg/tokenmetry.git
cd tokenmetry

# Install dependencies
poetry install
```

```bash
# Run analysis on all configured repositories
poetry run python tokenizer.py --celestia-repos --output results.json

# Analyze a single repository
poetry run python tokenizer.py --repo https://github.com/celestiaorg/celestia-app.git

# Analyze a local directory
poetry run python tokenizer.py --directory /path/to/repo

# Analyze a single file
poetry run python tokenizer.py --file example.go
```
```
usage: tokenizer.py [-h] (--file FILE | --directory DIRECTORY | --repo REPO | --celestia-repos | --text TEXT)
                    [--repo-file REPO_FILE] [--output OUTPUT] [--verbose]

options:
  --file, -f         Path to file to tokenize
  --directory, -d    Path to directory to process
  --repo, -r         Repository URL to clone and process
  --celestia-repos   Process all repositories from repos.txt
  --text, -t         Text string to tokenize
  --repo-file        Path to repository list file (default: repos.txt)
  --output, -o       Output JSON file path
  --verbose, -v      Show detailed file-by-file results
```
Edit `repos.txt` to add or remove repositories:
```
# CelestiaOrg Repository List
# One repository URL per line
# Lines starting with # are comments

https://github.com/celestiaorg/celestia-core
https://github.com/celestiaorg/celestia-app
https://github.com/celestiaorg/celestia-node
https://github.com/celestiaorg/docs
```
Changes to `repos.txt` automatically trigger workflow runs.
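For illustration, the documented format (one URL per line, `#` comments and blank lines ignored) can be parsed with a few lines of Python. This is a sketch, not necessarily how `tokenizer.py` reads the file:

```python
from pathlib import Path

def load_repo_urls(path: str = "repos.txt") -> list[str]:
    """Return repository URLs, skipping blank lines and # comments."""
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

print(load_repo_urls())
```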
To enable GitHub Pages deployment:

- Go to repository Settings → Pages
- Set Source to "GitHub Actions"
- The workflow will handle deployment automatically
The workflow runs:

- Scheduled: Daily at 6 AM UTC
- Manual: Via GitHub Actions UI ("Run workflow" button)
- Automatic: On changes to:
  - `tokenizer.py`
  - `.github/workflows/tokenizer.yml`
  - `repos.txt`
To trigger a run manually:

- Go to the Actions tab in GitHub
- Select the "Token Telemetry" workflow
- Click "Run workflow"
- Select the branch and click "Run workflow"
The published `index.json` has the following structure:

```json
{
  "metadata": {
    "generated_at": "2025-06-17T14:00:00Z",
    "purpose": "Token analysis for AI context management",
    "usage_instructions": { /* AI guidance */ },
    "data_structure": { /* Format explanation */ }
  },
  "summary": {
    "total_repositories": 4,
    "total_files": 2123,
    "total_tokens": 4917390,
    "by_extension": {
      ".go": { "files": 1553, "tokens": 3904304 },
      ".md": { "files": 570, "tokens": 1013086 }
    }
  },
  "repositories": [
    {
      "directory": "celestia-core",
      "repository": {
        "name": "celestia-core",
        "url": "https://github.com/celestiaorg/celestia-core"
      },
      "total_files": 1179,
      "total_tokens": 3202991,
      "by_extension": { /* breakdown by file type */ },
      "files": [ /* individual file analysis */ ]
    }
  ]
}
```
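Given this structure, an AI agent can estimate how many context windows a repository would occupy. The sketch below uses only the fields shown above; the 128,000-token budget is an illustrative assumption, not part of the API:

```python
import math
import requests

CONTEXT_BUDGET = 128_000  # illustrative context-window size in tokens

data = requests.get("https://celestiaorg.github.io/tokenmetry/index.json", timeout=30).json()
for repo in data["repositories"]:
    name = repo["repository"]["name"]
    total = repo["total_tokens"]
    # Number of windows needed if the whole repository were loaded as context
    windows = math.ceil(total / CONTEXT_BUDGET)
    print(f"{name}: {total:,} tokens (~{windows} windows of {CONTEXT_BUDGET:,})")
```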
Check workflow runs in the Actions tab to monitor:
- Execution success/failure
- Processing time
- Token count changes over time
The system includes robust error handling:
- Repository cloning failures are logged but don't stop other repositories
- File encoding issues are skipped with warnings
- Network timeouts are retried automatically
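The clone-failure handling follows a skip-and-continue pattern, roughly like the sketch below. It assumes the `git` CLI is available and uses an illustrative repository list; it is not the script's actual code:

```python
import subprocess

def clone_shallow(url: str, dest: str) -> bool:
    """Shallow-clone a repository; log a warning and return False on failure instead of raising."""
    result = subprocess.run(
        ["git", "clone", "--depth", "1", url, dest],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"WARNING: failed to clone {url}: {result.stderr.strip()}")
        return False
    return True

# A single failure is logged but does not stop the remaining repositories.
repo_urls = ["https://github.com/celestiaorg/rsmt2d"]  # example list
for url in repo_urls:
    dest = url.rstrip("/").split("/")[-1]
    if not clone_shallow(url, dest):
        continue
    print(f"Analyzing {dest} ...")  # token counting would happen here
```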
Edit `tokenizer.py` to support additional file extensions:

```python
# In count_tokens_in_file function
if extension not in ['.go', '.md', '.rs', '.py']:  # Add new extensions
    return 0, extension
```
Last Updated: Auto-generated daily at 6 AM UTC
API Endpoint: https://celestiaorg.github.io/tokenmetry/index.json