A comprehensive token analysis system for CelestiaOrg repositories, designed to help AI agents understand codebase size, complexity, and structure for better context management and development assistance.
This system automatically analyzes CelestiaOrg repositories and provides detailed token counts using the GPT-2 tokenizer. The results are published as a JSON API that AI agents can consume to make informed decisions about context window management, repository prioritization, and code analysis strategies.
Endpoint: https://celestiaorg.github.io/tokenmetry/index.json
The JSON includes comprehensive metadata, usage instructions, and detailed token analysis for all configured repositories.
The system analyzes repositories configured in `repos.txt`. Current analysis results:
| Repository | Total Files | Total Tokens | Go Files | Go Tokens | Markdown Files | Markdown Tokens | Rust Files | Rust Tokens | Solidity Files | Solidity Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| celestia-core | 1183 | 3,237,727 | 712 | 2,529,860 | 471 | 707,867 | 0 | 0 | 0 | 0 |
| celestia-app | 451 | 1,012,483 | 367 | 731,017 | 84 | 281,466 | 0 | 0 | 0 | 0 |
| celestia-node | 483 | 752,147 | 462 | 708,167 | 21 | 43,980 | 0 | 0 | 0 | 0 |
| rsmt2d | 14 | 40,074 | 13 | 38,942 | 1 | 1,132 | 0 | 0 | 0 | 0 |
| optimism | 3161 | 8,403,518 | 2596 | 6,570,577 | 106 | 144,136 | 0 | 0 | 459 | 1,688,805 |
| nitro | 742 | 2,110,710 | 581 | 1,598,154 | 13 | 7,132 | 148 | 505,424 | 0 | 0 |
| nitro-contracts | 169 | 397,834 | 0 | 0 | 2 | 2,125 | 0 | 0 | 167 | 395,709 |
| nitro-das-celestia | 39 | 627,586 | 17 | 599,290 | 2 | 3,864 | 0 | 0 | 20 | 24,432 |
| docs | 87 | 220,354 | 1 | 2,297 | 86 | 218,057 | 0 | 0 | 0 | 0 |
| eq-service | 24 | 51,596 | 0 | 0 | 7 | 10,237 | 17 | 41,359 | 0 | 0 |
| CIPs | 54 | 129,101 | 0 | 0 | 54 | 129,101 | 0 | 0 | 0 | 0 |
| pda-proxy | 18 | 45,003 | 0 | 0 | 3 | 5,860 | 15 | 39,143 | 0 | 0 |
| hana | 29 | 25,388 | 0 | 0 | 7 | 315 | 22 | 25,073 | 0 | 0 |
| localestia | 12 | 26,735 | 0 | 0 | 2 | 1,605 | 10 | 25,130 | 0 | 0 |
| rollkit | 328 | 825,569 | 221 | 593,746 | 92 | 156,590 | 15 | 75,233 | 0 | 0 |
| zksync-era | 2383 | 7,029,109 | 0 | 0 | 281 | 704,248 | 2025 | 6,228,643 | 77 | 96,218 |
| dojo | 213 | 741,717 | 0 | 0 | 14 | 7,312 | 199 | 734,405 | 0 | 0 |
| weave | 127 | 351,554 | 123 | 350,094 | 4 | 1,460 | 0 | 0 | 0 | 0 |
| TOTAL | 9517 | 26,028,205 | 5093 | 13,722,144 | 1250 | 2,426,487 | 2451 | 7,674,410 | 723 | 2,205,164 |
Last Updated: 2025-08-16 06:01:49 UTC
File Types: `.go` (Go source code), `.md` (Markdown documentation), `.rs` (Rust source code), and `.sol` (Solidity source code)
Repository contents:

- `tokenizer.py` - Main Python script for token analysis
- `repos.txt` - Configuration file listing repositories to analyze
- `.github/workflows/tokenizer.yml` - GitHub Actions workflow for automation
- `pyproject.toml` - Poetry dependency management

Key features:
- GPT-2 Tokenizer - Consistent token counting across all content
- Repository Cloning - Shallow clones for efficient analysis
- File-level Analysis - Detailed breakdown of individual files
- AI-Optimized Output - Metadata and usage instructions for AI agents
- Automated Deployment - Daily updates and on-demand execution
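For reference, counting a file's tokens with the GPT-2 tokenizer can look like the minimal sketch below. It assumes the Hugging Face `transformers` package and a hypothetical `count_file_tokens` helper; `tokenizer.py` may differ in its details.

```python
# Minimal sketch of GPT-2 token counting (assumes the `transformers` package is installed).
from transformers import GPT2TokenizerFast

gpt2 = GPT2TokenizerFast.from_pretrained("gpt2")

def count_file_tokens(path: str) -> int:
    """Read a file as UTF-8 (ignoring undecodable bytes) and return its GPT-2 token count."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return len(gpt2.encode(text))

print(count_file_tokens("example.go"))
```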
AI agents can fetch and explore the published data directly:

```python
import requests

# Fetch token telemetry data
response = requests.get('https://celestiaorg.github.io/tokenmetry/index.json')
data = response.json()

# Get overall statistics
total_tokens = data['summary']['total_tokens']
total_repos = data['summary']['total_repositories']

# Discover available repositories
for repo in data['repositories']:
    name = repo['repository']['name']
    url = repo['repository']['url']
    tokens = repo['total_tokens']
    print(f"{name}: {tokens:,} tokens")
```
Example use cases:

- Repository Prioritization

  ```python
  # Sort repositories by size for context planning
  repos_by_size = sorted(data['repositories'], key=lambda x: x['total_tokens'])
  ```

- File Analysis

  ```python
  # Find largest files that might need chunking
  for repo in data['repositories']:
      large_files = [f for f in repo['files'] if f['tokens'] > 5000]
  ```

- Language Breakdown

  ```python
  # Understand code vs documentation ratio
  go_tokens = data['summary']['by_extension']['.go']['tokens']
  md_tokens = data['summary']['by_extension']['.md']['tokens']
  ```
Prerequisites:

- Python 3.9+
- Poetry
- Git
```bash
# Clone the repository
git clone https://github.com/celestiaorg/tokenmetry.git
cd tokenmetry

# Install dependencies
poetry install
```

```bash
# Run analysis on all configured repositories
poetry run python tokenizer.py --celestia-repos --output results.json

# Analyze a single repository
poetry run python tokenizer.py --repo https://github.com/celestiaorg/celestia-app.git

# Analyze a local directory
poetry run python tokenizer.py --directory /path/to/repo

# Analyze a single file
poetry run python tokenizer.py --file example.go
```
```
usage: tokenizer.py [-h] (--file FILE | --directory DIRECTORY | --repo REPO | --celestia-repos | --text TEXT)
                    [--repo-file REPO_FILE] [--output OUTPUT] [--verbose]

options:
  --file, -f         Path to file to tokenize
  --directory, -d    Path to directory to process
  --repo, -r         Repository URL to clone and process
  --celestia-repos   Process all repositories from repos.txt
  --text, -t         Text string to tokenize
  --repo-file        Path to repository list file (default: repos.txt)
  --output, -o       Output JSON file path
  --verbose, -v      Show detailed file-by-file results
```
Edit `repos.txt` to add or remove repositories:
```
# CelestiaOrg Repository List
# One repository URL per line
# Lines starting with # are comments

https://github.com/celestiaorg/celestia-core
https://github.com/celestiaorg/celestia-app
https://github.com/celestiaorg/celestia-node
https://github.com/celestiaorg/docs
```
Changes to `repos.txt` automatically trigger workflow runs.
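For illustration, the documented format (one URL per line, `#` comments and blank lines ignored) can be parsed with a few lines of Python. This is a sketch, not necessarily how `tokenizer.py` reads the file:

```python
from pathlib import Path

def load_repo_urls(path: str = "repos.txt") -> list[str]:
    """Return repository URLs, skipping blank lines and # comments."""
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

print(load_repo_urls())
```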
To enable GitHub Pages deployment:

- Go to repository Settings → Pages
- Set Source to "GitHub Actions"
- The workflow will handle deployment automatically
The workflow runs:

- Scheduled: Daily at 6 AM UTC
- Manual: Via GitHub Actions UI ("Run workflow" button)
- Automatic: On changes to:
  - `tokenizer.py`
  - `.github/workflows/tokenizer.yml`
  - `repos.txt`
To trigger a run manually:

- Go to the Actions tab in GitHub
- Select the "Token Telemetry" workflow
- Click "Run workflow"
- Select the branch and click "Run workflow"
The published `index.json` has the following structure:

```json
{
  "metadata": {
    "generated_at": "2025-06-17T14:00:00Z",
    "purpose": "Token analysis for AI context management",
    "usage_instructions": { /* AI guidance */ },
    "data_structure": { /* Format explanation */ }
  },
  "summary": {
    "total_repositories": 4,
    "total_files": 2123,
    "total_tokens": 4917390,
    "by_extension": {
      ".go": { "files": 1553, "tokens": 3904304 },
      ".md": { "files": 570, "tokens": 1013086 }
    }
  },
  "repositories": [
    {
      "directory": "celestia-core",
      "repository": {
        "name": "celestia-core",
        "url": "https://github.com/celestiaorg/celestia-core"
      },
      "total_files": 1179,
      "total_tokens": 3202991,
      "by_extension": { /* breakdown by file type */ },
      "files": [ /* individual file analysis */ ]
    }
  ]
}
```
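Given this structure, an AI agent can estimate how many context windows a repository would occupy. The sketch below uses only the fields shown above; the 128,000-token budget is an illustrative assumption, not part of the API:

```python
import math
import requests

CONTEXT_BUDGET = 128_000  # illustrative context-window size in tokens

data = requests.get("https://celestiaorg.github.io/tokenmetry/index.json", timeout=30).json()
for repo in data["repositories"]:
    name = repo["repository"]["name"]
    total = repo["total_tokens"]
    # Number of windows needed if the whole repository were loaded as context
    windows = math.ceil(total / CONTEXT_BUDGET)
    print(f"{name}: {total:,} tokens (~{windows} windows of {CONTEXT_BUDGET:,})")
```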
Check workflow runs in the Actions tab to monitor:
- Execution success/failure
- Processing time
- Token count changes over time
The system includes robust error handling:
- Repository cloning failures are logged but don't stop other repositories
- File encoding issues are skipped with warnings
- Network timeouts are retried automatically
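The clone-failure handling follows a skip-and-continue pattern, roughly like the sketch below. It assumes the `git` CLI is available and uses an illustrative repository list; it is not the script's actual code:

```python
import subprocess

def clone_shallow(url: str, dest: str) -> bool:
    """Shallow-clone a repository; log a warning and return False on failure instead of raising."""
    result = subprocess.run(
        ["git", "clone", "--depth", "1", url, dest],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"WARNING: failed to clone {url}: {result.stderr.strip()}")
        return False
    return True

# A single failure is logged but does not stop the remaining repositories.
repo_urls = ["https://github.com/celestiaorg/rsmt2d"]  # example list
for url in repo_urls:
    dest = url.rstrip("/").split("/")[-1]
    if not clone_shallow(url, dest):
        continue
    print(f"Analyzing {dest} ...")  # token counting would happen here
```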
Edit `tokenizer.py` to support additional file extensions:

```python
# In count_tokens_in_file function
if extension not in ['.go', '.md', '.rs', '.py']:  # Add new extensions
    return 0, extension
```
Last Updated: Auto-generated daily at 6 AM UTC
API Endpoint: https://celestiaorg.github.io/tokenmetry/index.json