Token Telemetry System

A comprehensive token analysis system for CelestiaOrg repositories, designed to help AI agents understand codebase size, complexity, and structure for better context management and development assistance.

🎯 Purpose

This system automatically analyzes CelestiaOrg repositories and provides detailed token counts using the GPT-2 tokenizer. The results are published as a JSON API that AI agents can consume to make informed decisions about context window management, repository prioritization, and code analysis strategies.
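For reference, comparable counts can be reproduced locally with the GPT-2 tokenizer from Hugging Face transformers (an illustrative sketch, not the exact code path inside tokenizer.py):

from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer (vocabulary files are downloaded on first use)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

with open("example.go") as f:
    text = f.read()

# A file's token count is the length of its encoded sequence
print(f"example.go: {len(tokenizer.encode(text)):,} tokens")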

πŸš€ Live API

Endpoint: https://celestiaorg.github.io/tokenmetry/index.json

The JSON includes comprehensive metadata, usage instructions, and detailed token analysis for all configured repositories.

πŸ“Š What's Analyzed

The system analyzes repositories configured in repos.txt. Current analysis results:

| Repository | Total Files | Total Tokens | Go Files | Go Tokens | Markdown Files | Markdown Tokens | Rust Files | Rust Tokens | Solidity Files | Solidity Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| celestia-core | 1183 | 3,237,727 | 712 | 2,529,860 | 471 | 707,867 | 0 | 0 | 0 | 0 |
| celestia-app | 451 | 1,012,483 | 367 | 731,017 | 84 | 281,466 | 0 | 0 | 0 | 0 |
| celestia-node | 483 | 752,147 | 462 | 708,167 | 21 | 43,980 | 0 | 0 | 0 | 0 |
| rsmt2d | 14 | 40,074 | 13 | 38,942 | 1 | 1,132 | 0 | 0 | 0 | 0 |
| optimism | 3161 | 8,403,518 | 2596 | 6,570,577 | 106 | 144,136 | 0 | 0 | 459 | 1,688,805 |
| nitro | 742 | 2,110,710 | 581 | 1,598,154 | 13 | 7,132 | 148 | 505,424 | 0 | 0 |
| nitro-contracts | 169 | 397,834 | 0 | 0 | 2 | 2,125 | 0 | 0 | 167 | 395,709 |
| nitro-das-celestia | 39 | 627,586 | 17 | 599,290 | 2 | 3,864 | 0 | 0 | 20 | 24,432 |
| docs | 87 | 220,354 | 1 | 2,297 | 86 | 218,057 | 0 | 0 | 0 | 0 |
| eq-service | 24 | 51,596 | 0 | 0 | 7 | 10,237 | 17 | 41,359 | 0 | 0 |
| CIPs | 54 | 129,101 | 0 | 0 | 54 | 129,101 | 0 | 0 | 0 | 0 |
| pda-proxy | 18 | 45,003 | 0 | 0 | 3 | 5,860 | 15 | 39,143 | 0 | 0 |
| hana | 29 | 25,388 | 0 | 0 | 7 | 315 | 22 | 25,073 | 0 | 0 |
| localestia | 12 | 26,735 | 0 | 0 | 2 | 1,605 | 10 | 25,130 | 0 | 0 |
| rollkit | 328 | 825,569 | 221 | 593,746 | 92 | 156,590 | 15 | 75,233 | 0 | 0 |
| zksync-era | 2383 | 7,029,109 | 0 | 0 | 281 | 704,248 | 2025 | 6,228,643 | 77 | 96,218 |
| dojo | 213 | 741,717 | 0 | 0 | 14 | 7,312 | 199 | 734,405 | 0 | 0 |
| weave | 127 | 351,554 | 123 | 350,094 | 4 | 1,460 | 0 | 0 | 0 | 0 |
| TOTAL | 9517 | 26,028,205 | 5093 | 13,722,144 | 1250 | 2,426,487 | 2451 | 7,674,410 | 723 | 2,205,164 |

Last Updated: 2025-08-16 06:01:49 UTC

File Types: .go (Go source), .md (Markdown documentation), .rs (Rust source), and .sol (Solidity source)

πŸ”§ Components

Core Files

  • tokenizer.py - Main Python script for token analysis
  • repos.txt - Configuration file listing repositories to analyze
  • .github/workflows/tokenizer.yml - GitHub Actions workflow for automation
  • pyproject.toml - Poetry dependency management

Key Features

  • GPT-2 Tokenizer - Consistent token counting across all content
  • Repository Cloning - Shallow clones for efficient analysis (see the sketch after this list)
  • File-level Analysis - Detailed breakdown of individual files
  • AI-Optimized Output - Metadata and usage instructions for AI agents
  • Automated Deployment - Daily updates and on-demand execution
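A shallow clone fetches only the latest snapshot rather than the full history, which keeps the daily run fast. A minimal sketch of the approach (the helper name shallow_clone is illustrative, not tokenizer.py's actual function):

import subprocess
import tempfile

def shallow_clone(url: str) -> str:
    """Clone only the newest commit of a repository into a temp directory."""
    dest = tempfile.mkdtemp(prefix="tokenmetry-")
    subprocess.run(["git", "clone", "--depth", "1", url, dest], check=True)
    return dest

print(shallow_clone("https://github.com/celestiaorg/rsmt2d"))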

πŸ€– AI Agent Usage

Quick Start

import requests

# Fetch token telemetry data
response = requests.get('https://celestiaorg.github.io/tokenmetry/index.json')
data = response.json()

# Get overall statistics
total_tokens = data['summary']['total_tokens']
total_repos = data['summary']['total_repositories']

# Discover available repositories
for repo in data['repositories']:
    name = repo['repository']['name']
    url = repo['repository']['url']
    tokens = repo['total_tokens']
    print(f"{name}: {tokens:,} tokens")

Context Management Strategies

  1. Repository Prioritization

    # Sort repositories by size for context planning
    repos_by_size = sorted(data['repositories'], key=lambda x: x['total_tokens'])
  2. File Analysis

    # Find largest files that might need chunking
    for repo in data['repositories']:
        large_files = [f for f in repo['files'] if f['tokens'] > 5000]
  3. Language Breakdown

    # Understand code vs documentation ratio
    go_tokens = data['summary']['by_extension']['.go']['tokens']
    md_tokens = data['summary']['by_extension']['.md']['tokens']
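Putting these strategies together, an agent can select the subset of repositories that fits a fixed context budget (a sketch continuing from the Quick Start snippet above; the budget value is arbitrary):

CONTEXT_BUDGET = 128_000  # illustrative budget, not a specific model's limit

selected, used = [], 0
# Greedily pack smallest repositories first so more of them fit
for repo in sorted(data['repositories'], key=lambda r: r['total_tokens']):
    if used + repo['total_tokens'] <= CONTEXT_BUDGET:
        selected.append(repo['repository']['name'])
        used += repo['total_tokens']

print(f"{len(selected)} repositories fit in {used:,} of {CONTEXT_BUDGET:,} tokens")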

πŸ› οΈ Local Development

Prerequisites

  • Python 3.9+
  • Poetry
  • Git

Setup

# Clone the repository
git clone https://github.com/celestiaorg/tokenmetry.git
cd tokenmetry

# Install dependencies
poetry install

# Run analysis on all configured repositories
poetry run python tokenizer.py --celestia-repos --output results.json

# Analyze a single repository
poetry run python tokenizer.py --repo https://github.com/celestiaorg/celestia-app.git

# Analyze a local directory
poetry run python tokenizer.py --directory /path/to/repo

# Analyze a single file
poetry run python tokenizer.py --file example.go

Command Line Options

usage: tokenizer.py [-h] (--file FILE | --directory DIRECTORY | --repo REPO | --celestia-repos | --text TEXT)
                    [--repo-file REPO_FILE] [--output OUTPUT] [--verbose]

options:
  --file, -f           Path to file to tokenize
  --directory, -d      Path to directory to process
  --repo, -r           Repository URL to clone and process
  --celestia-repos     Process all repositories from repos.txt
  --text, -t           Text string to tokenize
  --repo-file          Path to repository list file (default: repos.txt)
  --output, -o         Output JSON file path
  --verbose, -v        Show detailed file-by-file results

πŸ“ Configuration

Adding Repositories

Edit repos.txt to add or remove repositories:

# CelestiaOrg Repository List
# One repository URL per line
# Lines starting with # are comments

https://github.com/celestiaorg/celestia-core
https://github.com/celestiaorg/celestia-app
https://github.com/celestiaorg/celestia-node
https://github.com/celestiaorg/docs

Changes to repos.txt automatically trigger workflow runs.
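For reference, the comment and blank-line rules above take only a few lines to honor (an illustrative sketch; tokenizer.py's actual parsing may differ):

def load_repo_list(path: str = "repos.txt") -> list[str]:
    """Return repository URLs, skipping blank lines and # comments."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.strip().startswith("#")]

print(load_repo_list())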

GitHub Pages Setup

  1. Go to repository Settings β†’ Pages
  2. Set Source to "GitHub Actions"
  3. The workflow will handle deployment automatically

πŸ”„ Automation

Workflow Triggers

  • πŸ“… Scheduled: Daily at 6 AM UTC
  • πŸ‘† Manual: Via GitHub Actions UI ("Run workflow" button)
  • πŸ”§ Automatic: On changes to:
    • tokenizer.py
    • .github/workflows/tokenizer.yml
    • repos.txt
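These triggers correspond to an on: block along the following lines in .github/workflows/tokenizer.yml (a sketch using standard GitHub Actions syntax, not the verbatim file):

on:
  schedule:
    - cron: "0 6 * * *"    # daily at 6 AM UTC
  workflow_dispatch:        # manual "Run workflow" button
  push:
    paths:
      - tokenizer.py
      - .github/workflows/tokenizer.yml
      - repos.txt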

Manual Execution

  1. Go to Actions tab in GitHub
  2. Select "Token Telemetry" workflow
  3. Click "Run workflow"
  4. Select branch and click "Run workflow"

πŸ“Š Output Format

JSON Structure

{
  "metadata": {
    "generated_at": "2025-06-17T14:00:00Z",
    "purpose": "Token analysis for AI context management",
    "usage_instructions": { /* AI guidance */ },
    "data_structure": { /* Format explanation */ }
  },
  "summary": {
    "total_repositories": 4,
    "total_files": 2123,
    "total_tokens": 4917390,
    "by_extension": {
      ".go": { "files": 1553, "tokens": 3904304 },
      ".md": { "files": 570, "tokens": 1013086 }
    }
  },
  "repositories": [
    {
      "directory": "celestia-core",
      "repository": {
        "name": "celestia-core",
        "url": "https://github.com/celestiaorg/celestia-core"
      },
      "total_files": 1179,
      "total_tokens": 3202991,
      "by_extension": { /* breakdown by file type */ },
      "files": [ /* individual file analysis */ ]
    }
  ]
}
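The by_extension maps make language-level ratios straightforward to compute (continuing from the Quick Start snippet above):

# Per-language share of the total token count
total = data['summary']['total_tokens']
for ext, stats in data['summary']['by_extension'].items():
    share = 100 * stats['tokens'] / total
    print(f"{ext}: {stats['files']} files, {stats['tokens']:,} tokens ({share:.1f}%)")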

πŸ” Monitoring

Workflow Status

Check workflow runs in the Actions tab to monitor:

  • Execution success/failure
  • Processing time
  • Token count changes over time

Error Handling

The system includes robust error handling:

  • Repository cloning failures are logged but don't stop other repositories
  • File encoding issues are skipped with warnings
  • Network timeouts are retried automatically
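The per-repository isolation can be pictured as a try/except around each repository's analysis (an illustrative sketch reusing the load_repo_list and shallow_clone helpers from earlier sketches; analyze_repo is hypothetical):

import logging

def analyze_repo(url: str) -> dict:
    # Hypothetical stand-in for the clone-and-count step
    return {"url": url, "path": shallow_clone(url)}

results = []
for url in load_repo_list():
    try:
        results.append(analyze_repo(url))
    except Exception as exc:
        # A failing repository is logged and skipped; the rest still run
        logging.warning("skipping %s: %s", url, exc)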

🀝 Contributing

Adding New File Types

Edit tokenizer.py to support additional file extensions:

# In count_tokens_in_file function
if extension not in ['.go', '.md', '.rs', '.py']:  # Add new extensions
    return 0, extension

Last Updated: Auto-generated daily at 6 AM UTC
API Endpoint: https://celestiaorg.github.io/tokenmetry/index.json
