Record, replay, and debug Kubernetes operator reconciliation loops with time-travel debugging
A production-grade tool for recording, replaying, and analyzing Kubernetes operator reconciliation loops. Helps debug operator behavior by capturing all API interactions and enabling time-travel debugging.
- Recording Mode: Transparently record all Kubernetes API operations
- Replay Mode: Step through recorded operations forward/backward
- Analysis Mode: Detect loops, slow operations, and error patterns
- Safety-Critical: Follows JPL Power of 10 coding rules
- Zero Dependencies: Self-contained SQLite storage
- Time Travel: Navigate through operation history
┌─────────────────────────────────────┐
│ 1. Recording Mode                   │
│    - Intercept all K8s API calls    │
│    - Record: events, state, timing  │
│    - Store in SQLite database       │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│ 2. Replay Mode                      │
│    - Mock K8s API server            │
│    - Feed recorded events           │
│    - Step through reconciliation    │
│    - Time travel (rewind/forward)   │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│ 3. Analysis Mode                    │
│    - Show state diff at each step   │
│    - Identify infinite loops        │
│    - Find race conditions           │
│    - Performance bottlenecks        │
└─────────────────────────────────────┘
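Conceptually, Recording Mode intercepts each API call, times it, and persists the outcome as one operation record before handing the result back to the operator. The sketch below only illustrates that idea; the type and function names are hypothetical, and the real implementation lives in pkg/recorder.
package sketch

import "time"

// operationRecord is a hypothetical, simplified view of what the recorder
// stores for every intercepted API call (the full schema appears below).
type operationRecord struct {
	OperationType string
	ResourceKind  string
	Namespace     string
	Name          string
	Error         string
	DurationMS    int64
}

// recordCall times an arbitrary API call, captures its error and duration,
// hands the record to a store function, and returns the call's own error.
func recordCall(store func(operationRecord) error, rec operationRecord, call func() error) error {
	start := time.Now()
	callErr := call()
	rec.DurationMS = time.Since(start).Milliseconds()
	if callErr != nil {
		rec.Error = callErr.Error()
	}
	if storeErr := store(rec); storeErr != nil {
		return storeErr
	}
	return callErr
}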
Prerequisites:
- Go 1.21 or later
- GCC (for SQLite CGO)
- Linux Mint or a similar Linux distribution
git clone https://github.com/your-org/k8s-operator-replay-debugger
cd k8s-operator-replay-debugger
# Install dependencies
go mod download
# Build the CLI
go build -o replay-cli ./cmd/replay-cli
# Run tests
go test ./...
Integrate the recording client into your operator:
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"

	"github.com/operator-replay-debugger/pkg/recorder"
	"github.com/operator-replay-debugger/pkg/storage"
)

func main() {
	// Create your normal Kubernetes client (in-cluster config shown here)
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	k8sClient := kubernetes.NewForConfigOrDie(config)

	// Open the recording database (path, maximum operations per session)
	db, err := storage.NewDatabase("recordings.db", 1000000)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Wrap the client with the recorder
	recordingClient, err := recorder.NewRecordingClient(recorder.Config{
		Client:      k8sClient,
		Database:    db,
		SessionID:   "prod-deployment-001",
		MaxSequence: 1000000,
	})
	if err != nil {
		panic(err)
	}

	// Use the recording client for operations; every call is captured
	pod, err := recordingClient.RecordGet(
		context.Background(),
		"Pod",
		"default",
		"my-pod",
		metav1.GetOptions{},
	)
	if err != nil {
		panic(err)
	}
	_ = pod
}
# List available sessions
./replay-cli sessions -d recordings.db
# Replay a session automatically
./replay-cli replay prod-deployment-001 -d recordings.db
# Interactive replay with step controls
./replay-cli replay prod-deployment-001 -d recordings.db -i
# Interactive commands:
# n - step forward
# b - step backward
# r - reset to beginning
# s - show statistics
# q - quit
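The interactive controls map onto a cursor over the recorded operation sequence. The following sketch illustrates forward/backward stepping conceptually; it is not the actual pkg/replay engine API, and the types are hypothetical.
package sketch

// recordedOp is a minimal stand-in for a stored operation.
type recordedOp struct {
	Sequence int64
	Kind     string
	Name     string
}

// cursor illustrates time-travel navigation over a recorded session.
type cursor struct {
	ops []recordedOp // operations ordered by sequence number
	pos int          // index of the next operation to apply
}

// StepForward applies the next operation, if any, and advances the cursor.
func (c *cursor) StepForward() (recordedOp, bool) {
	if c.pos >= len(c.ops) {
		return recordedOp{}, false
	}
	op := c.ops[c.pos]
	c.pos++
	return op, true
}

// StepBackward rewinds one operation, if possible.
func (c *cursor) StepBackward() (recordedOp, bool) {
	if c.pos == 0 {
		return recordedOp{}, false
	}
	c.pos--
	return c.ops[c.pos], true
}

// Reset returns to the beginning of the session.
func (c *cursor) Reset() { c.pos = 0 }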
# Detect loops, slow operations, and errors
./replay-cli analyze prod-deployment-001 -d recordings.db
# Only detect loops
./replay-cli analyze prod-deployment-001 -d recordings.db --loops --no-slow --no-errors
# Custom thresholds
./replay-cli analyze prod-deployment-001 -d recordings.db \
--threshold 2000 \
--window 15
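Loop detection scans a sliding window of recent operations for objects that are reconciled suspiciously often. The sketch below shows one plausible windowed counter; it is illustrative only and is not the implementation in pkg/analysis.
package sketch

// detectLoops flags object keys (e.g. "Pod/default/my-pod") that repeat more
// than maxRepeats times within any window of consecutive operations.
func detectLoops(keys []string, window, maxRepeats int) []string {
	var suspects []string
	flagged := map[string]bool{}
	for start := 0; start+window <= len(keys); start++ { // bounded outer loop
		counts := map[string]int{}
		for _, k := range keys[start : start+window] { // bounded by window size
			counts[k]++
			if counts[k] > maxRepeats && !flagged[k] {
				flagged[k] = true
				suspects = append(suspects, k)
			}
		}
	}
	return suspects
}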
Recorded operations are stored in a single SQLite table:
CREATE TABLE operations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    sequence_number INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    operation_type TEXT NOT NULL,
    resource_kind TEXT NOT NULL,
    namespace TEXT,
    name TEXT,
    resource_data TEXT,
    error TEXT,
    duration_ms INTEGER NOT NULL
);
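Each recorded operation corresponds to one row of this table. A struct along the following lines would mirror the columns; the field names here are an assumption, and pkg/storage/types.go holds the authoritative definition.
package sketch

// Operation mirrors one row of the operations table (assumed field names).
type Operation struct {
	ID             int64  // id
	SessionID      string // session_id
	SequenceNumber int64  // sequence_number
	Timestamp      int64  // timestamp (integer epoch value)
	OperationType  string // operation_type (GET, LIST, UPDATE, ...)
	ResourceKind   string // resource_kind (Pod, Deployment, ...)
	Namespace      string // namespace (optional)
	Name           string // name (optional)
	ResourceData   string // resource_data (serialized object, at most 1MB)
	Error          string // error (empty on success)
	DurationMS     int64  // duration_ms
}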
This project follows the JPL Power of 10 rules for safety-critical code; a short illustration appears after the list:
- No recursion - All algorithms use iteration
- Bounded loops - All loops have explicit upper bounds
- No dynamic allocation after init - Memory allocated during setup only
- Functions under 60 lines - Each function is a logical unit
- Minimum 2 assertions per function - Defensive programming
- Minimal scope - Variables declared at smallest scope
- Return values checked - All errors propagated
- Limited preprocessor - No token pasting or recursion
- Single-level pointers - No multiple indirection
- Zero warnings - Compiles cleanly with all warnings enabled
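For example, a function written in this style keeps every loop explicitly bounded and asserts its preconditions. The snippet below is only a stylistic illustration; the assertion helper stands in for whatever internal/assert actually provides.
package sketch

import "fmt"

// assertTrue is a stand-in for the helpers in internal/assert.
func assertTrue(cond bool, msg string) {
	if !cond {
		panic(fmt.Sprintf("assertion failed: %s", msg))
	}
}

const maxOps = 1_000_000 // explicit upper bound on all iteration

// sumDurations adds up operation durations without recursion or unbounded loops.
func sumDurations(durationsMS []int64) int64 {
	assertTrue(durationsMS != nil, "durations must not be nil")   // assertion 1
	assertTrue(len(durationsMS) <= maxOps, "too many operations") // assertion 2

	var total int64
	for i := 0; i < len(durationsMS) && i < maxOps; i++ { // bounded loop
		total += durationsMS[i]
	}
	return total
}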
# Run all tests
go test ./...
# Run with verbose output
go test -v ./...
# Run specific package tests
go test -v ./pkg/storage
go test -v ./pkg/replay
go test -v ./pkg/analysis
# Run with race detector
go test -race ./...
# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
k8s-operator-replay-debugger/
├── cmd/
│ └── replay-cli/
│ ├── main.go # CLI entry point
│ └── commands/ # Subcommands
│ ├── replay.go # Replay operations
│ ├── analyze.go # Analysis tools
│ └── record.go # Recording info
├── pkg/
│ ├── recorder/ # Recording client
│ │ └── client.go
│ ├── replay/ # Replay engine
│ │ └── engine.go
│ ├── storage/ # Database layer
│ │ ├── types.go
│ │ └── database.go
│ └── analysis/ # Analysis tools
│ └── analyzer.go
├── internal/
│ └── assert/ # Assertion utilities
│ └── assert.go
├── go.mod
├── go.sum
├── README.md
└── ARCHITECTURE.md
# Database path
export REPLAY_DB_PATH="recordings.db"
# Maximum operations per session
export REPLAY_MAX_OPS=1000000
# Slow operation threshold (ms)
export REPLAY_SLOW_THRESHOLD=1000
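The sketch below shows one way these variables could be read with sensible defaults. It is an assumption about how configuration loading might look, not necessarily how the CLI implements it.
package sketch

import (
	"os"
	"strconv"
)

// replayConfig holds settings that can be overridden from the environment.
type replayConfig struct {
	DBPath        string
	MaxOps        int
	SlowThreshold int // milliseconds
}

// loadConfig reads the REPLAY_* variables, falling back to defaults.
func loadConfig() replayConfig {
	cfg := replayConfig{DBPath: "recordings.db", MaxOps: 1000000, SlowThreshold: 1000}
	if v := os.Getenv("REPLAY_DB_PATH"); v != "" {
		cfg.DBPath = v
	}
	if v, err := strconv.Atoi(os.Getenv("REPLAY_MAX_OPS")); err == nil && v > 0 {
		cfg.MaxOps = v
	}
	if v, err := strconv.Atoi(os.Getenv("REPLAY_SLOW_THRESHOLD")); err == nil && v > 0 {
		cfg.SlowThreshold = v
	}
	return cfg
}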
Record operations in production, then replay locally to investigate:
# In production
recordingClient.Enable()
# ... issue occurs ...
recordingClient.Disable()
# Copy recordings.db to local machine
# Replay locally
./replay-cli replay prod-issue-123 -i
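One way to toggle recording in a live operator without redeploying is to wire the Enable and Disable calls shown above to Unix signals. This is a sketch under that assumption; the signal choice is arbitrary and the toggler interface is hypothetical.
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// toggler is the minimal interface the recording client needs to expose.
type toggler interface {
	Enable()
	Disable()
}

// toggleRecordingOnSignal enables recording on SIGUSR1 and disables it on
// SIGUSR2, so capture can be limited to the window where an issue occurs.
func toggleRecordingOnSignal(rc toggler) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGUSR2)
	go func() {
		for sig := range sigs {
			if sig == syscall.SIGUSR1 {
				rc.Enable()
			} else {
				rc.Disable()
			}
		}
	}()
}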
Find slow operations causing bottlenecks:
./replay-cli analyze session-001 --slow --threshold 500
Identify infinite reconciliation loops:
./replay-cli analyze session-001 --loops --window 10
Understand error frequency and types:
./replay-cli analyze session-001 --errors
Known limitations:
- SQLite-based storage (single file, not distributed)
- Maximum 1M operations per session by default
- No real-time streaming (batch recording)
- Resource data limited to 1MB per operation
- Requires CGO for SQLite (not pure Go)
Contributions must follow the safety-critical coding standards:
- All functions under 60 lines
- Minimum 2 assertions per function
- All loops explicitly bounded
- No recursion
- Zero compiler warnings
- Tests for all new functionality
MIT License - See LICENSE file for details