**`.claude/skills/concurrency-check/SKILL.md`** (new file, +60 lines)

---
name: concurrency-check
description: Review Hudi multi-writer and locking config. Use when dealing with write conflicts, lock timeouts, OCC, or concurrent writers.
user-invocable: true
allowed-tools: Read, Grep, Glob, Agent
argument-hint: [scenario e.g. "multi-writer setup", "lock timeout", "write conflict", "OCC"]
---

# Hudi Concurrency Control Check

Scenario: **$ARGUMENTS**

## Instructions

### Key source files:
- Lock config: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java`
- Write config concurrency: search for `WRITE_CONCURRENCY_MODE` in `HoodieWriteConfig.java`
- Lock providers: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/`
- Conflict resolution: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/`
- Early conflict detection: search for `EarlyConflictDetection` classes

### Concurrency modes:
1. **SINGLE_WRITER** (default) - No locking needed, single writer assumed
2. **OPTIMISTIC_CONCURRENCY_CONTROL** - Multiple writers, OCC with conflict detection at commit time

### For multi-writer setup review:
1. Check these configs are set correctly:
- `hoodie.write.concurrency.mode=optimistic_concurrency_control`
- `hoodie.write.lock.provider` - Must be set (not filesystem-based for production)
- Lock timeout and retry configs
2. Recommended lock providers by environment:
- **ZooKeeper**: `org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider`
- **DynamoDB**: `org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider`
- **HiveMetastore**: `org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider`
- **Storage-based**: Only for testing, NOT production multi-writer
3. Early conflict detection: `hoodie.write.concurrency.early.conflict.detection.enable`
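Putting the pieces together, a multi-writer OCC setup might look like the sketch below (ZooKeeper provider; the hostnames, lock key, and base path are illustrative placeholders, not defaults):

```properties
# Enable OCC - required for any multi-writer deployment
hoodie.write.concurrency.mode=optimistic_concurrency_control
# A real lock provider; filesystem-based locking is not safe for production
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk1,zk2,zk3
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=my_table
hoodie.write.lock.zookeeper.base_path=/hudi/locks
# Surface conflicts during the write instead of at commit time
hoodie.write.concurrency.early.conflict.detection.enable=true
```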

### For write conflict debugging:
1. Check the `HoodieWriteConflictException` stack trace
2. Identify which file groups are in conflict
3. Check if both writers were modifying the same partition
4. Review conflict resolution strategy: `hoodie.write.concurrency.conflict.resolution.strategy`

### For lock timeout issues:
1. Check `hoodie.write.lock.wait_time_ms` (default varies by provider)
2. Check `hoodie.write.lock.num_retries`
3. Check if long-running table services hold locks
4. Heartbeat config: `hoodie.write.lock.heartbeat_interval_ms`
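A tuning sketch for lock-timeout symptoms - the values are illustrative, not Hudi defaults; size the wait time to cover your longest-running table service:

```properties
# Illustrative lock tuning for timeout symptoms
hoodie.write.lock.wait_time_ms=120000
hoodie.write.lock.num_retries=10
hoodie.write.lock.client.wait_time_ms=10000
hoodie.write.lock.heartbeat_interval_ms=60000
```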

### Critical rules from code review guidelines:
- Lock must be held when requesting AND completing timeline actions
- Timeline must be refreshed inside lock's critical section
- Locks must be released properly (check finally blocks)
- Non-atomic check-then-act on distributed storage is a bug

### Output:
1. **Current config assessment** - what's configured and is it correct
2. **Issues found** - misconfigurations or risks
3. **Recommended config** - complete config block for the scenario
4. **Testing approach** - how to validate multi-writer behavior
**`.claude/skills/debug-table-service/SKILL.md`** (new file, +71 lines)

---
name: debug-table-service
description: Debug compaction, clustering, or cleaning issues. Use when table services are stuck, slow, failing, or producing bad results.
user-invocable: true
allowed-tools: Read, Grep, Glob, Bash, Agent
argument-hint: [symptom e.g. "compaction stuck", "clustering OOM", "too many small files after clustering"]
---

# Debug Hudi Table Service

Problem: **$ARGUMENTS**

## Instructions

Identify which table service is involved (compaction, clustering, or cleaning) and debug it.

### Key source locations:
- Compaction: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/`
- Clustering: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/`
- Cleaning: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/`
- Configs: `HoodieCompactionConfig.java`, `HoodieClusteringConfig.java`, `HoodieCleanConfig.java`

### Diagnostic Steps

#### 1. Check service state
```sql
CALL show_compaction(path => '<path>', limit => 20);
CALL show_clustering(path => '<path>', limit => 20);
CALL show_cleans(path => '<path>', limit => 10);
CALL show_timeline(path => '<path>', limit => 50);
```

#### 2. Common compaction issues (MoR only)
- **Not scheduling**: Check `hoodie.compact.inline` and `hoodie.compact.inline.max.delta.commits` (default: 5)
- **Stuck INFLIGHT**: Writer crashed. Roll back: `CALL rollback_to_instant(table => '<name>', instant_time => '<time>');`
- **OOM**: Large log files. Tune `hoodie.memory.merge.max.size`, consider `BoundedIOCompactionStrategy`
- **Too many pending**: Writes outpacing compaction. Increase parallelism or move to async
- **Small files after**: Check `hoodie.parquet.max.file.size` (default: 120MB)
- **Log compaction**: `hoodie.compact.inline.log.compact` for intermediate log merging
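When recommending a fix, a common MoR pattern is to schedule compaction inline but execute it asynchronously; a sketch (the memory value is illustrative):

```properties
# Illustrative MoR compaction setup: schedule inline, execute async
hoodie.compact.inline=false
hoodie.compact.schedule.inline=true
hoodie.compact.inline.max.delta.commits=5
# Cap memory used when merging log records during compaction
hoodie.memory.merge.max.size=1073741824
```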

#### 3. Common clustering issues
- **Not scheduling**: Check `hoodie.clustering.inline` / `hoodie.clustering.async.enabled` and trigger threshold
- **Not improving queries**: Verify `hoodie.clustering.plan.strategy.sort.columns` aligns with query filters
- **OOM**: Reduce `hoodie.clustering.plan.strategy.max.bytes.per.group` and `hoodie.clustering.plan.strategy.max.num.groups`
- **Conflicts**: Clustering creates REPLACE_COMMIT; check concurrent write conflicts
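A clustering config sketch to anchor recommendations (the sort columns `event_date,user_id` are hypothetical - substitute the table's actual query-filter columns):

```properties
# Illustrative async clustering setup
hoodie.clustering.async.enabled=true
hoodie.clustering.plan.strategy.sort.columns=event_date,user_id
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
hoodie.clustering.plan.strategy.max.num.groups=30
```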

#### 4. Common cleaning issues
- **Not running**: Check `hoodie.clean.automatic` (default: true)
- **Not freeing space**: Old file versions retained by `hoodie.cleaner.commits.retained` (default: 10)
- **Cleaning blocked**: Pending compaction/clustering blocks cleaning of involved file groups
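A cleaning config sketch for recommendations - retention must cover the longest incremental or streaming reader, otherwise those readers can fail on missing files:

```properties
# Illustrative cleaning setup
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.cleaner.commits.retained=10
```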

#### 5. Filesystem-level checks (no Spark needed, adapt for gs://, az://, or hdfs://)
```bash
# Count pending compactions/clustering
aws s3 ls s3://my-bucket/my_table/.hoodie/ | grep -c '.compaction.requested'
aws s3 ls s3://my-bucket/my_table/.hoodie/ | grep -c '.compaction.inflight'
aws s3 ls s3://my-bucket/my_table/.hoodie/ | grep -c '.replacecommit.requested'

# Check log file sizes in a partition (MoR)
aws s3 ls s3://my-bucket/my_table/<partition>/ | grep '\.log'

# Check heartbeat for stuck writers
aws s3 ls s3://my-bucket/my_table/.hoodie/.heartbeat/
```

### Output
1. **Root cause** with evidence
2. **Fix** — specific commands with safety markers (`[SAFE]`/`[CAUTION]`/`[DANGEROUS]`)
3. **Verification** — commands to confirm the fix worked
4. **Prevention** — configs to avoid recurrence
5. **Escalation** — flag if issue indicates deeper problems (data corruption, version bugs)
**`.claude/skills/diagnose-table/SKILL.md`** (new file, +107 lines)

---
name: diagnose-table
description: Health check a Hudi table. Use when a table is slow, has stuck operations, small files, or needs a diagnostic.
user-invocable: true
disable-model-invocation: true
allowed-tools: Read, Grep, Glob, Bash, Agent
argument-hint: [table-path or symptoms e.g. "/data/my_table" or "compaction stuck"]
---

# Diagnose Hudi Table

User's input: **$ARGUMENTS**

## Instructions

You are a Hudi production operations expert. Help diagnose the table's health.

### If a table path is provided:
Run these diagnostic checks using Spark SQL CALL procedures or direct filesystem inspection:

#### 1. Timeline Health
```sql
-- Show recent commits
CALL show_commits(path => '<table_path>', limit => 20);
-- Show timeline with all action types
CALL show_timeline(path => '<table_path>', limit => 50);
```
Look for:
- Gaps in commit times (indicates failed writes)
- INFLIGHT instants that never completed (stuck operations)
- REQUESTED compactions/clusterings that were never executed
- Ratio of delta_commits to compactions (for MoR tables)

#### 2. Pending Operations
```sql
-- Pending compactions
CALL show_compaction(path => '<table_path>', limit => 50);
-- Pending clustering
CALL show_clustering(path => '<table_path>', limit => 50);
```
Flag: more than ~5 pending compactions usually means compaction is falling behind writes.

#### 3. File System Health
```sql
-- File sizes per partition
CALL stats_file_sizes(table => '<table_name>');
-- Write amplification
CALL stats_wa(table => '<table_name>');
-- Invalid parquet files
CALL show_invalid_parquet(path => '<table_path>');
```

#### 4. Metadata Table Health
```sql
-- Metadata table stats
CALL show_metadata_table_stats(table => '<table_name>');
-- Validate metadata consistency
CALL validate_metadata_table_files(table => '<table_name>');
```

#### 5. Clean & Archive Status
```sql
CALL show_cleans(path => '<table_path>', limit => 10);
```
Check: Is cleaning keeping up? Are old file versions accumulating?

#### 6. Filesystem-level checks (when Spark SQL is unavailable, adapt for gs://, az://, or hdfs://)
```bash
# Table properties
aws s3 cp s3://my-bucket/my_table/.hoodie/hoodie.properties -

# Recent timeline instants
aws s3 ls s3://my-bucket/my_table/.hoodie/ | grep -E '\.(commit|deltacommit|compaction|clean)' | sort | tail -20

# Stuck INFLIGHT operations
aws s3 ls s3://my-bucket/my_table/.hoodie/ | grep '.inflight'

# Heartbeat files (stuck writers)
aws s3 ls s3://my-bucket/my_table/.hoodie/.heartbeat/

# Marker files (incomplete writes)
aws s3 ls s3://my-bucket/my_table/.hoodie/.temp/

# Timeline directory size
aws s3 ls --summarize --recursive s3://my-bucket/my_table/.hoodie/ | tail -2
```
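The inflight check above can be scripted; a minimal sketch that counts INFLIGHT instants from any `ls`-style listing where the object name is the last column (the sample file names are hypothetical):

```shell
# Count stuck INFLIGHT instants from a timeline listing piped on stdin
count_inflight() {
  awk '{print $NF}' | grep -c 'inflight$'
}

# Illustrative `aws s3 ls` output
printf '%s\n' \
  '2024-05-01 10:00:00   1024 20240501100000123.commit' \
  '2024-05-01 10:05:00      0 20240501100500456.inflight' \
  | count_inflight
```

Changing the grep pattern (e.g. to `compaction.requested$`) gives the pending-compaction count the same way.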

### If symptoms are described:
Map symptoms to likely causes:

| Symptom | Likely Cause | Check | Urgency |
|---------|-------------|-------|---------|
| Slow reads | Too many small files, missing compaction | File sizes, pending compactions | P2 |
| Slow writes | Lock contention, too many inline services | Lock config, inline service configs | P1 |
| OOM during compaction | Large log files, wrong memory config | Log file sizes, `hoodie.memory.merge.max.size` | P1 |
| Stuck INFLIGHT | Writer crashed mid-operation | Heartbeat files, rollback needed | P1 |
| Growing .hoodie dir | Archival not keeping up | Archive config, `hoodie.keep.max.commits` | P3 |
| Query returns stale data | Sync lag, metadata stale | Metadata table health, sync status | P2 |
| Missing records | Rollback/failed commit/schema issue | show_rollbacks, schema history | P1 |
| Spark job OOM | Record/file sizes, merge memory | Executor memory, log file sizes | P1 |

### Output format:
1. **Table State Summary** - Key metrics at a glance
2. **Issues Found** - Ordered by severity (P1 critical → P3 low)
3. **Recommended Actions** - Specific commands with safety markers: `[SAFE]` read-only, `[CAUTION]` mutating, `[DANGEROUS]` potential data loss
4. **Verification** - Commands to confirm the fix worked
5. **Preventive Configs** - Settings to prevent recurrence
**`.claude/skills/explain-hudi/SKILL.md`** (new file, +49 lines)

---
name: explain-hudi
description: Explain any Hudi concept, architecture, or CoW vs MoR tradeoffs from source code. Use when asking "what is", "how does", "explain", or "compare CoW MoR".
user-invocable: true
allowed-tools: Read, Grep, Glob, Agent
argument-hint: [concept e.g. "compaction", "file groups", "metadata table", "record-level index"]
---

# Explain Hudi Concept

The user wants to understand a Hudi concept: **$ARGUMENTS**

## Instructions

You are an Apache Hudi expert. Explain the concept by actually reading the relevant source code in this repository, not from memory. Ground every explanation in actual code references.

### Step 1: Identify the relevant code
Search the codebase for classes, interfaces, and configs related to the concept. Key locations:
- Core abstractions: `hudi-common/src/main/java/org/apache/hudi/common/`
- Client/write path: `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/`
- Spark integration: `hudi-spark-datasource/`
- Configs: classes ending in `Config.java` under `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/` and `hudi-common/src/main/java/org/apache/hudi/common/config/`

### For architecture questions, also cover:
- Module layout: `hudi-io/` (IO), `hudi-common/` (core), `hudi-client/` (write), `hudi-spark-datasource/` (Spark), `hudi-flink-datasource/` (Flink), `hudi-utilities/` (tools), `hudi-sync/` (metastore sync), `hudi-hadoop-common/` (storage), `packaging/` (bundles)
- Key abstractions and interfaces, extension points, design patterns
- Layering rules: `hudi-common` cannot depend on engine-specific code

### For CoW vs MoR comparisons, ground in:
- CoW commit: `BaseCommitActionExecutor` (rewrites full Parquet files)
- MoR append: `HoodieAppendHandle` (appends to log files)
- File group reader: `HoodieFileGroupReader.java` (merges base + logs)
- Table type enum: `HoodieTableType.java`

### Step 2: Build the explanation
Structure your explanation as:
1. **What it is** - 2-3 sentence definition
2. **Why it exists** - The problem it solves
3. **How it works** - Walk through the key classes and methods with file:line references
4. **Key configs** - List the most important configuration knobs (config key, default, what it controls)
5. **CoW vs MoR** - How behavior differs between Copy-on-Write and Merge-on-Read tables (if applicable)
6. **Common gotchas** - Things that trip up production users

### Step 3: Provide actionable examples
- Show relevant Spark SQL or DataFrame API usage
- Show relevant CALL procedures if they exist (check `hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/`)
- Show relevant CLI commands if they exist (check `hudi-cli/src/main/java/org/apache/hudi/cli/commands/`)

Keep the tone practical and production-focused. Avoid academic explanations - focus on what a production user needs to know.
**`.claude/skills/file-sizing/SKILL.md`** (new file, +61 lines)

---
name: file-sizing
description: Fix Hudi file sizing and write performance. Use when dealing with small files, large files, write amplification, or file size optimization.
user-invocable: true
allowed-tools: Read, Grep, Glob, Agent
argument-hint: [scenario e.g. "too many small files", "write amplification high", "large log files"]
---

# Hudi File Sizing & Performance Tuning

Scenario: **$ARGUMENTS**

## Instructions

### Key concepts (read source code for details):
- **File group**: A set of files (1 base + N log files) sharing a file ID in a partition
- **File slice**: A specific version of a file group at a point in time
- **Small file**: Base file smaller than `hoodie.parquet.small.file.limit` (default: 104857600 = 100MB)
- **Target file size**: `hoodie.parquet.max.file.size` (default: 125829120 = 120MB)
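The small-file threshold can be checked mechanically. A minimal sketch using the default byte values above (a real check would feed in sizes from `aws s3 ls` or the file-size stats procedure):

```shell
# Classify a base file size against Hudi's default small-file threshold
SMALL_FILE_LIMIT=104857600   # hoodie.parquet.small.file.limit (100 MB)

classify_base_file() {
  if [ "$1" -lt "$SMALL_FILE_LIMIT" ]; then
    echo "small - future inserts will be routed here"
  else
    echo "ok"
  fi
}

classify_base_file 5242880     # 5 MB base file
classify_base_file 125829120   # 120 MB base file
```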

### File sizing configs to analyze:
Read `HoodieStorageConfig.java` and `HoodieWriteConfig.java` for these:
- `hoodie.parquet.max.file.size` - Target base file size
- `hoodie.parquet.small.file.limit` - Below this, file is "small" and gets more inserts routed to it
- `hoodie.record.size.estimation.threshold` - Min records to estimate average record size
- `hoodie.copy.on.write.record.size.estimate` - Fallback record size estimate
- `hoodie.logfile.max.size` - Max log file size before rolling (default: 1GB)
- `hoodie.logfile.data.block.max.size` - Max data block size in log file

### For small file problems:
1. **Diagnosis**: Check file size distribution with `CALL stats_file_sizes(table => '<name>');`
2. **Root causes**:
- Too many partitions relative to data volume (partition explosion)
- Writers with too much parallelism creating many small files
- Frequent small batch writes
- Compaction producing small files when data is delete-heavy
3. **Solutions by table type**:
- **CoW**: Hudi's small file handling routes inserts to existing small files. Tune `small.file.limit` and `max.file.size`
- **MoR**: Small base files grow via log files. Compaction creates properly-sized new base files
4. **Clustering as fix**: After-the-fact reorganization to merge small files
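Clustering can be triggered on demand via the Spark procedure; a sketch (the table name is illustrative, and this is a mutating operation that rewrites data files - schedule it off-peak):

```sql
-- Mutating: merges small files into properly sized, optionally sorted ones
CALL run_clustering(table => 'my_table');
```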

### For write amplification:
1. Check with `CALL stats_wa(table => '<name>');`
2. Main cause: CoW tables with frequent updates - every update rewrites the entire base file
3. Mitigations:
   - Switch to MoR for update-heavy workloads
   - Reduce `hoodie.parquet.max.file.size` to cap the cost of each rewrite
4. MoR write amplification comes mainly from compaction - tune compaction frequency

### For large log files:
1. Too many delta commits between compactions
2. Reduce `hoodie.compact.inline.max.delta.commits`
3. Consider log compaction as intermediate step
4. Increase compaction throughput (more resources)

### Output:
1. **Current state assessment**
2. **Root cause** of the sizing issue
3. **Recommended config changes** with specific values
4. **Expected improvement** and tradeoffs
5. **Monitoring** - how to verify the fix is working