
Merge master into recursion-v2 (bump-p3 + conflict resolution)#1361

Open
kunxian-xia wants to merge 30 commits into feat/recursion-v2 from feat/recursion-v2-merge-resolve

Conversation

@kunxian-xia

Collaborator

Summary

  • Merge origin/master into feat/recursion-v2, resolving all merge conflicts
  • Pin gkr-backend to feat/bump-p3 branch, bump p3-field to 0.4.3
  • Apply p3 API renames: from_canonical_* → from_*, FieldAlgebra → PrimeCharacteristicRing
  • Take HEAD (recursion-v2) versions for prover/verifier architecture changes

Note

Build has remaining structural API mismatches (from_canonical_* in non-conflicted files, missing struct fields) that need follow-up work.

🤖 Generated with Claude Code

dreamATD and others added 30 commits April 9, 2026 04:54
# Description
`n_challenges` on `Layer`, `Chip`, and `GKRCircuit` was always 0 — every
layout struct initialized it to 0 and never assigned any other value.
The design intended per-layer challenge sampling, but it was never
actually used.

This PR removes the field and all its plumbing:
- `Layer.n_challenges` field and the `update_challenges` method it drove
- `n_challenges` parameter from `Layer::new`,
`Layer::from_circuit_builder`, `LayerConstraintSystem::into_layer`, and
`LayerConstraintSystem::into_layer_with_lookup_eval_iter`
- `Chip.n_challenges` and the corresponding `Chip::new_from_cb`
parameter
- `GKRCircuit.n_challenges`
- `ProtocolBuilder::n_challenges()` trait method
- `generate_layer_challenges` in the recursion verifier (replaced by
passing `challenges` directly)

No behavior change — `sample_and_append_challenge_pows(0, ...)` was a
no-op, and `update_challenges` with n=0 neither sampled nor wrote
anything meaningful.

---------

Co-authored-by: sphere <sphere@scroll.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

This PR simplifies the chip `finalize` flow by moving layer construction
into each chip/layout's `finalize` implementation, so callers no longer
need to manually compute `out_evals`, build a `Layer`, and call
`chip.add_layer(...)`.

## What changed

- Added a shared `default_out_eval_groups(...)` helper in `gkr_iop`.
- Updated `finalize(...)` implementations to directly assemble and
return a fully built `Chip`.
- Removed duplicated caller-side boilerplate in precompile, instruction,
and table builders.

## Why

This makes the finalize contract simpler and more consistent: callers
ask for a finalized chip and receive a ready-to-use one, instead of
partially constructing it in multiple places.
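
A minimal sketch of the contract change, using hypothetical stand-in types: `Layer`, `Chip`, `Layout`, and the signature of `default_out_eval_groups` below are illustrative assumptions, not the real `gkr_iop` definitions. It only shows the shape of the idea, namely that `finalize` assembles the layer itself and hands back a ready-to-use chip.

```rust
// Hypothetical stand-ins; the real `gkr_iop` types carry much more state.
#[derive(Debug, Clone, PartialEq)]
pub struct Layer {
    pub name: String,
    pub out_eval_groups: Vec<Vec<usize>>,
}

#[derive(Debug, Default)]
pub struct Chip {
    pub layers: Vec<Layer>,
}

/// Illustrative helper (assumed signature): one output-eval group per
/// non-empty expression kind, numbering evaluations consecutively.
pub fn default_out_eval_groups(n_reads: usize, n_writes: usize, n_lookups: usize) -> Vec<Vec<usize>> {
    let mut next = 0;
    [n_reads, n_writes, n_lookups]
        .into_iter()
        .filter(|&n| n > 0)
        .map(|n| {
            let group: Vec<usize> = (next..next + n).collect();
            next += n;
            group
        })
        .collect()
}

/// Hypothetical layout describing how many expressions of each kind exist.
pub struct Layout {
    pub name: String,
    pub n_reads: usize,
    pub n_writes: usize,
    pub n_lookups: usize,
}

impl Layout {
    /// `finalize` builds the layer internally, so callers no longer compute
    /// `out_evals`, construct a `Layer`, or call `add_layer` themselves.
    pub fn finalize(self) -> Chip {
        let out_eval_groups = default_out_eval_groups(self.n_reads, self.n_writes, self.n_lookups);
        Chip {
            layers: vec![Layer { name: self.name, out_eval_groups }],
        }
    }
}
```

With this shape, a caller only writes `layout.finalize()` and receives a chip whose layer is already attached.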

## Validation

- `cargo check`
- `cargo check --all-targets`

---------

Co-authored-by: sphere <sphere@scroll.io>
- add copilot review prompt
- only whitelist ceno-book deploy when `docs/**` changes and after merge
#1299)

A prerequisite PR before batching the main sumcheck across all chips.
This PR refactors rotation and unifies the gkr-circuit main-sumcheck flow.


### What changed (high level)
This PR reshapes the proving/verification pipeline so **rotation
constraints are handled at chip level** instead of being embedded in
each GKR layer proof.
Core effect: rotation logic is now a first-class chip-proof component
(`rotation_proof`) and selector/eval wiring is unified across CPU, GPU,
native verifier, and recursion verifier.

Changed areas are mainly:
- `ceno_zkvm/src/scheme/*` (prover/verifier flow and proof struct)
- `gkr_iop/src/gkr/layer/*` (layer construction, selector grouping,
zerocheck/sumcheck behavior)
- `ceno_recursion/src/zkvm_verifier/*` (recursive proof
input/verification alignment)

---

### Major proving-flow changes

#### 1) Rotation proof moved out of per-layer proof and into chip proof
Previously, `LayerProof` carried optional rotation data at layer scope.
Now, rotation proof is stored in `ZKVMChipProof.rotation_proof` and
handled once at chip level.

Implications:
- Per-layer proof payload is simplified (`LayerProof` now focuses on
main sumcheck path).
- Rotation presence/shape checks become explicit at chip verification
boundary.
- Recursion binding mirrors this with `has_rotation_proof` +
`rotation_proof` in chip-level input.
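
A sketch of what an explicit presence/shape check at the chip boundary could look like. The struct and error names here are hypothetical simplifications (the real `ZKVMChipProof` and `LayerProof` hold far more data); only the `rotation_proof` field name comes from the description above.

```rust
/// Hypothetical error type for shape mismatches.
#[derive(Debug, PartialEq, Eq)]
pub enum ShapeError {
    MissingRotationProof,
    UnexpectedRotationProof,
}

/// Simplified stand-ins for the proof structs.
pub struct RotationProof {
    pub evals: Vec<u64>,
}

pub struct LayerProof {
    pub main_sumcheck_evals: Vec<u64>, // no rotation data at layer scope
}

pub struct ChipProof {
    pub layer_proofs: Vec<LayerProof>,
    pub rotation_proof: Option<RotationProof>, // carried once, at chip level
}

/// Reject a proof whose rotation payload does not match what the chip declares.
pub fn check_rotation_shape(chip_has_rotation: bool, proof: &ChipProof) -> Result<(), ShapeError> {
    match (chip_has_rotation, proof.rotation_proof.is_some()) {
        (true, false) => Err(ShapeError::MissingRotationProof),
        (false, true) => Err(ShapeError::UnexpectedRotationProof),
        _ => Ok(()),
    }
}
```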

---

#### 2) Main constraints and rotation constraints are no longer split
into separate in-layer proving phases
Before, zerocheck flow effectively had a special rotation handling path
and then main constraints.
Now, main sumcheck flow is unified around `out_sel_and_eval_exprs`,
while rotation is produced/checked as a dedicated chip-level proof and
then mapped back through selector groups.

Implications:
- Less bifurcation in prover logic.
- Fewer “special-case” transitions between rotation and non-rotation
constraints.
- Cleaner challenge/eval accounting and easier reasoning about claimed
openings.

---

#### 3) Selector context construction is unified using first-layer
selector groups
CPU/GPU prover and verifier paths now build selector contexts by
iterating first-layer selector groups (`out_sel_and_eval_exprs`) rather
than relying on branchy ad-hoc logic.

Implications:
- Better CPU/GPU parity.
- Consistent selector semantics (`r_selector`, `w_selector`,
lookup/zero/whole selectors).
- Reduced risk of backend-specific drift in selector evaluation
behavior.

---

#### 4) Rotation selector groups become explicit and dedicated
In `gkr_iop` layer construction, rotation claims reserve dedicated
selector groups/opening slots (3-way grouping for
left/right/origin-style rotation openings).
A helper (`rotation_selector_group_indices`) is used to map rotation
claims to the right selector groups deterministically.

Implications:
- Unambiguous assignment of rotation claim evals to selector groups.
- Avoids accidental dedup/aliasing with ordinary selector groups.
- Makes verifier-side matching stricter and easier to audit.

---

### Design rationale

1. **Single source of truth for rotation handling**  
Rotation is conceptually chip-level logic that spans selector/eval
organization; storing it at chip level matches that abstraction better
than per-layer optional fields.

2. **Reduce proving-flow complexity and duplicated logic**  
Previous split paths (rotation-specific + main constraints) created
duplicated mechanics and increased maintenance burden. Unifying flow
lowers cognitive and code complexity.

---

### Net effect
This PR is a **proving architecture cleanup**:  
- rotation is elevated to chip-level proofing,
- selector/eval flow is unified across backends and verifiers,
- and proof shape invariants are tightened.
## Summary

> Goal: Add docs to explain our technical design in detail.

This should reduce the burden of understanding Ceno for both developers
and AI. These materials will also help design / review AI agents gain a
thorough understanding of our technology.

### Contents

topics that're covered:

1. architecture overview
   - [x] multi-chip architecture (frontend): each instance of each chip can
     span multiple rows, therefore the witness polynomials are $f_i(r, i)$,
     we allow
     - same-row constraint/gate $0 = C(f_1(r,i), \ldots, f_w(r,i))$
     - cross-row constraint/gate $0 = C(f_1(r,i), \ldots, f_w(r,i), f_1(r', i), \ldots, f_w(r', i))$.
   - [x] multi-shards workflow
2. optimizations
   - [ ] distributed sumcheck
3. appendix
   - [x] gkr protocol for tower tree
   - [ ] gkr protocol for logup
   - [x] local rotation piop
   - [x] ecc grand sum piop
   
| PIOP | Purpose | Sumcheck instances | Opening points per committed MLE |
|---|---|---|---|
| GKR for Grand Product | Grand product $\prod_i a_i$ of $N = 2^d$ inputs | $d - 1$ | Input MLE $a$ at a single point $z \in B_d$ |
| Local Rotation PIOP | Round-to-round state transition for round-based computations (e.g. Keccak-f) | $1$ | Each $f_j$ at three points $(\mathbf{s}_r, \mathbf{s}_i), (\mathbf{p}_0, \mathbf{s}_i), (\mathbf{p}_1, \mathbf{s}_i) \in B_m \times B_n$ |
| EC-Sum Quark PIOP | Sum $\sum_i P_i$ of EC points on a short-Weierstrass curve | $1$ | $x, y$ at $(\mathbf{r}, 0), (\mathbf{r}, 1), (1, \mathbf{r}) \in B_{n+1}$; $s$ at $(1, \mathbf{r})$ |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…PU OOM (#1316)

## Problem

PR #1299 changed GKR output/eval wiring, but the GPU proving flow still
treated `build_main_witness` as if only the old read/write/lookup
outputs existed. That caused two issues:
- stale scheduler / memcheck estimates after #1299
- unnecessary GPU witness materialization before tower proving, which
can push large Keccak payloads into OOM on 4090-class cards

## Design Rationale

Keep the proof shape and verifier unchanged, and fix this entirely in
prover-side staging:
- route first-layer GKR output groups with prover-only stage metadata
- materialize only tower-needed outputs before tower proving
- keep ECC / rotation self-contained in their existing submodules
- update GPU memory estimation to match the post-#1299 output topology

This reduces VRAM pressure during tower proving without changing proof
semantics.

## Change Highlights

- `ceno_zkvm`
  - add prover-only `GkrOutputStageMask` routing for first-layer output
    groups
  - build only tower-facing witness outputs before `prove_tower_relation`
  - keep ECC / rotation on their existing dedicated witness/eval paths
  - update GPU memory estimation for post-#1299 GKR outputs and
    tower-stage residency
  - update local precompile/test callsites to the new `gkr_witness` API
- `gkr_iop`
  - extend witness generation with filtered materialization APIs for
    CPU/GPU backends
  - keep filtering internal to prover execution; no verifier/proof-format
    changes

## Benchmark / Performance Impact

Primary intent is memory reduction, not throughput optimization.

### Operation
Benchmark command(s):

```sh
CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

Ceno reth

https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/24667766891

## Testing

```sh
cargo make clippy
cargo check -p ceno_zkvm --features gpu
CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

## Risks and Rollout

- Risk is limited to prover-side witness staging and GPU memory
estimation.
- Verifier behavior and proof format are intentionally unchanged.
- If this regresses proving, rollback can revert the new output-stage
routing and filtered witness materialization together.

## Follow-ups (optional)

- Add a targeted large-payload regression check for post-#1299 GPU
WITGEN memory peaks.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
## Summary

- `CLAUDE.md`: repo guide covering crate layout, toolchain, edit
priorities (soundness first, with verifier code including ceno_recursion
as the highest-scrutiny surface), and gotchas.
- `.github/pr-review-checklist.md`: canonical category-by-category
review checklist (transcript/Fiat–Shamir, sumcheck plumbing, PCS
openings, determinism, verifier robustness, feature parity,
recursion/native-verifier parity, scope).
- `.github/copilot-instructions.md`: surface the verifier-vs-prover
asymmetry and point to the new shared checklist instead of duplicating
it inline.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
… group construction (#1310)

This update addresses reviewer feedback to explicitly document the
concrete expression proved in the main sumcheck phase. The code now
spells out the selector-group RLC form at the exact construction/proving
sites (CPU/GPU + layer assembly), states the mathematical form of the
smaller sumchecks batched by main sumcheck (including zero-target
zerochecks), and adds formula-level documentation in
`Layer::from_circuit_builder` for how output evaluation groups are
assembled.

- **Main-sumcheck expression clarity**
  - Added precise inline comments describing the polynomial shape:
    - per-group term construction in `zerocheck_layer`
    - main-sumcheck entry points in CPU and GPU provers
  - Files updated:
    - `gkr_iop/src/gkr/layer/zerocheck_layer.rs`
    - `gkr_iop/src/gkr/layer/cpu/mod.rs`
    - `gkr_iop/src/gkr/layer/gpu/mod.rs`

- **Concrete formulas now documented in-place**
  - Main batched polynomial:
```rust
p(x) = Σ_g p_g(x)
```
  - Per-group (smaller) sumcheck polynomial:
```rust
p_g(x) = sel_g(x) * Σ_j (α_{2+offset(g,j)} * expr_{g,j}(x))
```
  - Per-group and batched sumcheck targets:
```rust
S_g = Σ_{x in {0,1}^n} p_g(x)
Σ_{x in {0,1}^n} p(x) = Σ_g S_g
```
  - Zerocheck expectation (chip-derived constraints):
```rust
S_g = 0
Σ_{x in {0,1}^n} p(x) = Σ_g S_g = 0
```

- **Layer output-eval group construction (new)**
  - Added comments in `Layer::from_circuit_builder` describing how groups
    are formed for:
    - read (`r_selector`)
    - write (`w_selector`)
    - lookup (`lk_selector`, including padding-normalized
      non-negated/negated forms)
    - rotation (left/right/target groups)
    - ECC bridge (x/y/slope/x3/y3 groups)
    - zero constraints (`zero_selector`)
  - Added a batched formula linking these groups to the main sumcheck term
    construction and clarified how `offset(g,i)` is assigned from flattened
    `expr_evals` order.
  - File updated:
    - `gkr_iop/src/gkr/layer.rs`

- **Expression assembly note**
  - In `zerocheck_layer`, comments clarify that:
    - `rlc_zero_expr` builds per-group `p_g(x)` terms
    - the final `p(x)` is the sum over all groups
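
The formulas above can be checked numerically on a toy instance. The sketch below is not the prover code: it stores every `sel_g` and `expr_{g,j}` as a plain evaluation table over `{0,1}^n`, uses the BabyBear prime only as a convenient modulus, and assumes `offset(g,j)` is a running index over the flattened expressions. The `Group` type and both function names are hypothetical.

```rust
/// Toy modulus (the BabyBear prime); all table entries are assumed to be < P.
const P: u64 = 2_013_265_921;

/// Hypothetical selector group: evaluation tables over the boolean hypercube.
pub struct Group {
    pub sel: Vec<u64>,        // sel_g(x) for each x in {0,1}^n
    pub exprs: Vec<Vec<u64>>, // expr_{g,j}(x) for each j and x
}

/// S_g = Σ_x sel_g(x) * Σ_j α_{2+offset+j} * expr_{g,j}(x)
pub fn group_sum(group: &Group, alphas: &[u64], offset: usize) -> u64 {
    let mut acc = 0u64;
    for (x, &sel) in group.sel.iter().enumerate() {
        let mut inner = 0u64;
        for (j, expr) in group.exprs.iter().enumerate() {
            inner = (inner + alphas[2 + offset + j] * expr[x] % P) % P;
        }
        acc = (acc + sel * inner % P) % P;
    }
    acc
}

/// Σ_x p(x) = Σ_g S_g, with offsets assigned in flattened expression order
/// (an assumption of this sketch).
pub fn batched_sum(groups: &[Group], alphas: &[u64]) -> u64 {
    let mut offset = 0;
    let mut total = 0u64;
    for g in groups {
        total = (total + group_sum(g, alphas, offset)) % P;
        offset += g.exprs.len();
    }
    total
}
```

In the zerocheck case every expression vanishes wherever its selector is 1, so each `S_g` and therefore the batched sum is 0 regardless of the `α` powers.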

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
## Problem

`e2e` skipped verification whenever `--shard-id` was set, even when only
a single shard proof was produced. That blocked valid single-shard
debugging and validation flows.

## Design Rationale

Treat single-shard verification as proof-soundness verification, not
full continuation-chain verification. Verify the shard proof itself and
halt invariants, but keep cross-shard continuation checks only for
multi-shard proof sets.

## Change Highlights

- `ceno_zkvm`: add standalone single-shard verifier path and route
single-proof `--shard-id` e2e runs through it
- `ceno_zkvm`: keep partial multi-shard subsets on the existing
skip-verify behavior
- `ceno_recursion`: add `--shard-id` plumbing to `e2e_aggregate` so
single-shard aggregation can be exercised directly
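
The gating described above could be summarized as a small decision function. This is a sketch under assumptions: `VerifyMode` and `select_verify_mode` are hypothetical names, and the real `e2e` binary may derive the same decision from different inputs.

```rust
/// Hypothetical verification modes for an e2e run.
#[derive(Debug, PartialEq, Eq)]
pub enum VerifyMode {
    /// All shards present: verify proofs plus cross-shard continuation checks.
    FullContinuation,
    /// `--shard-id` with exactly one proof: verify that shard proof on its own.
    SingleShard,
    /// Partial multi-shard subset: keep the existing skip-verify behavior.
    Skip,
}

pub fn select_verify_mode(shard_id: Option<usize>, n_proofs: usize) -> VerifyMode {
    match (shard_id, n_proofs) {
        (None, _) => VerifyMode::FullContinuation,
        (Some(_), 1) => VerifyMode::SingleShard,
        (Some(_), _) => VerifyMode::Skip,
    }
}
```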

## Benchmark / Performance Impact

No intended performance change. This PR changes verification
gating/semantics for the single-shard debug path only.
## Testing

```sh
cargo check -p ceno_zkvm -p ceno_recursion --bins --release
cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 --shard-id=0 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 --shard-id=0 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

## Risks and Rollout

Main risk is semantic confusion between standalone shard verification
and full multi-shard continuation verification. This PR keeps those
paths separate: single-shard verifies proof validity only, while
multi-shard still owns continuation checks.

## Follow-ups (optional)

None.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
## Problem

Dynamic heap/hint init tables need verifier-side checks so shard
continuation and dynamic-length constraints cannot be bypassed. This PR
also needs to stay compatible with current master after the
linker/public-io cleanup and the later single-shard debug verification
support.

## Design Rationale

Keep the memory-state verifier ISA-extensible by carrying RV32-specific
heap/hint bounds in RV32imMemStateConfig, while enforcing continuation
and dynamic-init checks in the native and recursion verifier paths.
Merge on top of master instead of reverting master-only changes.

## Change Highlights

- ceno_zkvm
  - restore verifier checks for dynamic heap/hint init tables
  - enforce heap/hint continuation and proof-size checks across shards
  - keep ZKVMVerifier / ZKVMVerifyingKey extensible with a mem-state
    verifier generic
  - merge master single-shard e2e verification flow and fix its halt
    expectation for debug shard verification
- ceno_recursion
  - restore heap/hint bound checks in aggregation leaf verification
  - merge shard-id plumbing for single-shard e2e_aggregate
- ceno_emul / ceno_rt
  - reconcile this PR's memory layout swap with master's removal of the
    PUBLIC I/O linker term
  - fix emulator dense-memory bounds for the merged layout

## Benchmark / Performance Impact

No intended performance change beyond verifier-side checks. Previous
measurements on this work showed negligible overhead; no new benchmark
run was needed for the merge-only follow-ups.

Benchmark command(s): not rerun for the merge-only follow-ups.

Environment (CPU/GPU, core count, rust toolchain, commit hash):
validated on local dev environment at head
7da2e88.

raw data:
- master: n/a
- this PR: n/a

## Testing

- cargo make clippy
- cargo check --config net.git-fetch-with-cli=true -p ceno_zkvm -p
ceno_recursion --bins --release
- cargo run --config net.git-fetch-with-cli=true --release --package
ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600
examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
- cargo run --config net.git-fetch-with-cli=true --features gpu
--release --package ceno_zkvm --bin e2e -- --platform=ceno
--max-cycle-per-shard=1600
examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
- cargo run --config net.git-fetch-with-cli=true --release --package
ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600
--shard-id=0
examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

## Risks and Rollout

Main risk is verifier semantic drift between full-trace verification and
single-shard debug verification. This branch keeps them separate:
full-trace verification still owns entry/continuation checks, while
single-shard debug verification checks only the selected shard segment.

## Follow-ups (optional)

- add a dedicated regression for the single-shard non-halt case in CI if
needed

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md
strictly.

---------

Co-authored-by: xkx <xiakunxian130@gmail.com>
related: #1265

# GPU Witness Generation

Accelerate witness generation by offloading computation from CPU to GPU.
This module (`ceno_zkvm/src/instructions/gpu/`) contains all GPU-side
dispatch,
caching, and utility code for the witness generation pipeline.

The CUDA backend lives in the sibling repo `ceno-gpu/`
(`cuda_hal/src/common/witgen/`).

## Architecture

### Module Layout

```
gpu/
├── dispatch.rs         — GPU dispatch entry point (try_gpu_assign_instances, gpu_fill_witness)
├── config.rs           — Environment variable config (3 env vars), kind tags
├── cache.rs            — Thread-local device buffer caching, shared EC/addr buffers
├── chips/              — Per-chip column map extractors + chip-specific GPU dispatch
│   ├── add.rs ... sw.rs  (24 RV32IM column map extractors)
│   ├── keccak.rs         (column map + keccak GPU dispatch: gpu_assign_keccak_instances)
│   └── shard_ram.rs      (column map + batch EC computation: gpu_batch_continuation_ec)
├── utils/
│   ├── column_map.rs   — Shared column map extraction helpers (extract_rs1, extract_rd, ...)
│   ├── d2h.rs          — Device-to-host: witness transpose, LK counter decode, compact EC D2H
│   ├── debug_compare.rs— GPU vs CPU comparison (activated by CENO_GPU_DEBUG_COMPARE_WITGEN)
│   ├── lk_ops.rs       — LkOp enum, SendEvent struct
│   ├── sink.rs         — LkShardramSink trait, CpuLkShardramSink
│   ├── emit.rs         — Emit helper functions (emit_u16_limbs, emit_logic_u8_ops, ...)
│   ├── fallback.rs     — CPU fallback: cpu_assign_instances, cpu_collect_lk_and_shardram
│   └── test_helpers.rs — Test utilities: assert_witness_colmajor_eq, assert_full_gpu_pipeline
└── mod.rs              — Module declarations + lk_shardram integration tests (19 tests)
```

### Data Flow

```
                    Pass 1: PreflightTracer
                    ┌──────────────────────┐
                    │  ShardPlanBuilder     │ → shard boundaries
                    │  addr_future_accesses │ → next-access HashMap (GPU cache reads and sorts before H2D)
                    └──────────┬───────────┘
                               │
                    Pass 2: FullTracer (per shard)
                    ┌──────────▼───────────┐
                    │  Vec<StepRecord>      │ 136 bytes/step, #[repr(C)]
                    └──────────┬───────────┘
                               │ H2D (cached per shard in cache.rs)
                    ┌──────────▼───────────────────────────────────┐
                    │              GPU Per-Instruction              │
                    │  ┌─────────────┬──────────────┬────────────┐ │
                    │  │ F-1 Witness │ F-2 LK Count │ F-3 EC/Addr│ │
                    │  │ (col-major) │  (atomics)   │ (shared buf)│ │
                    │  └──────┬──────┴──────┬───────┴─────┬──────┘ │
                    └─────────┼─────────────┼─────────────┼────────┘
                              │             │             │
                      GPU transpose    D2H counters   flush at shard end
                              │             │             │
                    ┌─────────▼─────────────▼─────────────▼────────┐
                    │                 CPU Merge                     │
                    │  RowMajorMatrix  LkMultiplicity  ShardContext │
                    └──────────────────────┬───────────────────────┘
                                           │
                    ┌──────────────────────▼───────────────────────┐
                    │           ShardRamCircuit (GPU)               │
                    │  Phase 1: per-row Poseidon2 (344 cols)       │
                    │  Phase 2: binary EC tree (layer-by-layer)    │
                    └──────────────────────┬───────────────────────┘
                                           │
                                           ▼
                                     Proof Generation
```

### Per-Shard Pipeline

Within `generate_witness()` (e2e.rs), each shard executes:

1. **upload_shard_steps_cached** — H2D `Vec<StepRecord>` (cached, shared
across all chips)
2. **ensure_shard_metadata_cached** — H2D shard scalars + allocate
shared EC/addr buffers
3. **Per-chip dispatch** — `gpu_fill_witness` matches `GpuWitgenKind` →
22 kernel variants
   - Each kernel writes: witness columns (col-major), LK counters
     (atomics), EC records + addr (shared buffers)
4. **flush_shared_ec_buffers** — D2H shared EC records + addr_accessed
into `ShardContext`
5. **invalidate_shard_steps_cache** — Free GPU shard_steps memory
6. **assign_shared_circuit** — ShardRamCircuit GPU pipeline (Poseidon2 +
EC tree)

### GPU/CPU Decision (dispatch.rs)

```
try_gpu_assign_instances():
  1. is_gpu_witgen_enabled()?          → CPU fallback if not set
  2. is_force_cpu_path() thread-local? → CPU fallback (debug comparison)
  3. I::GPU_LK_SHARDRAM == false?      → CPU fallback
  4. is_kind_disabled(kind)?           → CPU fallback
  5. Field != BabyBear?                → CPU fallback
  6. get_cuda_hal() unavailable?       → CPU fallback
  7. All pass                          → GPU path
```
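
The decision chain above maps naturally onto a sequence of early returns. The following is an illustrative sketch only: `DispatchInputs`, `WitgenPath`, and `choose_path` are hypothetical names, and each boolean stands in for the corresponding check in `dispatch.rs`.

```rust
/// Outcome of the dispatch decision; the string records why CPU was chosen.
#[derive(Debug, PartialEq, Eq)]
pub enum WitgenPath {
    Gpu,
    CpuFallback(&'static str),
}

/// Hypothetical snapshot of the six conditions, already evaluated.
pub struct DispatchInputs {
    pub witgen_enabled: bool,       // is_gpu_witgen_enabled()
    pub force_cpu_path: bool,       // is_force_cpu_path() thread-local
    pub gpu_lk_shardram: bool,      // I::GPU_LK_SHARDRAM
    pub kind_disabled: bool,        // is_kind_disabled(kind)
    pub field_is_babybear: bool,    // Field == BabyBear
    pub cuda_hal_available: bool,   // get_cuda_hal() succeeded
}

/// Checks are applied in the listed order; the first failure wins.
pub fn choose_path(i: &DispatchInputs) -> WitgenPath {
    if !i.witgen_enabled {
        return WitgenPath::CpuFallback("GPU witgen not enabled");
    }
    if i.force_cpu_path {
        return WitgenPath::CpuFallback("forced CPU path");
    }
    if !i.gpu_lk_shardram {
        return WitgenPath::CpuFallback("chip lacks GPU LK/shardram support");
    }
    if i.kind_disabled {
        return WitgenPath::CpuFallback("kind disabled");
    }
    if !i.field_is_babybear {
        return WitgenPath::CpuFallback("unsupported field");
    }
    if !i.cuda_hal_available {
        return WitgenPath::CpuFallback("CUDA HAL unavailable");
    }
    WitgenPath::Gpu
}
```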

### Keccak Dispatch

Keccak has a dedicated GPU dispatch path
(`chips/keccak.rs::gpu_assign_keccak_instances`)
separate from `try_gpu_assign_instances` because:
1. **Rotation**: each instance spans 32 rows (not 1), requiring
`new_by_rotation`
2. **Structural witness**: 3 selectors (sel_first/sel_last/sel_all) vs
the standard 1
3. **Input packing**: needs `packed_instances` with `syscall_witnesses`

The LK/shardram collection logic is identical to the standard path.

### Lk and Shardram Collection

After GPU computes the witness matrix, LK multiplicities and shard RAM
records
are collected through one of several paths (priority order):

| Path | Witness | LK Multiplicity | Shard Records | When |
|------|---------|-----------------|---------------|------|
| **A** Shared buffer | GPU | GPU counters → D2H | Shared GPU buffer (deferred) | Default for all verified kinds |
| **B** Compact EC | GPU | GPU counters → D2H | Compact EC D2H per-kernel | Older non-shared-buffer kinds |
| **C** CPU shardram | GPU | GPU counters → D2H | CPU `cpu_collect_shardram` | GPU shard unverified |
| **D** CPU full | GPU | CPU `cpu_collect_lk_and_shardram` | CPU full | GPU LK unverified |
| **E** CPU only | CPU | CPU `assign_instance` | CPU `assign_instance` | GPU unavailable |

Currently all non-Keccak kinds use **Path A**. Paths B-E are
fallback/debug paths.

## E2E Pipeline Modes (e2e.rs)

```
create_proofs_streaming()
│
├─ Default GPU backend (CENO_GPU_ENABLE_WITGEN unset):
│   Overlap pipeline:
│     Thread A (CPU): witgen(shard 0) → witgen(shard 1) → witgen(shard 2) → ...
│     Thread B (GPU): ................prove(shard 0) → prove(shard 1) → ...
│     crossbeam::bounded(0) rendezvous channel for back-pressure
│
└─ CENO_GPU_ENABLE_WITGEN=1 (GPU witgen) or CPU-only build:
    Sequential pipeline:
      witgen(shard 0) → prove(shard 0) → witgen(shard 1) → prove(shard 1) → ...
      GPU shared between witgen and proving; no overlap possible.
```
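
To illustrate the back-pressure behavior of the overlap pipeline, here is a small sketch that swaps in the standard library's `std::sync::mpsc::sync_channel(0)` for `crossbeam::bounded(0)`; both are rendezvous channels, so a send completes only when the receiver takes the item. The function name and the string "proofs" are placeholders, not the real pipeline.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Thread A "generates witnesses" and hands each shard to thread B, which
/// "proves" it. With capacity 0, witgen can run at most one shard ahead.
pub fn run_overlap_pipeline(n_shards: usize) -> Vec<String> {
    let (tx, rx) = sync_channel::<usize>(0);

    let witgen = thread::spawn(move || {
        for shard in 0..n_shards {
            // Stand-in for CPU witness generation of `shard`.
            // `send` blocks here until the prover is ready to receive.
            tx.send(shard).expect("prover side hung up");
        }
        // `tx` is dropped here, which ends the receiver loop below.
    });

    let mut proofs = Vec::new();
    for shard in rx {
        // Stand-in for GPU proving of `shard`.
        proofs.push(format!("proof(shard {shard})"));
    }

    witgen.join().expect("witgen thread panicked");
    proofs
}
```

The zero-capacity choice bounds memory: at most one finished witness waits while the previous shard is being proved.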

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `CENO_GPU_ENABLE_WITGEN` | unset (CPU witgen) | Set to enable GPU witness generation. Sequential witgen+prove pipeline. |
| `CENO_GPU_DISABLE_WITGEN_KINDS` | none | Comma-separated kind tags to disable specific chips' GPU path. Example: `add,keccak,lw`. Falls back to CPU for those chips. |
| `CENO_GPU_DEBUG_COMPARE_WITGEN` | unset | Enable GPU vs CPU comparison for all chips. Runs both paths and diffs results. |
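
As an illustration of how a comma-separated value such as `add,keccak,lw` might be interpreted, here is a small parsing sketch. The function names are hypothetical, and trimming whitespace and lowercasing tags are assumptions of this sketch rather than documented behavior of `config.rs`.

```rust
use std::collections::HashSet;

/// Parse the raw value of a kinds list (None models an unset variable).
/// Empty entries are ignored; tags are trimmed and lowercased (assumption).
pub fn parse_disabled_kinds(raw: Option<&str>) -> HashSet<String> {
    raw.unwrap_or("")
        .split(',')
        .map(|tag| tag.trim().to_ascii_lowercase())
        .filter(|tag| !tag.is_empty())
        .collect()
}

/// True if `kind_tag` was listed, in which case the chip would use the CPU path.
pub fn contains_kind(disabled: &HashSet<String>, kind_tag: &str) -> bool {
    disabled.contains(&kind_tag.to_ascii_lowercase())
}
```

A caller would typically feed it `std::env::var("CENO_GPU_DISABLE_WITGEN_KINDS").ok().as_deref()`.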

### `CENO_GPU_DEBUG_COMPARE_WITGEN` Coverage

When set, all failures are collected into a `DebugCompareReport`
(thread-local).
Detailed mismatches are logged via `tracing::error!` in real time; at
pipeline end
`assert_debug_compare_report()` prints a summary table and panics if any
failures exist.

**Per-chip (in dispatch.rs, for each opcode circuit):**
- `debug_compare_final_lk` — GPU LK multiplicity vs CPU
`assign_instance` baseline (all 8 lookup tables)
- `debug_compare_witness` — GPU witness matrix vs CPU witness
(element-by-element)
- `debug_compare_shardram` — GPU shard records (read_records,
write_records, addr_accessed) vs CPU
- `debug_compare_shard_ec` — GPU compact EC records vs CPU-computed EC
points (nonce, x[7], y[7])

**Per-chip, Keccak-specific (in chips/keccak.rs):**
- `debug_compare_keccak` — Combined witness + LK + shard comparison for
keccak's rotation-aware layout

**ShardRamCircuit (in chips/shard_ram.rs):**
- `debug_compare_shard_ram_witness` — GPU ShardRam witness vs CPU
baseline (from ShardRamInput)
- `debug_compare_shard_ram_witness_from_device` — GPU ShardRam witness
vs CPU baseline (D2H device buffer → convert → CPU assign)

**Per-shard, E2E level (in e2e.rs, all chips combined):**
- `log_shard_ctx_diff` — Aggregated addr_accessed comparison
(write/read_records skipped when GPU witgen enabled)
- `log_combined_lk_diff` — Merged LK multiplicities after
`finalize_lk_multiplicities()` (catches cross-chip merge issues)

## Tests

**79 tests total** (`cargo test --features gpu,u16limb_circuit -p
ceno_zkvm --lib -- "gpu"`)

| Category | Count | Location | What it tests |
|----------|------:|----------|---------------|
| Column map extraction | 33 | `chips/*.rs` (31 via `test_colmap!` macro + 2 manual) | Circuit config → column map: all IDs in-range and unique |
| GPU witgen correctness | 23 | `chips/*.rs` | GPU kernel output vs CPU `assign_instance` (element-by-element witness comparison) |
| LK+shardram match | 19 | `gpu/mod.rs` | `collect_lk_and_shardram` / `collect_shardram` vs `assign_instance` baseline |
| LkOp encoding | 1 | `utils/mod.rs` | `LkOp::encode_all()` produces correct table/key pairs |
| EC point match | 1 | `scheme/septic_curve.rs` | GPU Poseidon2+SepticCurve EC point vs CPU `to_ec_point` |
| Poseidon2 sponge | 1 | `scheme/septic_curve.rs` | GPU Poseidon2 permutation vs CPU |
| Septic from_x | 1 | `scheme/septic_curve.rs` | GPU `septic_point_from_x` vs CPU |

### Running Tests

```bash
# All GPU tests (requires CUDA device)
CENO_GPU_ENABLE_WITGEN=1 cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "gpu"

# Column map tests only (no CUDA device needed)
cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "test_extract_"

# LK/shardram tests only (no CUDA device needed)
cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "lk_shardram"

# With debug comparison enabled
CENO_GPU_ENABLE_WITGEN=1 CENO_GPU_DEBUG_COMPARE_WITGEN=1 cargo test --features gpu,u16limb_circuit -p ceno_host -- test_elf
```

## Per-Chip Boilerplate Macros

Three macros in `instructions.rs` reduce per-chip GPU integration to ~3
lines:

```rust
impl Instruction<E> for MyChip {
    // Emit LK ops + shard RAM records (CPU companion for GPU witgen)
    impl_collect_lk_and_shardram!(r_insn, |sink, step, _config, _ctx| {
        emit_u16_limbs(sink, step.rd().unwrap().value.after);
    });

    // Collect shard RAM records only (when GPU handles LK)
    impl_collect_shardram!(r_insn);

    // GPU dispatch: try GPU → fallback CPU
    impl_gpu_assign!(dispatch::GpuWitgenKind::Add);
}
```

---------

Co-authored-by: Ming <hero78119@gmail.com>
Co-authored-by: xkx <xiakunxian130@gmail.com>
Co-authored-by: Ray Gao <qg2153@columbia.edu>
Add docs, following #1223.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
Co-authored-by: xkx <xiakunxian130@gmail.com>
## Problem

A malformed proof could cause the verifier process to crash — any
`unwrap` / `expect` / unchecked indexing / `assert!` on proof-derived
data is a liveness / DoS risk (a crafted proof kills the process instead
of being cleanly rejected).

## Changes

- **Panic cleanup.** Convert every `assert!` / `assert_eq!` / `unwrap` /
`expect` / unchecked indexing on proof-derived data to `ZKVMError`
returns across `verify_proofs_halt`, `verify_proof_validity`,
`verify_chip_proof`, `TowerVerify::verify`, and
`EccVerifier::verify_ecc_proof`. A malformed proof is now rejected
cleanly in all paths.
- **Document the verifier's semantic contract.** New sections in
`CLAUDE.md` and `docs/src/technical-overview.md` ("What the verifier
guarantees") state the two program-level facts a valid Ceno proof
attests to: **execution starts at `vk.entry_pc`** and **the terminal
shard invokes the halt ecall**. The exit code is deliberately *not* a
verifier guarantee — `public_values.exit_code` is bound by the
halt-ecall chip to register `a0`, but the guest program defines its own
exit-code semantics, so a non-zero value may be a legitimate application
signal. Callers that want "exited successfully" compare `exit_code == 0`
themselves.
- `CLAUDE.md` additionally flags prefix proofs (`expect-halt = false`)
as a dev/bench affordance, not a production surface. This caveat is
contributor-facing and is kept out of the user-facing mdbook.
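
A minimal sketch of the conversion pattern, with a hypothetical `VerifyError` enum standing in for `ZKVMError` (whose real variants are not shown here). The point is that indexing and length assumptions on proof-derived data become `Result` values instead of panics.

```rust
/// Hypothetical stand-in for `ZKVMError`.
#[derive(Debug, PartialEq, Eq)]
pub enum VerifyError {
    MissingEval { index: usize, len: usize },
    LengthMismatch { expected: usize, actual: usize },
}

/// Before: `evals[index]` would panic on a short, attacker-supplied vector.
/// After: the out-of-range case is reported as a verification error.
pub fn eval_at(evals: &[u64], index: usize) -> Result<u64, VerifyError> {
    evals
        .get(index)
        .copied()
        .ok_or(VerifyError::MissingEval { index, len: evals.len() })
}

/// Before: `assert_eq!(evals.len(), expected)`.
/// After: a mismatch rejects the proof without killing the process.
pub fn check_len(evals: &[u64], expected: usize) -> Result<(), VerifyError> {
    if evals.len() != expected {
        return Err(VerifyError::LengthMismatch { expected, actual: evals.len() });
    }
    Ok(())
}
```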

## Test plan

- [x] `cargo check --workspace --all-targets`
- [x] `cargo make clippy` (workspace, `-D warnings`)
- [x] `cargo test -p ceno_zkvm --lib` (152 passed)
- [x] `cargo test -p ceno_zkvm --lib scheme::` (18 passed)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Trigger the reth benchmark workflow in `scroll-tech/ceno-reth-benchmark`
when the `regression-e2e-reth` label is added to a PR.

- dispatches `run-benchmark-v2.yml` on `scroll-tech/ceno-reth-benchmark`
- passes the PR head SHA as `ceno_version`
- uses default benchmark block `23817600`
- reports benchmark results back to the PR as a comment
…ter (#1319)

## Problem

This PR carries forward the Fiat-Shamir soundness fix for
prover-supplied evaluations onto current `master`.

Historically, several prover-provided evaluations were included in
proofs but not always absorbed into the transcript before later
challenges were sampled. That leaves room for internally consistent
forgeries if prover and verifier do not bind the same data in the same
order.

This work extends, with a few refinements, the original fix in
[#1294](#1294) by @MavenRain. Many thanks for identifying the issue
clearly and putting together the first end-to-end patch.

## Design Rationale

The goal here is to land the same soundness principle on top of the
newer codebase with minimal semantic drift:

- keep prover/verifier transcript ordering aligned
- preserve the original fix's intent while rebasing onto current
`master`
- factor repeated transcript-binding logic into small helpers where that
improves reviewability
- document a few subtle data-layout assumptions so future refactors are
less likely to break transcript consistency

This PR is intentionally not a larger transcript-architecture refactor.
It keeps the patch narrow and practical for the current code structure.

## Change Highlights

- `gkr_iop`
  - bind final sumcheck / zerocheck / rotation evaluations into the
    transcript in the verifier path
  - factor the binding step into a small helper to make the transcript
    rule explicit
- `ceno_zkvm`
  - keep tower verifier transcript binding aligned with the prover for
    active prod/logup rounds
  - document the `TowerProofs` layout so it is clear that only active
    rounds are stored
- `ceno_recursion`
  - mirror the same transcript-binding behavior in the recursion verifier
    DSL
  - factor repeated challenger-observe logic into local helpers for
    readability
- merge current `master`
  - resolve drift against the latest branch layout and verifier offsets
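
The binding rule can be illustrated with a toy transcript. This is a sketch under loud assumptions: `ToyTranscript` is built on the standard library's `DefaultHasher`, which is not cryptographically secure, and none of these names come from `gkr_iop` or the real challenger API. It only shows why evaluations must be absorbed before the next challenge is sampled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy Fiat-Shamir transcript (illustration only, NOT secure).
pub struct ToyTranscript {
    state: DefaultHasher,
}

impl ToyTranscript {
    pub fn new(label: &[u8]) -> Self {
        let mut state = DefaultHasher::new();
        state.write(label);
        Self { state }
    }

    /// Absorb one prover-supplied value into the transcript state.
    pub fn absorb(&mut self, value: u64) {
        self.state.write_u64(value);
    }

    /// Derive a challenge from everything absorbed so far.
    pub fn challenge(&mut self) -> u64 {
        let c = self.state.finish();
        // Feed the challenge back in so successive challenges differ.
        self.state.write_u64(c);
        c
    }
}

/// Hypothetical helper: bind all evaluations, then sample.
pub fn bind_evals_then_sample(t: &mut ToyTranscript, evals: &[u64]) -> u64 {
    t.absorb(evals.len() as u64);
    for &e in evals {
        t.absorb(e);
    }
    t.challenge()
}

/// The unsound pattern: the challenge ignores the evaluations entirely.
pub fn sample_without_binding(t: &mut ToyTranscript, _evals: &[u64]) -> u64 {
    t.challenge()
}
```

With binding, changing any evaluation changes the challenge; without it, a prover can pick evaluations after seeing the challenge. Prover and verifier must also call the helper at the same point, which is the ordering concern this PR addresses.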

## Benchmark / Performance Impact

This is a soundness/correctness fix. No meaningful performance change is
intended.

## Testing

```sh
cargo check -p gkr_iop -p ceno_zkvm -p ceno_recursion
```

## Risks and Rollout

Main risk is transcript-order mismatch across prover, verifier, and
recursion verifier. This PR keeps those paths aligned and uses small
helpers/comments to make the ordering easier to audit.

Rollback is straightforward: revert this PR if any transcript
compatibility issue is discovered before merge.

## Follow-ups (optional)

- A more systematic long-term design would make transcript observation a
more explicit streaming interface across prover and verifier boundaries.
- The tower-level GPU/mock path mentioned in
[#1294](#1294) remains a useful
follow-up area if it is still relevant in the surrounding repos.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

---------

Co-authored-by: Onyeka Obi <Onyeka.Obi@gmail.com>
The 6 slow basefold-verifier tests in `ceno_recursion` pin each `cargo
make tests` pass at ~10 min, ~20 min across both feature-set runs. Mark
them `#[ignore]` so default CI skips them; run locally with `cargo test
-p ceno_recursion --lib -- --ignored --skip aggregation`.

No CI step runs them in `--ignored` mode yet — follow-up if we want
merge-queue to still exercise them.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Ming <hero78119@gmail.com>
Reference: https://github.com/marketplace/actions/workflow-dispatch

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
Fan out the Tests job into a 2-leg matrix so default and goldilocks run
on separate runners. Each leg gets its own cache key to avoid thrash.

Status-check names change to `Run Tests (default)` / `Run Tests
(goldilocks)`; branch-protection / merge-queue required checks need to
be updated when this lands.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Ming <hero78119@gmail.com>
## Summary

This PR makes the GPU prover compact/default-aware for MLE inputs while
keeping protocol-facing behavior unchanged. The verifier, PCS
dimensions, transcript, and sumcheck domains still use the logical
domain; only GPU-resident prover data can stay compact by occupied rows.

Companion CUDA kernel changes: scroll-tech/ceno-gpu#146.

## Compact vs logical domains

- Logical domain is the constraint/sumcheck/PCS domain: `num_vars`,
power-of-two padding, and rotation expansion remain verifier-visible and
unchanged.
- Compact domain is the occupied physical row range kept on GPU. Keccak
is the main stress case: each logical syscall instance expands to 32
physical rows, and compact buffers avoid padding those rows to the full
logical domain.
- Missing compact tail entries are represented by `tail_default`. Most
witness tails are zero; logup shared numerator can use one.
- CPU-side structures remain logical for compatibility. GPU proving
carries compact resident length plus logical metadata and materializes
logical shape only at boundaries that require it.

```text
Logical domain, protocol view:

  rows used by constraints / transcript / PCS / sumcheck
  [ occupied physical rows ][ logical tail padding ........ ]
  <----------------------- 2^num_vars ---------------------->

Compact GPU resident view:

  [ occupied physical rows ] + metadata { logical num_vars, tail_default }

Kernel read rule:

  if index < occupied_len: read compact[index]
  else:                    read tail_default
```
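
A host-side model of the kernel read rule, as a minimal sketch. `CompactMle` and its methods are hypothetical names for illustration; the real GPU structures and element type (a field element, not `u64`) are not shown in this description.

```rust
/// Hypothetical host-side model of a compact, default-aware MLE.
pub struct CompactMle {
    compact: Vec<u64>, // occupied physical rows only
    num_vars: u32,     // logical domain has 2^num_vars entries
    tail_default: u64, // value of every omitted tail entry
}

impl CompactMle {
    /// Returns `None` if the occupied rows would not fit the logical domain.
    pub fn new(compact: Vec<u64>, num_vars: u32, tail_default: u64) -> Option<Self> {
        let logical_len = 1usize.checked_shl(num_vars)?;
        (compact.len() <= logical_len).then(|| Self {
            compact,
            num_vars,
            tail_default,
        })
    }

    pub fn logical_len(&self) -> usize {
        1usize << self.num_vars
    }

    /// Kernel read rule: occupied rows come from the buffer, the rest
    /// from `tail_default`. Out-of-domain indices yield `None`.
    pub fn read(&self, index: usize) -> Option<u64> {
        if index >= self.logical_len() {
            return None;
        }
        Some(self.compact.get(index).copied().unwrap_or(self.tail_default))
    }

    /// Expand to the full logical shape, for boundaries that require it.
    pub fn materialize(&self) -> Vec<u64> {
        (0..self.logical_len())
            .map(|i| self.compact.get(i).copied().unwrap_or(self.tail_default))
            .collect()
    }
}
```

For example, three occupied rows with `num_vars = 3` and `tail_default = 1` read as `[5, 6, 7, 1, 1, 1, 1, 1]` over the logical domain while storing only three values.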

## Flow modes

- CPU backend: unchanged logical host MLE/RMM behavior.
- GPU backend with `CENO_GPU_ENABLE_WITGEN=0`: CPU witgen still produces
host traces; GPU proving extracts/copies occupied rows into compact GPU
MLE specs while preserving logical `num_vars` for constraints and
sumcheck.
- GPU backend with `CENO_GPU_ENABLE_WITGEN=1`: GPU witgen can feed
compact device-backed traces or replay-materialized inputs directly into
the same compact proving path.
- Replay-heavy Keccak/ShardRam: compact tower inputs are materialized
for tower, released, then rematerialized for ECC/rotation/main
constraints so peak VRAM is lower without changing proof semantics.

```text
CPU backend, no compact GPU semantics:

  CPU witgen
    -> logical RowMajorMatrix / MLEs
    -> CPU prover stages
    -> PCS / transcript / verifier all see logical domain
```

```text
GPU backend, CENO_GPU_ENABLE_WITGEN=0:

  CPU witgen
    -> logical host traces / committed PCS data
    -> per-chip GPU extraction
         host logical rows -> compact GPU MLE specs
         keep { occupied_len, logical num_vars, tail_default }
    -> shared GPU proving stages
         tower -> ECC -> rotation -> main constraints -> opening
    -> PCS / transcript / verifier still see logical domain
```

```text
GPU backend, CENO_GPU_ENABLE_WITGEN=1:

  GPU witgen
    -> compact device-backed traces or replay sources
    -> deferred commit / replay materialization when needed
    -> shared GPU proving stages
         tower -> ECC -> rotation -> main constraints -> opening
    -> PCS / transcript / verifier still see logical domain

  Keccak / ShardRam replay lifetime:

    materialize compact tower input
      -> prove tower
      -> drop tower input
      -> rematerialize for ECC / rotation / main
      -> open committed traces
```

## Proving semantics

- Sumcheck runs over logical domains. Compact metadata only changes how
GPU kernels read resident buffers and defaults for omitted tail entries.
- Tower build/prove consumes compact product/logup inputs directly and
avoids carrying full-domain padded tower inputs through the proof
lifetime.
- Rotation/main GKR use the same logical constraint domains while
accepting compact/default-aware GPU MLE inputs where the kernels support
it.
- Scheduler-facing estimates and memtracking distinguish compact
resident bytes from logical-domain temporary bytes to avoid both
under-booking and double-counting.
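
As a sanity check on "logical domain, compact storage", the plain sum over the logical domain can be computed without materializing the tail. The sketch below is an assumption-laden toy: it uses `u64` arithmetic modulo the BabyBear prime as a stand-in field, and `logical_sum` / `padded_sum` are hypothetical names, not functions from this PR.

```rust
/// Toy modulus (the BabyBear prime, 15 * 2^27 + 1), used only for illustration.
const P: u64 = 2_013_265_921;

/// Sum over the logical domain using only the compact prefix:
/// sum(compact) + (2^num_vars - occupied_len) * tail_default  (mod P).
pub fn logical_sum(compact: &[u64], num_vars: u32, tail_default: u64) -> Option<u64> {
    let logical_len = 1u64.checked_shl(num_vars)?;
    let tail_len = logical_len.checked_sub(compact.len() as u64)?;
    let head = compact.iter().fold(0u64, |acc, &v| (acc + v % P) % P);
    // Both factors are below P < 2^31, so the product fits in u64.
    let tail = (tail_len % P) * (tail_default % P) % P;
    Some((head + tail) % P)
}

/// Reference implementation: pad to the full logical shape, then sum.
pub fn padded_sum(compact: &[u64], num_vars: u32, tail_default: u64) -> Option<u64> {
    let logical_len = 1usize.checked_shl(num_vars)?;
    if compact.len() > logical_len {
        return None;
    }
    let mut padded = compact.to_vec();
    padded.resize(logical_len, tail_default);
    Some(padded.iter().fold(0u64, |acc, &v| (acc + v % P) % P))
}
```

Both functions agree on every valid input, which is the property the compact path has to preserve: the claimed sum is a statement about the logical domain, whatever the resident layout.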

```text
Shared GPU proving path:

  compact/default-aware MLE specs
    { ptr, occupied_len, logical num_vars, tail_default }
        |
        +--> tower build/prove
        |      - compact product/logup inputs
        |      - logical sumcheck rounds
        |
        +--> rotation / main GKR
        |      - logical constraint domain
        |      - compact/default-aware reads
        |
        +--> PCS opening
               - verifier-visible logical dimensions unchanged
```

## Unified paths

- CPU witgen + GPU proving and GPU witgen + GPU proving now share the
same compact chip proof stages: tower, ECC, rotation, main constraints,
and PCS opening.
- Product/logup tower construction is centralized around compact specs,
including the scalar-one logup numerator case.
- Sequential and concurrent chip proving use the same estimator model,
with memtracking checks available to catch estimator drift.

## Reviewer focus

- Boundaries between compact resident length and logical `num_vars`.
- `tail_default` handling in sumcheck/tower, especially non-zero logup
numerator defaults.
- Keccak rotated physical rows and ShardRam replay/materialization
lifetime.
- Scheduler estimates for `CENO_GPU_ENABLE_WITGEN=0/1` and
`CENO_CONCURRENT_CHIP_PROVING=0/1`.
- Verifier/protocol parity: this PR should not change proof format or
transcript semantics.

## Benchmark

Source runs:

- Baseline: [ceno-reth-benchmark run 25004787999 attempt
1](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25004787999/attempts/1#summary-73224449314),
result
[mainnet23817600-20260427-234425](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/ceno/mainnet23817600-20260427-234425_summary.md)
- This PR: [ceno-reth-benchmark run 25004860748 attempt
2](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25004860748/attempts/2#summary-73295559209),
result
[mainnet23817600-20260428-074320](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/prover_mle_zero_padding/mainnet23817600-20260428-074320_summary.md)

Block: `23817600`. Per-operation app-prove rows are profile totals
across overlapped shard work, so they can exceed wall time;
E2E/app-prove rows are the wall-time comparison.

| Metric | Baseline | This PR | Delta | Change |
|--------|----------|---------|-------|--------|
| E2E total time | 81.900s | 80.900s | -1.000s | -1.22% |
| app_prove wall time | 67.200s | 66.300s | -0.900s | -1.34% |
| emulator | 10.400s | 10.500s | +0.100s | +0.96% |
| commit_traces | 8.075s | 8.049s | -0.026s | -0.32% |
| extract_witness_mles | 27.569s | 28.837s | +1.268s | +4.60% |
| transport_structural_witness | 3.475s | 3.058s | -0.417s | -12.00% |
| build_tower_witness_gpu | 4.711s | 3.413s | -1.298s | -27.55% |
| prove_tower_relation_gpu | 178.197s | 188.436s | +10.239s | +5.75% |
| prove_main_constraints | 24.464s | 24.238s | -0.226s | -0.92% |
| pcs_opening | 17.892s | 17.716s | -0.176s | -0.98% |
| CPU/GPU overlap gap | 3.910s | 3.930s | +0.020s | +0.51% |

Peak memory is extracted from concurrent benchmark job logs by taking
the max of `[gpu device]` snapshots. `pool_booked` is scheduler
reservation/estimate, not actual VRAM usage.

| Memory metric | Baseline peak | This PR peak | Drop | Drop % |
|---------------|--------------:|-------------:|-----:|-------:|
| `cuda_used` | 23637.19 MB | 21557.19 MB | 2080.00 MB | 8.80% |
| `pool_used` | 21792.58 MB | 19267.89 MB | 2524.69 MB | 11.59% |
| `pool_reserved` | 23136.00 MB | 21056.00 MB | 2080.00 MB | 8.99% |
| `pool_booked` | 23180.86 MB | 23180.87 MB | -0.01 MB | -0.00% |

Summary: wall time is slightly faster in this run (`81.9s -> 80.9s`).
Peak VRAM is lower (`cuda_used`: `23637.19 MB -> 21557.19 MB`, -`2080.00
MB` / -`8.80%`; `pool_reserved`: `23136.00 MB -> 21056.00 MB`, -`2080.00
MB` / -`8.99%`). Compact tower build is materially faster (`4.711s ->
3.413s`), while the overlapped tower proving profile total is higher
(`178.197s -> 188.436s`); because chip proving overlaps across shards,
the wall-time result is the primary performance signal.

## Validation commands

```sh
cargo check --features gpu --package ceno_zkvm --bin e2e
cargo make clippy
CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
CENO_GPU_MEM_TRACKING=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```
## Problem

Main sumcheck was proved and verified per chip, which duplicated
transcript work, selector/claim handling, and PCS opening plumbing
across chips.

This PR includes the Jagged PCS integration. For benchmark results, see
PR #1336.

## Design Rationale

Use one global batched main sumcheck proof while keeping PCS openings in
the existing suffix path. The verifier mirrors the prover transcript
order, including ECC bridge sampling before the global `combine subset
evals` challenge, and evaluates frontloaded expressions in the verifier.

## Change Highlights

- `ceno_zkvm`: batches main constraints into a single global proof path
across chip proofs.
- `ceno_zkvm`: keeps witness/fixed PCS openings per chip after global
main verification.
- `ceno_recursion`: mirrors native verifier changes for the batched main
proof.
- `ceno-gpu`: supports the batched main proving flow.

## Benchmark / Performance Impact

### CPU Integration E2E

Local CPU sanity compares PR CPU batched-main against a local `master`
baseline on `secp256r1_verify_prehash`.

| Case | Command Target | Shard Proof | vs Baseline | Result |
|---|---|---:|---:|---|
| Baseline `master` | `ceno_zkvm e2e --platform=ceno .../secp256r1_verify_prehash` | 37.378s | Baseline | Pass |
| PR + gkr-backend worker-bit merge optimization | same target | 41.974s | -1.12x | Pass |

CPU result: batched main is now `1.123x` slower than baseline
(`+12.30%`) on this integration target, rather than the previous
timeout-scale regression.

### GPU Reth Benchmark

Benchmark session compares the frontload baseline against successive
`feat/batch_main_sumcheck` optimization runs on block `23817600`, GPU
proving, `CENO_GPU_ENABLE_WITGEN=0`.

Comparison convention: lower time is better. Signed `x` values use `-Nx`
for slower-than-baseline wall time and `+Nx` for faster/lower-time
metrics; for example, taking twice as long is `-2.00x`.
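
The signed-ratio convention can be written down as a small helper. This is a sketch of the convention as stated above; `signed_ratio` is a hypothetical name and assumes both timings are positive (rows with a `0.000s` baseline are reported as "New cost" or "Removed" instead).

```rust
/// Format a timing comparison using the signed `x` convention:
/// `-Nx` when the new time is N times slower, `+Nx` when it is N times faster.
/// Assumes both inputs are strictly positive.
pub fn signed_ratio(baseline_secs: f64, new_secs: f64) -> String {
    if new_secs > baseline_secs {
        format!("-{:.2}x", new_secs / baseline_secs)
    } else {
        format!("+{:.2}x", baseline_secs / new_secs)
    }
}
```

For instance, `signed_ratio(75.6, 91.8)` gives `-1.21x` (the E2E row) and `signed_ratio(24.155, 3.739)` gives `+6.46x` (the `extract_witness_mles` row).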

### Timeline / Optimization Progress

| Date | Run | Ceno / GPU Commit | E2E | vs Baseline | app_prove | vs Baseline | prove_batched_main_constraints | Short Highlight |
|---|---|---|---:|---:|---:|---:|---:|---|
| May 6 | [25419833788 / job 74559223217](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25419833788/job/74559223217) | Ceno `7a07649b`, GPU `1118dca8` | 75.600s | Baseline | 61.000s | Baseline | 0.000s | **Baseline**: frontload, per-chip main constraints |
| May 9 AM | [25594090744 / job 75136918384](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25594090744/job/75136918384) | Ceno `dd229c00`, GPU `340651b4` | 103.000s | -1.36x | 87.400s | -1.43x | 0.000s | Batched branch after alpha.28 upgrade; tower/extract totals much lower but wall time regressed |
| May 9 PM | [25603601935 / job 75161599043](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25603601935/job/75161599043) | Ceno `d5ae1b3a`, GPU `fbef26f3` | 104.000s | -1.38x | 88.300s | -1.45x | 26.925s | Batched main proof enabled; new batched-main critical path dominates |
| May 11 | [25655529702 / job 75302942526](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25655529702/job/75302942526) | Ceno `c2c45cc9`, GPU `3dedbc78` | 91.800s | -1.21x | 76.500s | -1.25x | 15.457s | Latest optimization: direct batched-main construction + bucketed fold/eval GPU sumcheck |

### E2E / Layer

| Metric | Baseline | Latest Optimization | Comparison |
|---|---:|---:|---:|
| E2E total | 75.600s | 91.800s | -1.21x |
| emulator | 10.100s | 10.200s | -1.01x |
| app_prove wall time | 61.000s | 76.500s | -1.25x |

### App Prove Breakdown

Profiler module totals can overlap because chip proving is concurrent;
use `app_prove wall time` above for critical-path impact. The latest run
materially reduces the new batched-main cost, but total wall time is
still slower than the frontload baseline.

| Operation | Baseline | Batched May 9 AM | Batched May 9 PM | Latest May 11 | Latest vs Baseline |
|---|---:|---:|---:|---:|---:|
| prove_batched_main_constraints | 0.000s | 0.000s | 26.925s | 15.457s | New cost |
| prove_main_constraints | 22.622s | 0.000s | 0.000s | 0.000s | Removed |
| extract_witness_mles | 24.155s | 3.760s | 3.713s | 3.739s | +6.46x |
| build_tower_witness_gpu | 3.491s | 0.323s | 0.316s | 0.323s | +10.81x |
| prove_tower_relation_gpu | 176.090s | 24.008s | 24.417s | 24.857s | +7.08x |
| pcs_opening | 15.246s | 15.207s | 15.164s | 15.175s | +1.00x |
| commit_traces | 6.827s | 6.814s | 6.851s | 6.857s | -1.00x |
| parsed rows total | 251.118s | 50.995s | 78.287s | 67.460s | +3.72x |

### Latest Improvement Against Previous Batched Run

| Metric | May 9 PM Batched Main | May 11 Latest | Improvement |
|---|---:|---:|---:|
| E2E total | 104.000s | 91.800s | +1.13x |
| app_prove wall time | 88.300s | 76.500s | +1.15x |
| prove_batched_main_constraints | 26.925s | 15.457s | +1.74x |
| parsed rows total | 78.287s | 67.460s | +1.16x |

Benchmark command:

```sh
CENO_GPU_ENABLE_WITGEN=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_CACHE_LEVEL=0 \
RUSTFLAGS="-C target-feature=+avx2" \
cargo run --features "jemalloc,gpu" --release --bin ceno-reth-benchmark-bin -- \
  --mode prove-app --block-number 23817600 --rpc-url <redacted> \
  --output-dir output --cache-dir rpc-cache
```

Environment:

- GitHub self-hosted GPU runner, CUDA device `cc=8.9`, `24GB` GPU
memory.
- Rust `nightly-2025-11-20`, cargo `1.93.0-nightly`.
- Baseline: [run 25419833788 / job
74559223217](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25419833788/job/74559223217),
Ceno `7a07649b`, GPU `1118dca8`,
[summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/frontload/mainnet23817600-20260506-142423_summary.md).
- 2026-05-09 early batched branch: [run 25594090744 / job
75136918384](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25594090744/job/75136918384),
Ceno `dd229c00`, GPU `340651b4`,
[summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/batch_main_sumcheck/mainnet23817600-20260509-142948_summary.md).
- 2026-05-09 batched-main critical path: [run 25603601935 / job
75161599043](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25603601935/job/75161599043),
Ceno `d5ae1b3a`, GPU `fbef26f3`,
[summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/batch_main_sumcheck/mainnet23817600-20260509-223459_summary.md).
- Latest optimization: [run 25655529702 / job
75302942526](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25655529702/job/75302942526),
Ceno `c2c45cc9`, GPU `3dedbc78`,
[summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/batch_main_sumcheck/mainnet23817600-20260511-150859_summary.md).

Summary: latest optimization improves `prove_batched_main_constraints`
by `+1.74x` against the previous batched-main run (`26.925s -> 15.457s`)
and improves E2E by `+1.13x` (`104.000s -> 91.800s`). It remains slower
than the frontload baseline (`75.600s -> 91.800s`, `-1.21x`), with the
remaining gap concentrated in the new batched-main critical path.

## Testing

```sh
RUST_MIN_STACK=33554432 cargo check --package ceno_recursion --bin e2e_aggregate
RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

Also passed the linked GPU e2e benchmark run above.

## Risks and Rollout

- Soundness risk is concentrated in transcript ordering and verifier
frontload evaluation; native and recursion verifiers now follow the same
global proof flow.
- Performance is not yet an E2E win in the linked benchmark despite
removing per-chip main-constraint cost; further scheduling/host-overlap
work is needed before rollout as a performance improvement.

## Follow-ups

- Investigate reducing the new `prove_batched_main_constraints`
critical-path cost.
- Keep benchmark summaries explicit that parsed module totals overlap
and are not a wall-time decomposition.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

---------

Co-authored-by: Velaciela <git.rover@outlook.com>
## Problem

`GPU_WITGEN,CACHE=1` can produce witness traces on GPU, but the prover
path still needs a clean device-resident commit flow. The goal is to
keep GPU-generated witness data usable through commit without falling
back to replay/deferred raw-cache logic or unnecessary host
materialization.

## Design Rationale

This PR treats GPU witness output as the source of truth for the commit
path: traces are normalized into device-backed row-major metadata,
committed through the GPU PCS path, and released once q'/commit no
longer needs the raw backing. The post-commit proving flow stays aligned
with the existing `CPU_WITGEN` path so correctness-sensitive transcript,
opening, and proof assembly logic remain shared.

The design avoids retaining replay plans as a second witness source.
This keeps ownership simpler: GPU witness generation owns raw device
buffers until q'/commit construction, then releases them before chip
proving pressure grows.

## Change Highlights

- `ceno_zkvm`: add GPU witness/device-backed trace commit path for
`GPU_WITGEN,CACHE=1`.
- `ceno_zkvm`: keep post-commit proving and opening flow shared with the
existing GPU prover path.
- `ceno_zkvm`: release shard GPU witness caches after proof
construction.
- `gkr_iop`: support GPU-side batched main-constraint proving
integration.

## CI Benchmark Summary

Compared CI benchmark runs:

- `GPU_WITGEN`: original PR benchmark numbers, kept for context.
- `CPU_WITGEN`:
[`26067686212`](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/26067686212),
branch `feat/witgen_gpu`, `CENO_GPU_ENABLE_WITGEN=0`
- `CPU_WITGEN (baseline)`:
[`26037135648`](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/26037135648),
branch `feat/update_dep`, `CENO_GPU_ENABLE_WITGEN=0`

| Metric | GPU_WITGEN | CPU_WITGEN | CPU_WITGEN (baseline) | Notes |
|---|---:|---:|---:|---|
| reth-block E2E | 111s | 80.2s | 83.2s | CPU_WITGEN feature branch is fastest |
| app.prove | 107s | 65.6s | 68.2s | CPU_WITGEN feature branch improves 2.6s vs baseline |
| app_prove.inner | 96.6s | 65.6s | 68.2s | Same trend as app.prove |
| Witness total | 35.43s | 40.85s | 39.84s | GPU_WITGEN remains faster raw witness gen |
| Proof total | 60.70s | 62.15s | 64.78s | CPU_WITGEN feature branch improves proof total vs baseline |
| commit_traces total | 12.35s | 17.410s | 17.450s | GPU_WITGEN commit path remains faster |
| commit_traces avg/shard | 950ms | n/a | n/a | Original GPU_WITGEN per-shard metric kept |
| prove_tower_relation_gpu total | n/a | 119.624s | 22.515s | Nested/overlapped span increased in feature run |
| prove_batched_main_constraints total | n/a | 7.934s | 7.639s | Slight CPU_WITGEN regression |
| pcs_opening total | 9.91s | 9.857s | 10.061s | Stable |
| q commit total | 8.25s device_q | n/a | n/a | Original GPU_WITGEN q metric kept |
| q commit avg/shard | 634ms | n/a | n/a | Original GPU_WITGEN q metric kept |
| q inner commit avg | 449ms | n/a | n/a | Original GPU_WITGEN q metric kept |
| CPU/GPU overlap gap | n/a | 3.170s | 3.200s | CPU_WITGEN overlap unchanged |
| Overall result | 111s | 80.2s | 83.2s | CPU_WITGEN feature branch beats baseline; GPU_WITGEN still loses overall due to lost overlap |

| Conclusion | Evidence |
|---|---|
| GPU_WITGEN still improves commit/witness subpaths | Original GPU_WITGEN has faster witness total and commit_traces than CPU_WITGEN |
| GPU_WITGEN still loses overall | 111s E2E vs 80.2s CPU_WITGEN due to lost shard witness/proof overlap |
| CPU_WITGEN feature branch is slightly faster than CPU_WITGEN baseline | reth-block improves by 3.0s; app.prove improves by 2.6s |
| Commit/opening path is stable for CPU_WITGEN | commit_traces and pcs_opening are within ~0.2s across CPU runs |

## Benchmark / Performance Impact

This is performance-sensitive. CI benchmark runs are used for comparable
end-to-end numbers because local wall time depends heavily on runner
scheduling and GPU availability.

### Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| Reth proving benchmark | See benchmark CI | See benchmark CI | See benchmark CI |

### Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
|-------|------------|-------------|-----------------------------|
| Witness commit/q' path | Host/materialized path | Device-backed GPU path | Reduces host materialization and extra copies |
| Post-commit proving | Existing GPU flow | Existing GPU flow | Intended to remain unchanged |

Benchmark command(s):

```sh
# ceno-reth-benchmark CI, GPU_WITGEN,CACHE=1 and CPU_WITGEN,CACHE=1 comparison runs
```

Environment (CPU/GPU, core count, rust toolchain, commit hash):

CI benchmark runner metadata and commit hashes are recorded in the
linked workflow runs.

raw data:

- master: benchmark CI artifacts
- this PR: benchmark CI artifacts

## Testing

```sh
cargo fmt --check
cargo check -p ceno_zkvm --features 'gpu,u16limb_circuit' --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".cuda_hal.path="../ceno-gpu/cuda_hal"'
```

## Risks and Rollout

- Main risk is lifetime/ownership mistakes around device-backed witness
buffers; the rollout keeps release points explicit and avoids replay
cache ownership.
- If regressions appear, disable `CENO_GPU_ENABLE_WITGEN` to return to
the existing `CPU_WITGEN` GPU proving path.

## Follow-ups (optional)

- Continue profiling per-chip GPU witness generation and q'
construction.
- Add scheduler-level overlap once device memory booking is precise
enough.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
## Problem

Ceno opens witness and fixed Jagged commitments in the same proof, but
the previous integration still paid for separate inner Basefold
openings. That duplicates inner query/opening proof bytes even though
witness/fixed can share the same inner Basefold query set.

## Design Rationale

Keep the outer Jagged protocol literally separate while sharing only the
inner Basefold opening:

- witness and fixed keep separate Jagged commitments and separate Merkle
roots
- witness and fixed keep separate Jagged sumcheck/assist rounds
- each round keeps its own q' shape and reshape height from its existing
commitment lifecycle
- Ceno collects each Jagged round's inner opening claims and calls one
Basefold `batch_open_with_trace_materializer`
- the resulting `JaggedProof` contains all Jagged rounds plus one
required shared `inner_proof`

This is intentionally surgical: it does not refactor witness lifecycle,
q' ownership, q' residency, or fixed/witness commit paths. The only
integration change is moving inner Basefold opening from
per-Jagged-round execution to one batched inner opening after all Jagged
reductions.

Soundness/correctness rationale: prover and verifier transcript order is
aligned with gkr-backend: absorb all Jagged round reductions first, then
absorb/verify one inner Basefold opening over all inner claims.
Commitments remain independent, so sharing the inner proof does not
merge witness/fixed roots.
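
The control-flow change (per-round inner openings replaced by one batched opening after all Jagged reductions) can be sketched with toy types. Everything here is hypothetical: `InnerClaim`, `JaggedRoundOutput`, `ToyInnerPcs`, and `open_shared` are simplified stand-ins, and the placeholder "proof bytes" carry no cryptographic meaning. The real shapes live in gkr-backend and are not reproduced in this description.

```rust
/// Hypothetical inner opening claim produced by one Jagged round.
#[derive(Debug, Clone, PartialEq)]
pub struct InnerClaim {
    pub commitment: &'static str, // which commitment the claim refers to
    pub point: Vec<u64>,
    pub eval: u64,
}

/// Hypothetical output of one Jagged reduction (witness or fixed).
pub struct JaggedRoundOutput {
    pub round_proof: Vec<u8>,
    pub inner_claims: Vec<InnerClaim>,
}

/// Toy proof: all Jagged rounds plus exactly one shared inner proof.
pub struct ToyJaggedProof {
    pub rounds: Vec<Vec<u8>>,
    pub inner_proof: Vec<u8>,
}

/// Stand-in for the inner PCS; it only counts how often it is asked to open.
pub struct ToyInnerPcs {
    pub open_calls: usize,
}

impl ToyInnerPcs {
    pub fn batch_open(&mut self, claims: &[InnerClaim]) -> Vec<u8> {
        self.open_calls += 1;
        // Placeholder bytes, one per claim; not a real opening proof.
        claims.iter().map(|c| (c.eval % 256) as u8).collect()
    }
}

/// Run all Jagged rounds first, then open every inner claim in one call.
pub fn open_shared(pcs: &mut ToyInnerPcs, rounds: Vec<JaggedRoundOutput>) -> ToyJaggedProof {
    let mut round_proofs = Vec::with_capacity(rounds.len());
    let mut all_claims = Vec::new();
    for r in rounds {
        round_proofs.push(r.round_proof);
        all_claims.extend(r.inner_claims);
    }
    let inner_proof = pcs.batch_open(&all_claims);
    ToyJaggedProof {
        rounds: round_proofs,
        inner_proof,
    }
}
```

The claims keep their own `commitment` tag, mirroring the point that witness and fixed roots stay independent even though the inner proof is shared.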

## Change Highlights

- `ceno_zkvm/src/scheme/gpu/mod.rs`
  - GPU Jagged opening now collects `(round_proof, rho_row, col_evals)`
    for each Jagged round.
  - Builds per-round inner opening claims without changing q' lifecycle.
  - Calls one shared GPU Basefold `batch_open_with_trace_materializer` for
    witness/fixed inner claims.
- `ceno_zkvm/src/scheme.rs`
  - Extends existing proof-size display to include PCS-specific nested
    breakdowns.
- `Cargo.toml`
  - Pins `gkr-backend` and `ceno-gpu` dependencies to
    `feat/jagged_single_commit`.

## Benchmark / Performance Impact

### Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| `reth-block` | 11.143 | 11.087 | +0.056s |
| `app.prove` | 10.579 | 10.559 | +0.020s |
| `create_proof_of_shard` | 6.315 | 6.509 | -0.194s |
| `commit_traces` | 1.354 | 1.535 | -0.181s |
| `pcs_opening` | 1.492 | 1.483 | +0.009s |
| shard-0 proof size | 6.15 MiB | 5.48 MiB | +0.67 MiB smaller (-10.96%) |

### Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
|-------|------------|-------------|-----------------------------|
| Basefold witness commit/query | 0.423 / 0.270 | 0.427 / 0.269 | approximately flat |
| Basefold fixed commit/query | 0.0147 / 0.00325 | folded into shared opening | removes separate inner proof |
| Jagged outer rounds | separate witness/fixed | separate witness/fixed | unchanged by design |

### Proof-size breakdown, shard 0

Sizes are MiB, computed as bytes / 2^20. Percent is relative to the
after proof file (`5.48 MiB`).

| Component | Size | % of Proof |
|---|---:|---:|
| Proof file `app_proof.bitcode` | 5.48 MiB | 100.00% |
| `mpcs_opening.total` | 1.09 MiB | 19.95% |
| `mpcs_opening.rounds` | 0.008 MiB | 0.14% |
| `round[0]` witness Jagged round | 0.005 MiB | 0.09% |
| `round[1]` fixed Jagged round | 0.003 MiB | 0.05% |
| shared `inner_proof` | 1.09 MiB | 19.81% |
| `inner_proof.query_opening_proof` | 1.08 MiB | 19.78% |
| `inner_proof.commits` | 0.001 MiB | 0.01% |
| `inner_proof.sumcheck_proof` | 0.001 MiB | 0.02% |
| `inner_proof.final_message` | 0.0003 MiB | 0.01% |

Raw proof-size data:

| Metric | Before | After | Delta |
|---|---:|---:|---:|
| Shard 0 proof file | 6,453,943 B | 5,746,603 B | -707,340 B (-10.96%) |
| Single proof object | n/a | 5,746,602 B | n/a |
| `mpcs_opening.total` | n/a | 1,146,525 B | n/a |
| shared `inner_proof` | n/a | 1,138,405 B | n/a |
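
The MiB and percentage columns can be reproduced from the raw byte counts with two small helpers. `mib` and `percent_of` are hypothetical names introduced here for illustration.

```rust
/// Convert bytes to MiB (bytes / 2^20).
pub fn mib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 20) as f64
}

/// `part` as a percentage of `whole`; `whole` is assumed to be non-zero.
pub fn percent_of(part: u64, whole: u64) -> f64 {
    100.0 * part as f64 / whole as f64
}
```

For example, `mib(5_746_603)` rounds to `5.48`, and `percent_of(1_146_525, 5_746_603)` rounds to `19.95`, matching the `mpcs_opening.total` row.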

## Testing

```sh
cargo fmt -p ceno_zkvm
cargo check -p ceno_zkvm --features gpu
```

Additional dependency checks:

```sh
# gkr-backend
cargo fmt -p mpcs
cargo check -p mpcs
cargo check -p mpcs --all-targets

# ceno-gpu
cargo fmt -p cuda_hal
cargo check -p cuda_hal --features bb31
```

E2E validation:

- `../ceno-reth-benchmark`, block `23587691`, shard `0`, GPU,
`prove-app`, verifier passed.

## Risks and Rollout

Risk is transcript/order mismatch because inner Basefold proof
generation moved out of each Jagged round and into one shared call after
all Jagged reductions. The verifier mirrors this order in gkr-backend,
and shard-0 e2e verification passed.

Performance risk is low: proof size improves materially while
`pcs_opening` is flat on the measured shard. `commit_traces` varied
upward in this run, but this PR does not change commit lifecycle or q'
materialization.

Rollback is localized: restore the previous per-Jagged-round inner
opening path and the old dependency pins.

## Follow-ups (optional)

None required for this PR. Broader cleanup can later remove temporary
local benchmark patching once dependency branches are merged.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
## Problem

Issue #1338 reproduces a soundness break on `master`. For the same RISC-V
execution, the base verifier *and* the recursion verifier both accept two
distinct proof batches whose public per-shard `shard_rw_sum` values differ
on all 17 shards. The attacker takes an honest witness, replaces every
cross-shard EC accumulator leaf `(x, y)` with its inverse `(x, -y)`,
updates `shard_rw_sum`, and reproves.

Root cause: `ceno_zkvm/src/tables/shard_ram.rs:276-281` was a TODO. The
host code in `ShardRamRecord::to_ec_point` encodes read vs write in the
sign of `y[6]`, but the circuit only constrained the curve equation and
the EC sum; it never tied `y[6]`'s half-of-field to `is_global_write`.
Both `(x, y)` and `(x, -y)` satisfied every existing check, so the public
summary of cross-shard RAM flow was unbound.

The defect survives recursion (the reporter's PoC verifies through the
recursion verifier program).

## Design Rationale

Approach borrows the **idea** from SP1's
`crates/core/machine/src/operations/global_interaction.rs:210-236`,
not its column layout. Three pieces:

1. **Offset by +1.** Express `y[6]` in terms of a fresh witness `y6_lo`
   so `y[6] = 0` is never valid in either branch (zero is invariant under
   negation, so it could not distinguish read from write).
2. **Safe band + prover retry.** Restrict `y6_lo` to `[0, (p-1)/2)`. For
   the rare exception `y[6] = 0` (probability `~1/p ≈ 2^-31` per record)
   the host rejects and retries with a new `nonce`.
3. **Byte-decomposition range check.** `y6_lo` decomposed into four byte
limbs `b0..b3` (`assert_byte` for `b0..b2`, `lookup_ltu_byte(b3, 60, 1)`
   for `b3`). For BabyBear, `(p-1)/2 = 60·2^24` exactly, so `b3 < 60`
   gives the tightest no-overlap band.

In-circuit branch equality via `condition_require_equal`:

- read (`is_global_write = 0`): `y[6] = y6_lo + 1` ⇒ `y[6] ∈ [1, (p-1)/2]`
- write (`is_global_write = 1`): `y[6] = p - 1 - y6_lo` ⇒ `y[6] ∈ [(p+1)/2, p-1]`

Union covers `[1, p-1]` with no overlap; `y[6] = 0` is excluded.
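
The two branch equations and the byte-limb band can be sketched on the host side as follows. This is a minimal illustration, not the code in `shard_ram.rs`: the function names `y6_lo_value` (borrowed from the helper mentioned in this description, with an assumed signature) and `in_safe_band` are hypothetical, and field elements are modelled as plain canonical `u32` values.

```rust
/// BabyBear modulus, as quoted in this description (`p = 0x78000001`).
const P: u32 = 0x7800_0001;
/// Exclusive bound on the top byte of `y6_lo`, since `(p-1)/2 = 60 * 2^24`.
const M: u32 = 60;

/// Hypothetical helper: recover `y6_lo` from a canonical `y[6]` and the
/// branch flag. Returns `None` when `y[6]` is outside that branch's range.
fn y6_lo_value(y6: u32, is_global_write: bool) -> Option<u32> {
    let half = (P - 1) / 2;
    if is_global_write {
        // write: y[6] = p - 1 - y6_lo, so y[6] must lie in [(p+1)/2, p-1]
        (y6 > half && y6 < P).then(|| P - 1 - y6)
    } else {
        // read: y[6] = y6_lo + 1, so y[6] must lie in [1, (p-1)/2]
        (y6 >= 1 && y6 <= half).then(|| y6 - 1)
    }
}

/// Mirror of the range check on the limbs: b0..b2 are bytes by
/// construction here, and the top byte b3 must be below 60.
fn in_safe_band(y6_lo: u32) -> bool {
    let [_b0, _b1, _b2, b3] = y6_lo.to_le_bytes();
    u32::from(b3) < M
}
```

Under these assumptions, negating an honest read value `y[6] = 5` gives `p - 5`, which only encodes under the write branch; forcing it through the read equation would need `y6_lo = p - 6`, whose top byte fails the `b3 < 60` check.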

**Why not a single `AssertLtConfig(y6_lo, (p-1)/2, max_bits=30)`?**
On BabyBear (`p = 0x78000001`, 31-bit) the AssertLt gadget only
constrains `lhs - rhs ≡ diff - 2^max_bits (mod p)` with
`diff ∈ [0, 2^30)`; it does not pre-bound `lhs` to be canonical-small.
A malicious `y6_lo ∈ [0x74000001, p-1]` (≈ 2^26 values) produces a
*field-wrap* diff that still fits in 30 bits, so the constraint accepts
upper-half values and the exploit survives. Byte-decomposing first kills
the wrap. Ceno's `DynamicRangeTableCircuit<E, 18>` also does not carry
30-bit lookup entries, so a direct `assert_const_range(_, 30)` is not
available anyway.
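
The wrap-around can be checked numerically. The sketch below models only the relation stated above (`lhs - rhs ≡ diff - 2^max_bits (mod p)` with `diff < 2^30`), not the real `AssertLtConfig` gadget; both function names are hypothetical.

```rust
/// BabyBear modulus and the comparison bound `(p-1)/2`.
const P: u64 = 0x7800_0001;
const MAX_BITS: u32 = 30;
const RHS: u64 = (P - 1) / 2;

/// Model of the wrap-prone check: is there a `diff` in `[0, 2^30)` with
/// `lhs - rhs ≡ diff - 2^30 (mod p)`? `lhs` is assumed canonical (`< p`).
fn wrap_only_check_accepts(lhs: u64) -> bool {
    // Solve for diff in the field: diff = lhs - rhs + 2^max_bits (mod p).
    let diff = (lhs + P - RHS + (1u64 << MAX_BITS)) % P;
    diff < (1u64 << MAX_BITS)
}

/// Model of the byte-decomposition check: four byte limbs with top byte < 60,
/// which is equivalent to `lhs < 60 * 2^24`.
fn byte_decomposition_accepts(lhs: u64) -> bool {
    lhs < (1u64 << 32) && (lhs >> 24) < 60
}
```

In this model the first check accepts every value in `[0, (p-1)/2)` but also every value in `[0x74000001, p-1]`, while the byte-limb check rejects that upper range.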

**Why M = 60 (vs SP1's 63).** SP1 targets KoalaBear; its `(p-1)/2 =
0x3f800000`, so 63 leaves a small safety band. For BabyBear,
`(p-1)/2 = 60·2^24` exactly — 63 would let `y[6]` straddle `p/2` and
reintroduce the ambiguity.
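
The constants can be cross-checked with a few lines of arithmetic. This is an illustrative sketch only; the KoalaBear modulus `0x7f000001` is inferred from the `(p-1)/2 = 0x3f800000` value quoted above, and the helper names are hypothetical.

```rust
const BABYBEAR_P: u32 = 0x7800_0001;
/// Assumed from the quoted `(p-1)/2 = 0x3f800000`.
const KOALABEAR_P: u32 = 0x7f00_0001;

/// Largest `y6_lo` admitted when the top byte is bounded by `b3 < m`.
fn max_y6_lo(m: u32) -> u32 {
    (m << 24) - 1
}

/// True if bounding the top byte by `m` keeps the read range
/// `y6_lo + 1` inside `[1, (p-1)/2]`.
fn band_stays_in_lower_half(p: u32, m: u32) -> bool {
    max_y6_lo(m) + 1 <= (p - 1) / 2
}
```

With these definitions, `m = 63` is safe for KoalaBear but not for BabyBear, where only `m = 60` (or smaller) keeps the band in the lower half.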

Also corrects the stale comment that previously had the convention
reversed (claimed write ⇒ lower half, opposite of what the host code
does).

## Change Highlights

### `ceno_zkvm/src/tables/shard_ram.rs` — chip-level y-sign binding

- `ShardRamRecord::to_ec_point`: reject `y6 == 0` and try the next
  `nonce`. Classify with strict `y6 > prime / 2` so the boundary
  `(p-1)/2` correctly stays in the read region (a previous draft used
  `>=` which misclassified that single boundary value and would have
  produced an out-of-range `y6_lo` for both branches).
- `ShardRamConfig`: new field `y6_lo_bytes: [WitIn; 4]`.
- `ShardRamConfig::configure`: replace the TODO with the byte
  decomposition, byte-range / LTU lookups, and the
  `condition_require_equal` branch equality.
- `ShardRamCircuit::assign_instance`: compute `y6_lo` from `y[6]` and
  `is_to_write_set` via a small `y6_lo_value` helper, assign byte
  limbs, register byte and LTU multiplicities.
- New test `test_shard_ram_y_sign_circuit_rejects_negation` drives
  `assign_instances_with_lk_multiplicities` + `MockProver` over one
  honest row and one sign-flipped row, asserting `lookup_Ltu` rejects
  the tampered witness. A concrete challenge is supplied so the
  no-challenge `run` path doesn't drop `structural_witin`.

### Lookup-multiplicity plumbing for ShardRam

ShardRam's per-row y6_lo byte / LTU lookups must reach
`combined_lk_mlt` so the U8 / LTU table `mlt` columns balance.
ShardRam runs after opcode + dummy circuits, before
`finalize_lk_multiplicities`. To surface mlt without burdening every
other table circuit:

- `ceno_zkvm/src/tables/mod.rs`: `TableCircuit` trait gains a second
  default-unimplemented method
  `assign_instances_with_lk_multiplicities` alongside the existing
  `assign_instances`. ShardRam overrides the former; every other
  table keeps overriding the latter.
- `ceno_zkvm/src/structs.rs`: `ZKVMWitnesses::assign_shared_circuit`
  threads a `LkMultiplicity::default()` through ShardRam's
  parallel-chunk witgen and inserts
  `lk_multiplicity.into_finalize_result()` into
  `lk_mlts["ShardRamCircuit"]` before finalize. Asserts swap from
  `combined_lk_mlt.is_some()` to `is_none()` to lock the ordering.
  `assign_table_circuit` tolerates `combined_lk_mlt = None` by
  passing an empty multiplicity slice, so `LocalFinalCircuit` (which
  ignores the argument anyway) can also run before finalize.
- `ceno_zkvm/src/e2e.rs`: move
  `MmuConfig::assign_continuation_circuit` (LocalFinal + ShardRam) to
  just before `finalize_lk_multiplicities`. Mirror the move inside
  the GPU debug-compare block so `combined_lk_mlt` diff stays
  meaningful.
- `ceno_zkvm/src/instructions/riscv/rv32im/mmu.rs`: docstring updated
  to describe the new ordering invariant.

### Device-resident GPU shortcut for ShardRam (mlt mirror)

`ZKVMWitnesses::try_assign_shared_circuit_gpu` dispatches into
`instructions::gpu::chips::shard_ram::try_gpu_assign_shared_circuit`
to keep the continuation EC computation device-resident
(`gpu_batch_continuation_ec_on_device` + `merge_and_partition_records`)
when `is_gpu_witgen_enabled()`. The GPU kernels never enter the CPU
`assign_instance` per-row push, so the y6_lo lookup multiplicity is
derived host-side:

- After step 6 of `try_gpu_assign_shared_circuit` (merge+partition),
  D2H `partitioned_buf` once to `Vec<u32>` and walk it with stride
  `record_u32s = 26` (`GpuShardRamRecord` `#[repr(C)]` layout).
  Per record extract `is_to_write_set` (u32 offset 10) and
  `point_y[6]` (u32 offset 25), compute `y6_lo`, push the same
  4 lookup queries the CPU path emits per row, then
  `into_finalize_result()` and return alongside the chunked
  `Vec<ChipInput<E>>`. `debug_assert_eq!(record_u32s, 26)` guards
  against `ceno_gpu` layout drift.
- `try_assign_shared_circuit_gpu` inserts both `ChipInput` and the
  derived multiplicity into `self.witnesses` /
  `self.lk_mlts["ShardRamCircuit"]` so finalize folds the GPU-path
  contribution into `combined_lk_mlt` the same way the CPU shortcut
  does.
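
The host-side walk over the copied buffer could look roughly like the sketch below. It assumes only what is stated above (stride 26, `is_to_write_set` at u32 offset 10, `point_y[6]` at u32 offset 25); the `Y6LoMultiplicity` struct and `derive_y6_lo_multiplicity` function are hypothetical stand-ins for the real `LkMultiplicity` plumbing.

```rust
/// BabyBear modulus, as quoted in this description.
const P: u32 = 0x7800_0001;
/// Layout facts taken from the text above.
const RECORD_U32S: usize = 26;
const IS_TO_WRITE_SET_OFFSET: usize = 10;
const POINT_Y6_OFFSET: usize = 25;

/// Hypothetical tally of the four per-record lookup queries.
struct Y6LoMultiplicity {
    /// Hits for the byte range check on b0..b2, keyed by byte value.
    byte: [u64; 256],
    /// Hits for the `b3 < 60` LTU lookup, keyed by b3.
    ltu_b3: [u64; 256],
}

fn derive_y6_lo_multiplicity(buf: &[u32]) -> Y6LoMultiplicity {
    assert_eq!(buf.len() % RECORD_U32S, 0, "buffer must hold whole records");
    let mut mlt = Y6LoMultiplicity { byte: [0; 256], ltu_b3: [0; 256] };
    for rec in buf.chunks_exact(RECORD_U32S) {
        let is_write = rec[IS_TO_WRITE_SET_OFFSET] != 0;
        let y6 = rec[POINT_Y6_OFFSET];
        // write: y6_lo = p - 1 - y[6]; read: y6_lo = y[6] - 1
        let y6_lo_opt = if is_write {
            (P - 1).checked_sub(y6)
        } else {
            y6.checked_sub(1)
        };
        let y6_lo = y6_lo_opt.expect("honest records have y[6] in [1, p-1]");
        let [b0, b1, b2, b3] = y6_lo.to_le_bytes();
        for b in [b0, b1, b2] {
            mlt.byte[usize::from(b)] += 1;
        }
        mlt.ltu_b3[usize::from(b3)] += 1;
    }
    mlt
}
```

Each record contributes exactly three byte-range hits and one LTU hit, matching the four queries the CPU path is described as emitting per row.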

### Verifier: account for `has_ecc_ops` row doubling

`ShardRamCircuit::has_ecc_ops()` adds an extra hypercube variable;
the chip matrix has `2 * next_pow2(num_instance)` rows where the
back half is EC-tree internal nodes with `selector_zero = 0`. Before
this fix the chip had `num_lks = 0`, so the verifier's
`dummy_table_item_multiplicity` correction never had to consider it.
With the new byte/LTU queries the correction under-counted dummy
lookups by a factor of 2 and shard verification failed with
`logup_sum != 0`.

- `ceno_zkvm/src/scheme/verifier.rs`: multiply `next_pow2_instance`
  by 2 when `circuit_vk.get_cs().has_ecc_ops()`.
- `ceno_recursion/src/zkvm_verifier/verifier.rs`: mirror the same
  adjustment in the recursive verifier (lockstep per CLAUDE.md).
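
As a rough illustration of the row-count adjustment, assuming the correction is proportional to the number of padding rows (the real `dummy_table_item_multiplicity` computation may weight rows differently), with hypothetical helper names:

```rust
/// Number of rows in the chip matrix: `next_pow2(num_instances)`, doubled
/// when the circuit carries the extra EC-tree half (`has_ecc_ops`).
fn padded_rows(num_instances: usize, has_ecc_ops: bool) -> usize {
    let next_pow2_instance = num_instances.next_power_of_two();
    if has_ecc_ops {
        2 * next_pow2_instance
    } else {
        next_pow2_instance
    }
}

/// Rows that hold no real instance and therefore only emit dummy lookups
/// (an assumption of this sketch).
fn dummy_rows(num_instances: usize, has_ecc_ops: bool) -> usize {
    padded_rows(num_instances, has_ecc_ops) - num_instances
}
```

For 5 instances, the unadjusted count sees 3 padding rows while the adjusted count sees 11, which is the kind of gap that left `logup_sum` non-zero.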

## Benchmark / Performance Impact

Per ShardRam row this PR adds **4 byte WitIn columns** plus 3 byte-range
and 1 LTU lookup multiplicities. ShardRam rows scale with cross-shard
RAM events, not with cycles, so the absolute cost is sub-percent on the
prover. No full prover bench was rerun (no hot-loop arithmetic changed).

Existing `test_shard_ram_circuit` (170k reads + 1420 writes, full chip
proof) runtime is unchanged within noise:

```text
master   : ~5.0 s
this PR  : ~5.0 s
```

## Testing

```sh
cargo fmt --all --check
cargo check --workspace --all-targets
cargo check --workspace --all-targets --release
cargo make clippy
cargo clippy --workspace --all-targets --release -- -D warnings
RUST_MIN_STACK=33554432 cargo test --workspace --lib --release
cargo run --release --package ceno_zkvm --features sanity-check --bin e2e -- \
  --platform=ceno --max-cycle-per-shard=20000 --hints=10 --public-io=4191 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/fibonacci
```

All pass locally on BabyBear. `test_shard_ram_circuit` and
`test_shard_ram_y_sign_circuit_rejects_negation` are green. End-to-end
multi-shard fibonacci verifies `ShardRamCircuit` and `LocalRAMTableFinal`
on every shard with `exit code 0. Success.`

`cargo make tests` / `cargo make tests_goldilock` should be re-run by
CI; the change is gated to BabyBear via a `debug_assert_eq!` on
`MODULUS_U64` and goldilocks does not exercise shard_ram (per
`integration.yml` commented-out lines and CLAUDE.md).

## Risks and Rollout

- **Soundness.** Closes #1338. The new constraint only adds local byte
  arithmetic and existing lookups — no change to transcript, sumcheck,
  PCS, or EC accumulation. Recursive and native verifiers move in
  lockstep (the `has_ecc_ops` row-factor fix lands in both).
- **GPU.** The device-resident GPU shortcut now derives the y6_lo
  lookup multiplicity host-side from the merged partitioned device
  buffer (single D2H of ~26 u32 × records). Layout assumption is
  guarded by `debug_assert_eq!(record_u32s, 26)` against
  `ceno_gpu::GpuShardRamRecord`. CPU + GPU paths converge on the same
  `combined_lk_mlt` contribution; runtime verification with
  `CENO_GPU_ENABLE_WITGEN=1 --features gpu` on a CUDA host is
  recommended before tag.
- **Recursion.** The recursive verifier mirrors the native verifier's
  `has_ecc_ops` × 2 row adjustment; no separate constraint-system
  change is needed for the y-sign binding itself.
- **Field support.** Hardcodes the BabyBear constant `M = 60`. A
  `debug_assert_eq!(MODULUS_U64, 0x78000001, ...)` guards against
  accidental use on a different field; shard_ram is BabyBear-only
  today per CLAUDE.md.

## Follow-ups

- The remaining #1340 TODOs (`local read ⇄ global write` pairing on
  `shard_ram.rs:235-236`, `shard == shard_id` binding on line 244) are
  intentionally out of scope here.

Fixes #1338.
Partially addresses #1340.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
## Problem

Left over from #923: the CPU trace commit cloned large witness MLEs
before proving, adding avoidable memory traffic on the prover hot path.

## Design Rationale

Keep committed witness MLEs behind `Arc` and drain/transport ownership
where possible, avoiding deep clones without changing proof semantics.
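
The ownership pattern can be sketched in isolation. This is not the `ceno_zkvm` code: `Mle` is a stand-in type alias and the three function names are hypothetical. It only shows sharing via `Arc` instead of deep cloning, and draining a vector to move ownership.

```rust
use std::sync::Arc;

/// Stand-in for a large witness MLE (assumption: any heap-heavy buffer).
type Mle = Vec<u64>;

/// Wrap each committed witness MLE in an `Arc` once, at commit time.
fn commit(witness: Vec<Mle>) -> Vec<Arc<Mle>> {
    witness.into_iter().map(Arc::new).collect()
}

/// Hand the same MLEs to another consumer: only reference counts change,
/// the underlying buffers are not copied.
fn share(mles: &[Arc<Mle>]) -> Vec<Arc<Mle>> {
    mles.iter().map(Arc::clone).collect()
}

/// Move structural MLEs out of their container instead of cloning them.
fn transport(structural: &mut Vec<Mle>) -> Vec<Mle> {
    structural.drain(..).collect()
}
```

`Arc::ptr_eq` on a committed MLE and its shared handle confirms that both point at the same allocation.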

## Change Highlights

- `ceno_zkvm`: return `Arc` witness MLEs from trace commit and consume
structural MLEs during transport.
- `ceno_zkvm`: keep GPU trait shape aligned while preserving existing
GPU behavior.

## Benchmark / Performance Impact

### Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| CPU proving, keccak e2e shard total | 6.942 | 6.596 | 4.98% faster |
| GPU proving, keccak e2e shard total | 1.191 | 1.186 | 0.44% faster |

### Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
|-------|------------|-------------|-----------------------------|
| N/A: shard-level proving total measured | N/A | N/A | no regression observed |

Benchmark command(s):

```sh
cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

Environment: local x86_64 Linux, release build, local
`../ceno-gpu/cuda_hal` patch for GPU validation.

raw data:

- master: CPU shards `3.272s + 3.670s`; GPU shards `0.624s + 0.568s`
- this PR: CPU shards `3.336s + 3.260s`; GPU shards `0.593s + 0.593s`

## Testing

```sh
cargo check --config net.git-fetch-with-cli=true --package ceno_zkvm --bin e2e
cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

## Risks and Rollout

Low risk: prover-side ownership change only. Rollback is reverting the
`Arc` witness-MLE plumbing.

## Follow-ups (optional)

None.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
## What

A self-hosted **GPU CI runner** for Ceno, plus a workflow that runs GPU
integration tests on it.

## `ci/gpu-runner/` — the runner

- **Dockerfile**: `nvidia/cuda:12.8-devel-ubuntu24.04` + actions runner
+ pinned Rust toolchain. No secrets baked in.
- **entrypoint.sh**: mints a fresh registration token from a PAT each
start (survives restarts); runs an **ephemeral** runner (clean container
per job).
- **start-runner.sh**: builds + `docker run --gpus all`; persists cargo
registry and `CARGO_TARGET_DIR` on volumes for warm rebuilds.
- **watchdog.sh**: cron-driven; restarts the container on stop/crash or
GPU-unreachable. `flock`-guarded against overlap.
- **README.md**: host setup, secrets, cron registration.

## `.github/workflows/gpu-integration.yml` — the test

Proves an example (default `keccak_syscall`) with `--features gpu` on
the `gpu`-labeled runner. Triggered manually (`workflow_dispatch`) or by
a `gpu-ci` PR label. Release-only by design (debug + release would
compile the workspace twice). Steps: load `CENO_GPU_DEPLOY_KEY` → clone
private `ceno-gpu` (`--recurse-submodules`) → activate its Cargo
`[patch]` → build + prove.

## Secrets

- **`GITHUB_PAT`** (host `runner.env`, gitignored): registers the
runner.
- **`CENO_GPU_DEPLOY_KEY`** (repo secret): read-only deploy key for the
private `ceno-gpu` backend, loaded per-job via ssh-agent.

No existing workflows are modified.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
#1349)

Closes #356.

## Summary

- Add `ZKVMVerifyingKey::compute_digest` /
`ZKVMProvingKey::compute_vk_digest`: one `bincode::serialize` pass over
the VK preimage fed to a fresh `DefaultChallenger` (Poseidon sponge),
squeezing `VK_DIGEST_LEN = 2` extension-field elements.
- Absorb those felts into the protocol transcript at the start of both
prover and verifier (before public-input absorption).
- Mirror the same absorption in the recursion verifier with the digest
baked in as in-circuit `builder.constant(...)` elements.
- Mark `ConstraintSystem.debug_map` (non-deterministic `HashMap`) and
`ZKVMVerifyingKey.circuit_index_to_name` (recoverable from `circuit_vks`
lex order, kept live for verifier error labels) `#[serde(skip)]` so
neither contaminates the canonical preimage bytes.

A swapped or mutated VK now flips the very first Fiat-Shamir challenge,
and verification rejects at the first sumcheck claim.
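
The effect on the first challenge can be illustrated with a toy transcript. This sketch deliberately swaps in the standard library's `DefaultHasher` for the Poseidon-based `DefaultChallenger`, and models the two digest elements as plain `u64` values; `ToyTranscript` and `first_challenge` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy stand-in for a Fiat-Shamir transcript (not a cryptographic sponge).
struct ToyTranscript {
    hasher: DefaultHasher,
}

impl ToyTranscript {
    fn new() -> Self {
        Self { hasher: DefaultHasher::new() }
    }

    /// Absorb field elements, modelled here as `u64` values.
    fn absorb(&mut self, felts: &[u64]) {
        for f in felts {
            self.hasher.write_u64(*f);
        }
    }

    /// Derive a challenge from everything absorbed so far.
    fn challenge(&self) -> u64 {
        self.hasher.finish()
    }
}

/// Absorb the VK digest first, then the public inputs, then squeeze.
fn first_challenge(vk_digest: &[u64; 2], public_inputs: &[u64]) -> u64 {
    let mut t = ToyTranscript::new();
    t.absorb(vk_digest);
    t.absorb(public_inputs);
    t.challenge()
}
```

Prover and verifier agree on the challenge only when they absorb the same digest; changing either digest element changes the value that every later check depends on.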

## Files touched

| File | Change |
| --- | --- |
| `ceno_zkvm/src/structs.rs` | digest helpers + `#[serde(skip)] circuit_index_to_name` |
| `ceno_zkvm/src/scheme/prover.rs` | absorb `vk_digest` into transcript |
| `ceno_zkvm/src/scheme/verifier.rs` | absorb `vk_digest` into transcript |
| `ceno_recursion/src/zkvm_verifier/verifier.rs` | mirror absorb with build-time constants |
| `gkr_iop/src/circuit_builder.rs` | `#[serde(skip)] debug_map` |
| `ceno_zkvm/Cargo.toml` | add `poseidon` direct dep |

## Test plan

- [x] `cargo fmt --all --check`
- [x] `cargo check --workspace --all-targets` (debug + release)
- [x] `cargo make clippy` + release clippy
- [x] `cargo make tests` — 0 failed
- [x] `cargo make tests_goldilock` — 0 failed
- [x] Local integration e2e (mirror of
`.github/workflows/integration.yml`, 24 steps: fibonacci / ceno_rt_alloc
/ keccak / secp256k1 / bn254 / k256 / p256 / uint256 / sha / aggregation
e2e) — all 24 STEP OK; the aggregation e2e validates the recursion
mirror end-to-end.

## Compatibility

Proof format changes; any cached proofs / `vk_bytes` artifacts must be
regenerated. README marks the project pre-production, so a breaking
proof-format change is acceptable.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Problem

Dynamic RAM init could write non-zero values into committed witness
padding rows. That violates the expected zero-padding invariant for RMM
witness commitments and causes CPU shard verification to fail with
`InvalidPcsOpen`.

## Design Rationale

Keep the existing padded CPU proving flow and add the missing
dynamic-init guard so only real instances populate committed witness
rows. This preserves the RMM zero-padding contract without changing
verifier logic.

Related: scroll-tech/gkr-backend#62

## Change Highlights

- `ceno_zkvm`: guard dynamic RAM init witness writes with
  `i < num_instances` so padded witness rows remain zero.
- No verifier changes.
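
A minimal sketch of the guard, assuming a single-column witness and a hypothetical `assign_dynamic_init` function (the real assignment works on row-major matrices with several columns per row):

```rust
/// Fill a padded witness column from the real init records.
/// Rows at index `>= num_instances` are padding and must stay zero.
fn assign_dynamic_init(records: &[u32], padded_rows: usize) -> Vec<u32> {
    let num_instances = records.len();
    assert!(padded_rows >= num_instances, "padding cannot drop real rows");
    let mut witness = vec![0u32; padded_rows];
    for (i, cell) in witness.iter_mut().enumerate() {
        // The guard: only real instances populate committed witness rows.
        if i < num_instances {
            *cell = records[i];
        }
    }
    witness
}
```

With three records padded to eight rows, the last five cells remain zero, which is the invariant the commitment scheme is described as relying on.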

## Benchmark / Performance Impact

### Operation

CPU benchmark: block `23587691`, shard `0`,
`CENO_MAX_CELL_PER_SHARD=805306368`.

| Operation | baseline (s) | this PR (s) | Improve (before -> this PR) |
|-----------|--------------|-------------|-----------------------------|
| reth-block | 135 | 124 | 8.1% |
| app.prove | 134 | 123 | 8.2% |
| app.verify | 0.300 | 0.288 | 4.0% |

### Layer

| Layer | before: no guard + remote gkr (s) | this PR (s) | Improve (before -> this PR) |
|-------|-----------------------------------|-------------|-----------------------------|
| create_proof_of_shard | 130 | 119 | 8.5% |
| commit_traces | 17.2 | 10.3 | 40.1% |
| prove_batched_main_constraints | 32.4 | 32.0 | 1.2% |
| pcs_opening | 36.8 | 33.9 | 7.9% |

Benchmark command(s):

```sh
CENO_MAX_CELL_PER_SHARD=805306368 \
OUTPUT_PATH=metrics_23587691_shard0_cpu_dynamic_guard_no_gkrzero_maxcell805306368_20260608.json \
RUST_LOG=info \
target/release/ceno-reth-benchmark-bin \
  --block-number 23587691 \
  --chain-id 1 \
  --cache-dir block_data \
  --mode prove-app \
  --app-proofs ./app_proof.bitcode \
  --shard-id 0
```

Baseline used the same benchmark with Ceno patched to `902b3e3c^`
(`7d8086c2`) and remote `gkr-backend` tag `v1.0.0-alpha.31`.

Environment (CPU/GPU, core count, rust toolchain, commit hash):

- CPU shard run, `rustc 1.93.0-nightly (07bdbaedc 2025-11-19)`
- before: Ceno `7d8086c2`, remote `gkr-backend v1.0.0-alpha.31`
- this PR: Ceno `902b3e3c`

raw data:

- before:
`sanity_23587691_shard0_cpu_fresh_baseline_remote_gkr_no_guard_maxcell805306368_20260608.log`,
failed verification with `InvalidPcsOpen`
- this PR:
`sanity_23587691_shard0_cpu_dynamic_guard_no_gkrzero_maxcell805306368_20260608.log`,
passed verification

## Testing

```sh
cargo check --package ceno_zkvm
CENO_MAX_CELL_PER_SHARD=805306368 target/release/ceno-reth-benchmark-bin --block-number 23587691 --chain-id 1 --cache-dir block_data --mode prove-app --app-proofs ./app_proof.bitcode --shard-id 0
```

## Risks and Rollout

Low risk: the change only skips dynamic RAM init writes for rows outside
`num_instances`. Rollback is reverting this guard.

## Follow-ups (optional)

None.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.
## Problem

`HintsTable` is the **only** RAM table with prover-witnessed (non-zero) init
values: it holds the guest's private hint inputs.
`DynVolatileRamTableInitConfig::construct_circuit` created those init-value
limbs as raw `WitIn`s **with no range check**, so a malicious prover could
supply non-canonical limbs (`>= 2^16`).

The load path reads memory via `UInt::new_unchecked` (e.g. `LW` in
`load_v2.rs`) and forwards the limbs to the destination register
unconstrained, so a non-canonical hint word would propagate into the
computation. This is an **under-constraint soundness bug**: a crafted proof
using out-of-range hint limbs would verify.

## Design Rationale

Bind every non-zero-init limb to `LIMB_BITS` (u16) in `construct_circuit`,
making each reconstructed hint word a canonical u32, the same range
discipline every other prover-supplied word in the system already follows.
The fix is on the constraint side (the load-bearing surface for soundness);
per-row addresses are already formula-bound structural witins and the table
base (`hint_start_addr`) is already range-checked by the verifier's
`validate_mem_state`, so this closes the one remaining unconstrained axis
(the init **value** limbs).
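
Why unchecked limbs matter can be shown with plain integers. The sketch assumes a u32 word is split into two 16-bit limbs (consistent with `LIMB_BITS` being u16 above) and uses `u64` in place of field elements; both helper names are hypothetical.

```rust
const LIMB_BITS: u32 = 16;

/// The range check the fix adds: every limb must fit in `LIMB_BITS` bits.
fn limbs_are_canonical(limbs: [u64; 2]) -> bool {
    limbs.iter().all(|&l| l < (1u64 << LIMB_BITS))
}

/// Recombine limbs the way a word is reconstructed: lo + hi * 2^16.
fn reconstruct(limbs: [u64; 2]) -> u64 {
    limbs[0] + (limbs[1] << LIMB_BITS)
}
```

Without the check, the limb pairs `[2^16, 0]` and `[0, 1]` reconstruct to the same word, and `[0, 2^16]` reconstructs to a value that no longer fits in a u32; with the check, each u32 word has exactly one limb representation.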

A range check is a *looking* lookup, which the default
`TableCircuit::build_gkr_iop_circuit` does not wire for table circuits, so the
fix needs three coordinated touches (below). For zero-init tables (heap,
stack) the looking lengths are 0, so all wiring stays **byte-for-byte
identical** to the current behaviour.

## Change Highlights

- **`ceno_zkvm/tables/ram/ram_impl.rs`**: range-check every non-zero-init
  limb (`assert_ux::<LIMB_BITS>`) in `construct_circuit`; thread an optional
  `LkMultiplicity` through `assign_instances` / `assign_instances_dynamic` so
  the per-limb u16 lookups are recorded. Adds the #999 soundness regression
  test.
- **`ceno_zkvm/tables/ram/ram_circuit.rs`**: `DynVolatileRamCircuit`
  overrides `build_gkr_iop_circuit` to size the r/w/**lk**/zero out-eval
  groups with looking + table lengths; adds
  `assign_instances_with_lk_multiplicities`.
- **`ceno_zkvm/structs.rs`**: new
  `ZKVMWitnesses::assign_table_circuit_with_lk`
  (mirrors `assign_shared_circuit`) to fold a table circuit's own lookup
  multiplicity into `lk_mlts` before finalize.
- **`ceno_zkvm/instructions/riscv/rv32im/mmu.rs`**: route `HintsInitCircuit`
  through `assign_table_circuit_with_lk`; `HeapInitCircuit` stays on the plain
  path (zero-init, no lookups).
- **`ceno_zkvm/e2e.rs`**: move the dynamic-init-table assignment **before**
  `finalize_lk_multiplicities` so the hint range-check lookups land in
  `combined_lk_mlt`, matching the existing pre-finalize ShardRam ordering
  (main + GPU-debug-compare paths).

Rebased onto current `master`; merged with #1350's dynamic-init padding guard
so the u16 multiplicity is recorded only for real rows
(`i < num_instances && let Some(rec) = rec_opt`), consistent with the prefix
selector that gates the range-check constraints.

## Benchmark / Performance Impact

Not benchmarked: the change adds only `O(hint_words)` u16 range-lookups
during witness generation and leaves the verifier's structural shape
unchanged (zero-init tables produce identical circuits). Prover impact is
negligible relative to the per-shard logup it already performs.

## Testing

```sh
cargo check --workspace --all-targets
cargo make clippy                       # -D warnings
cargo test -p ceno_zkvm tables::ram::ram_impl   # incl. new #999 regression
```

- **New soundness regression** `test_hint_init_rejects_non_canonical_limb`:
  honest (canonical) limbs satisfy every range lookup; forcing any single init
  limb to `2^LIMB_BITS` makes the `init_v_limb_{i}_in_u16` lookup fall outside
  the u16 table and `MockProver` rejects the witness. **The test fails if the
  range check is removed.**
- Real (non-mock) prove + verify of **fibonacci with hints**, single shard and
  61-shard: the global logup balances including the hint circuit's
  range-check records (a dropped lookup would unbalance the recorded
  multiplicity).
- Integration e2e suite (mirror of `.github/workflows/integration.yml`); the
  fibonacci `--hints` MOCK path is confirmed passing on this branch.

## Risks and Rollout

- **Soundness:** strictly tightening — adds a constraint, removes an
  under-constraint. No verifier semantic-contract change.
- **Compatibility:** zero-init tables (heap/stack) are wired identically; only
  `HintsTable` gains the lk group.
- **Rollback:** revert the commit; no migrations or persisted state.

## Follow-ups (optional)

- None required for soundness. (Optional defense-in-depth, separate from this
  PR: make the u32 arithmetic in the verifier's `validate_mem_state`
  hint/heap bound explicit via `checked_add`/`checked_mul`.)

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md`
strictly.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Resolve merge conflicts from merging origin/master (HintsTable range-check fix)
into feat/recursion-v2 (forked transcript + bump-p3). Take HEAD's versions for
prover/verifier architecture and p3 API renames (from_canonical_* -> from_*,
FieldAlgebra -> PrimeCharacteristicRing). gkr-backend pinned to feat/bump-p3
branch, p3-field bumped to 0.4.3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>