Merge master into recursion-v2 (bump-p3 + conflict resolution)#1361
Open
kunxian-xia wants to merge 30 commits into
# Description

`n_challenges` on `Layer`, `Chip`, and `GKRCircuit` was always 0: every layout struct initialized it to 0 and never assigned any other value. The design intended per-layer challenge sampling, but it was never actually used.

This PR removes the field and all its plumbing:

- `Layer.n_challenges` field and the `update_challenges` method it drove
- `n_challenges` parameter from `Layer::new`, `Layer::from_circuit_builder`, `LayerConstraintSystem::into_layer`, and `LayerConstraintSystem::into_layer_with_lookup_eval_iter`
- `Chip.n_challenges` and the corresponding `Chip::new_from_cb` parameter
- `GKRCircuit.n_challenges`
- `ProtocolBuilder::n_challenges()` trait method
- `generate_layer_challenges` in the recursion verifier (replaced by passing `challenges` directly)

No behavior change: `sample_and_append_challenge_pows(0, ...)` was a no-op, and `update_challenges` with n=0 neither sampled nor wrote anything meaningful.

---------

Co-authored-by: sphere <sphere@scroll.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

This PR simplifies the chip `finalize` flow by moving layer construction into each chip/layout's `finalize` implementation, so callers no longer need to manually compute `out_evals`, build a `Layer`, and call `chip.add_layer(...)`.

## What changed

- Added a shared `default_out_eval_groups(...)` helper in `gkr_iop`.
- Updated `finalize(...)` implementations to directly assemble and return a fully built `Chip`.
- Removed duplicated caller-side boilerplate in precompile, instruction, and table builders.

## Why

This makes the finalize contract simpler and more consistent: callers ask for a finalized chip and receive a ready-to-use one, instead of partially constructing it in multiple places.

## Validation

- `cargo check`
- `cargo check --all-targets`

---------

Co-authored-by: sphere <sphere@scroll.io>
- add copilot review prompt
- only whitelist ceno-book deploy when `docs/**` changes and after merge
#1299)

A prerequisite PR for batching the main sumcheck across all chips. This PR refactors rotation handling and unifies the GKR-circuit main-sumcheck flow.

### What changed (high level)

This PR reshapes the proving/verification pipeline so **rotation constraints are handled at chip level** instead of being embedded in each GKR layer proof.

Core effect: rotation logic is now a first-class chip-proof component (`rotation_proof`), and selector/eval wiring is unified across CPU, GPU, native verifier, and recursion verifier.

Changed areas are mainly:

- `ceno_zkvm/src/scheme/*` (prover/verifier flow and proof struct)
- `gkr_iop/src/gkr/layer/*` (layer construction, selector grouping, zerocheck/sumcheck behavior)
- `ceno_recursion/src/zkvm_verifier/*` (recursive proof input/verification alignment)

---

### Major proving-flow changes

#### 1) Rotation proof moved out of per-layer proof and into chip proof

Previously, `LayerProof` carried optional rotation data at layer scope. Now, the rotation proof is stored in `ZKVMChipProof.rotation_proof` and handled once at chip level.

Implications:

- Per-layer proof payload is simplified (`LayerProof` now focuses on the main sumcheck path).
- Rotation presence/shape checks become explicit at the chip verification boundary.
- Recursion binding mirrors this with `has_rotation_proof` + `rotation_proof` in chip-level input.

---

#### 2) Main constraints and rotation constraints are no longer split into separate in-layer proving phases

Before, the zerocheck flow effectively had a special rotation handling path followed by the main constraints. Now, the main sumcheck flow is unified around `out_sel_and_eval_exprs`, while rotation is produced/checked as a dedicated chip-level proof and then mapped back through selector groups.

Implications:

- Less bifurcation in prover logic.
- Fewer “special-case” transitions between rotation and non-rotation constraints.
- Cleaner challenge/eval accounting and easier reasoning about claimed openings.

---

#### 3) Selector context construction is unified using first-layer selector groups

CPU/GPU prover and verifier paths now build selector contexts by iterating first-layer selector groups (`out_sel_and_eval_exprs`) rather than relying on branchy ad-hoc logic.

Implications:

- Better CPU/GPU parity.
- Consistent selector semantics (`r_selector`, `w_selector`, lookup/zero/whole selectors).
- Reduced risk of backend-specific drift in selector evaluation behavior.

---

#### 4) Rotation selector groups become explicit and dedicated

In `gkr_iop` layer construction, rotation claims reserve dedicated selector groups/opening slots (3-way grouping for left/right/origin-style rotation openings). A helper (`rotation_selector_group_indices`) is used to map rotation claims to the right selector groups deterministically.

Implications:

- Unambiguous assignment of rotation claim evals to selector groups.
- Avoids accidental dedup/aliasing with ordinary selector groups.
- Makes verifier-side matching stricter and easier to audit.

---

### Design rationale

1. **Single source of truth for rotation handling**
   Rotation is conceptually chip-level logic that spans selector/eval organization; storing it at chip level matches that abstraction better than per-layer optional fields.
2. **Reduce proving-flow complexity and duplicated logic**
   Previous split paths (rotation-specific + main constraints) created duplicated mechanics and increased maintenance burden. Unifying the flow lowers cognitive and code complexity.

---

### Net effect

This PR is a **proving architecture cleanup**:

- rotation is elevated to chip-level proofing,
- selector/eval flow is unified across backends and verifiers,
- and proof shape invariants are tightened.
## Summary
> Goal: Add docs that explain our technical design in detail.
This should lower the barrier to understanding Ceno for both developers
and AI tools. These materials will also help design / review AI agents
build a thorough understanding of our technology.
### Contents
Topics covered:
1. architecture overview
   - [x] multi-chip architecture (frontend): each instance of each chip can span multiple rows, so the witness polynomials are $f_j(r, i)$ (row $r$, instance $i$), and we allow
     - same-row constraint/gate $0 = C(f_1(r,i), \ldots, f_w(r,i))$
     - cross-row constraint/gate $0 = C(f_1(r,i), \ldots, f_w(r,i), f_1(r', i), \ldots, f_w(r', i))$.
   - [x] multi-shards workflow
2. optimizations
   - [ ] distributed sumcheck
3. appendix
   - [x] gkr protocol for tower tree
   - [ ] gkr protocol for logup
   - [x] local rotation piop
   - [x] ecc grand sum piop
| PIOP | Purpose | Sumcheck instances | Opening points per committed MLE |
|---|---|---|---|
| GKR for Grand Product | Grand product $\prod_i a_i$ of $N = 2^d$ inputs | $d - 1$ | Input MLE $a$ at a single point $z \in B_d$ |
| Local Rotation PIOP | Round-to-round state transition for round-based computations (e.g. Keccak-f) | $1$ | Each $f_j$ at three points $(\mathbf{s}_r, \mathbf{s}_i), (\mathbf{p}_0, \mathbf{s}_i), (\mathbf{p}_1, \mathbf{s}_i) \in B_m \times B_n$ |
| EC-Sum Quark PIOP | Sum $\sum_i P_i$ of EC points on a short-Weierstrass curve | $1$ | $x, y$ at $(\mathbf{r}, 0), (\mathbf{r}, 1), (1, \mathbf{r}) \in B_{n+1}$; $s$ at $(1, \mathbf{r})$ |
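To make the "GKR for Grand Product" row concrete, here is a minimal sketch of the binary product tree (the "tower tree") that such a protocol reasons about. It only builds the layers; it does not run any sumcheck. The function name `product_tree` and the use of `u64` arithmetic modulo the BabyBear prime are illustrative assumptions, not Ceno's actual types.

```rust
/// Example modulus (the BabyBear prime); an assumption for illustration only.
const P: u64 = 2_013_265_921;

/// Builds every layer of a binary product tree over `inputs`.
/// Layer 0 holds the N = 2^d inputs; the last layer holds the single grand product.
fn product_tree(inputs: &[u64]) -> Vec<Vec<u64>> {
    assert!(inputs.len().is_power_of_two(), "need N = 2^d inputs");
    let mut layers = vec![inputs.iter().map(|v| v % P).collect::<Vec<u64>>()];
    while layers.last().unwrap().len() > 1 {
        let prev = layers.last().unwrap();
        // Each parent is the product of its two children, reduced mod P.
        let next: Vec<u64> = prev.chunks(2).map(|c| c[0] * c[1] % P).collect();
        layers.push(next);
    }
    layers
}
```

For four inputs `[1, 2, 3, 4]` the layers are `[1, 2, 3, 4]`, `[2, 12]`, and `[24]`; a GKR-style argument reduces a claim about one layer to a claim about the layer below it.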
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…PU OOM (#1316) ## Problem PR #1299 changed GKR output/eval wiring, but the GPU proving flow still treated `build_main_witness` as if only the old read/write/lookup outputs existed. That caused two issues: - stale scheduler / memcheck estimates after #1299 - unnecessary GPU witness materialization before tower proving, which can push large Keccak payloads into OOM on 4090-class cards ## Design Rationale Keep the proof shape and verifier unchanged, and fix this entirely in prover-side staging: - route first-layer GKR output groups with prover-only stage metadata - materialize only tower-needed outputs before tower proving - keep ECC / rotation self-contained in their existing submodules - update GPU memory estimation to match the post-#1299 output topology This reduces VRAM pressure during tower proving without changing proof semantics. ## Change Highlights - `ceno_zkvm` - add prover-only `GkrOutputStageMask` routing for first-layer output groups - build only tower-facing witness outputs before `prove_tower_relation` - keep ECC / rotation on their existing dedicated witness/eval paths - update GPU memory estimation for post-#1299 GKR outputs and tower-stage residency - update local precompile/test callsites to the new `gkr_witness` API - `gkr_iop` - extend witness generation with filtered materialization APIs for CPU/GPU backends - keep filtering internal to prover execution; no verifier/proof-format changes ## Benchmark / Performance Impact Primary intent is memory reduction, not throughput optimization. 
### Operation Benchmark command(s): ```sh CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall ``` Ceno reth https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/24667766891 ## Testing ```sh cargo make clippy cargo check -p ceno_zkvm --features gpu CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall ``` ## Risks and Rollout - Risk is limited to prover-side witness staging and GPU memory estimation. - Verifier behavior and proof format are intentionally unchanged. - If this regresses proving, rollback can revert the new output-stage routing and filtered witness materialization together. ## Follow-ups (optional) - Add a targeted large-payload regression check for post-#1299 GPU WITGEN memory peaks. ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
## Summary - `CLAUDE.md`: repo guide covering crate layout, toolchain, edit priorities (soundness first, with verifier code including ceno_recursion as the highest-scrutiny surface), and gotchas. - `.github/pr-review-checklist.md`: canonical category-by-category review checklist (transcript/Fiat–Shamir, sumcheck plumbing, PCS openings, determinism, verifier robustness, feature parity, recursion/native-verifier parity, scope). - `.github/copilot-instructions.md`: surface the verifier-vs-prover asymmetry and point to the new shared checklist instead of duplicating it inline. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
… group construction (#1310)

This update addresses reviewer feedback to explicitly document the concrete expression proved in the main sumcheck phase. The code now spells out the selector-group RLC form at the exact construction/proving sites (CPU/GPU + layer assembly), states the mathematical form of the smaller sumchecks batched by the main sumcheck (including zero-target zerochecks), and adds formula-level documentation in `Layer::from_circuit_builder` for how output evaluation groups are assembled.

- **Main-sumcheck expression clarity**
  - Added precise inline comments describing the polynomial shape:
    - per-group term construction in `zerocheck_layer`
    - main-sumcheck entry points in CPU and GPU provers
  - Files updated:
    - `gkr_iop/src/gkr/layer/zerocheck_layer.rs`
    - `gkr_iop/src/gkr/layer/cpu/mod.rs`
    - `gkr_iop/src/gkr/layer/gpu/mod.rs`
- **Concrete formulas now documented in-place**
  - Main batched polynomial:
    ```text
    p(x) = Σ_g p_g(x)
    ```
  - Per-group (smaller) sumcheck polynomial:
    ```text
    p_g(x) = sel_g(x) * Σ_j (α_{2+offset(g,j)} * expr_{g,j}(x))
    ```
  - Per-group and batched sumcheck targets:
    ```text
    S_g = Σ_{x in {0,1}^n} p_g(x)
    Σ_{x in {0,1}^n} p(x) = Σ_g S_g
    ```
  - Zerocheck expectation (chip-derived constraints):
    ```text
    S_g = 0
    Σ_{x in {0,1}^n} p(x) = Σ_g S_g = 0
    ```
- **Layer output-eval group construction (new)**
  - Added comments in `Layer::from_circuit_builder` describing how groups are formed for:
    - read (`r_selector`)
    - write (`w_selector`)
    - lookup (`lk_selector`, including padding-normalized non-negated/negated forms)
    - rotation (left/right/target groups)
    - ECC bridge (x/y/slope/x3/y3 groups)
    - zero constraints (`zero_selector`)
  - Added a batched formula linking these groups to the main sumcheck term construction and clarified how `offset(g,j)` is assigned from the flattened `expr_evals` order.
  - File updated:
    - `gkr_iop/src/gkr/layer.rs`
- **Expression assembly note**
  - In `zerocheck_layer`, comments clarify that:
    - `rlc_zero_expr` builds per-group `p_g(x)` terms
    - the final `p(x)` is the sum over all groups

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
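The main-sumcheck formulas in the #1310 description above can be checked numerically on a tiny example. The sketch below evaluates `S_g` and `Σ_g S_g` by brute force over the hypercube. The `Group` struct, the `batched_sum` function, the example modulus, and the rule that `offset(g,j)` is a running count over flattened expressions are all assumptions made for illustration; the real code works on multilinear extensions and extension-field challenges.

```rust
/// Example modulus (the BabyBear prime); an assumption for illustration only.
const P: u64 = 2_013_265_921;

/// One selector group: evaluations over all x in {0,1}^n, stored as flat tables.
/// All entries are assumed to be already reduced mod P.
struct Group {
    sel: Vec<u64>,        // sel_g(x)
    exprs: Vec<Vec<u64>>, // expr_{g,j}(x), one table per j
}

/// Returns (S_g for every group, Σ_g S_g), where
/// p_g(x) = sel_g(x) * Σ_j alpha[2 + offset(g,j)] * expr_{g,j}(x).
fn batched_sum(groups: &[Group], alphas: &[u64]) -> (Vec<u64>, u64) {
    let mut offset = 0usize; // running index into the flattened expression order
    let mut per_group = Vec::new();
    for g in groups {
        let mut s_g = 0u64;
        for x in 0..g.sel.len() {
            let mut inner = 0u64;
            for (j, e) in g.exprs.iter().enumerate() {
                inner = (inner + alphas[2 + offset + j] * e[x] % P) % P;
            }
            s_g = (s_g + g.sel[x] * inner % P) % P;
        }
        offset += g.exprs.len();
        per_group.push(s_g);
    }
    let total = per_group.iter().fold(0u64, |acc, s| (acc + s) % P);
    (per_group, total)
}
```

A group whose expressions vanish wherever its selector is 1 contributes `S_g = 0`, which is the zerocheck expectation; any other group contributes its own non-zero target to the batched sum.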
## Problem `e2e` skipped verification whenever `--shard-id` was set, even when only a single shard proof was produced. That blocked valid single-shard debugging and validation flows. ## Design Rationale Treat single-shard verification as proof-soundness verification, not full continuation-chain verification. Verify the shard proof itself and halt invariants, but keep cross-shard continuation checks only for multi-shard proof sets. ## Change Highlights - `ceno_zkvm`: add standalone single-shard verifier path and route single-proof `--shard-id` e2e runs through it - `ceno_zkvm`: keep partial multi-shard subsets on the existing skip-verify behavior - `ceno_recursion`: add `--shard-id` plumbing to `e2e_aggregate` so single-shard aggregation can be exercised directly ## Benchmark / Performance Impact No intended performance change. This PR changes verification gating/semantics for the single-shard debug path only. ## Testing ```sh cargo check -p ceno_zkvm -p ceno_recursion --bins --release cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 --shard-id=0 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 --shard-id=0 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall ``` ## Risks and Rollout Main risk is semantic confusion between standalone shard verification and full multi-shard continuation verification. This PR keeps those paths separate: single-shard verifies proof validity only, while multi-shard still owns continuation checks. ## Follow-ups (optional) None. ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
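The verification gating described in the `--shard-id` change above can be summarized as a small decision function. This is a sketch under stated assumptions: `VerifyMode` and `select_verify_mode` are hypothetical names, and the rule "no `--shard-id` means full verification" is inferred from the description rather than taken from the code.

```rust
/// Hypothetical verification modes, named for illustration only.
#[derive(Debug, PartialEq)]
enum VerifyMode {
    /// Full verification, including cross-shard continuation checks.
    Full,
    /// Standalone check of one shard proof plus halt invariants.
    SingleShard,
    /// Partial multi-shard subset: keep the existing skip-verify behavior.
    Skip,
}

/// Picks a mode from the optional `--shard-id` value and the number of proofs produced.
fn select_verify_mode(shard_id: Option<usize>, num_proofs: usize) -> VerifyMode {
    match (shard_id, num_proofs) {
        (None, _) => VerifyMode::Full,
        (Some(_), 1) => VerifyMode::SingleShard,
        (Some(_), _) => VerifyMode::Skip,
    }
}
```

Keeping the three outcomes in one enum makes the separation between standalone shard verification and continuation verification explicit at the call site.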
## Problem Dynamic heap/hint init tables need verifier-side checks so shard continuation and dynamic-length constraints cannot be bypassed. This PR also needs to stay compatible with current master after the linker/public-io cleanup and the later single-shard debug verification support. ## Design Rationale Keep the memory-state verifier ISA-extensible by carrying RV32-specific heap/hint bounds in RV32imMemStateConfig, while enforcing continuation and dynamic-init checks in the native and recursion verifier paths. Merge on top of master instead of reverting master-only changes. ## Change Highlights - ceno_zkvm - restore verifier checks for dynamic heap/hint init tables - enforce heap/hint continuation and proof-size checks across shards - keep ZKVMVerifier / ZKVMVerifyingKey extensible with a mem-state verifier generic - merge master single-shard e2e verification flow and fix its halt expectation for debug shard verification - ceno_recursion - restore heap/hint bound checks in aggregation leaf verification - merge shard-id plumbing for single-shard e2e_aggregate - ceno_emul / ceno_rt - reconcile this PR's memory layout swap with master's removal of the PUBLIC I/O linker term - fix emulator dense-memory bounds for the merged layout ## Benchmark / Performance Impact No intended performance change beyond verifier-side checks. Previous measurements on this work showed negligible overhead; no new benchmark run was needed for the merge-only follow-ups. Benchmark command(s): not rerun for the merge-only follow-ups. Environment (CPU/GPU, core count, rust toolchain, commit hash): validated on local dev environment at head 7da2e88. 
raw data: - master: n/a - this PR: n/a ## Testing - cargo make clippy - cargo check --config net.git-fetch-with-cli=true -p ceno_zkvm -p ceno_recursion --bins --release - cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall - cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall - cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 --shard-id=0 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall ## Risks and Rollout Main risk is verifier semantic drift between full-trace verification and single-shard debug verification. This branch keeps them separate: full-trace verification still owns entry/continuation checks, while single-shard debug verification checks only the selected shard segment. ## Follow-ups (optional) - add a dedicated regression for the single-shard non-halt case in CI if needed ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply .github/copilot-instructions.md strictly. --------- Co-authored-by: xkx <xiakunxian130@gmail.com>
related: #1265 # GPU Witness Generation Accelerate witness generation by offloading computation from CPU to GPU. This module (`ceno_zkvm/src/instructions/gpu/`) contains all GPU-side dispatch, caching, and utility code for the witness generation pipeline. The CUDA backend lives in the sibling repo `ceno-gpu/` (`cuda_hal/src/common/witgen/`). ## Architecture ### Module Layout ``` gpu/ ├── dispatch.rs — GPU dispatch entry point (try_gpu_assign_instances, gpu_fill_witness) ├── config.rs — Environment variable config (3 env vars), kind tags ├── cache.rs — Thread-local device buffer caching, shared EC/addr buffers ├── chips/ — Per-chip column map extractors + chip-specific GPU dispatch │ ├── add.rs ... sw.rs (24 RV32IM column map extractors) │ ├── keccak.rs (column map + keccak GPU dispatch: gpu_assign_keccak_instances) │ └── shard_ram.rs (column map + batch EC computation: gpu_batch_continuation_ec) ├── utils/ │ ├── column_map.rs — Shared column map extraction helpers (extract_rs1, extract_rd, ...) │ ├── d2h.rs — Device-to-host: witness transpose, LK counter decode, compact EC D2H │ ├── debug_compare.rs— GPU vs CPU comparison (activated by CENO_GPU_DEBUG_COMPARE_WITGEN) │ ├── lk_ops.rs — LkOp enum, SendEvent struct │ ├── sink.rs — LkShardramSink trait, CpuLkShardramSink │ ├── emit.rs — Emit helper functions (emit_u16_limbs, emit_logic_u8_ops, ...) 
│ ├── fallback.rs — CPU fallback: cpu_assign_instances, cpu_collect_lk_and_shardram │ └── test_helpers.rs — Test utilities: assert_witness_colmajor_eq, assert_full_gpu_pipeline └── mod.rs — Module declarations + lk_shardram integration tests (19 tests) ``` ### Data Flow ``` Pass 1: PreflightTracer ┌──────────────────────┐ │ ShardPlanBuilder │ → shard boundaries │ addr_future_accesses │ → next-access HashMap (GPU cache reads and sorts before H2D) └──────────┬───────────┘ │ Pass 2: FullTracer (per shard) ┌──────────▼───────────┐ │ Vec<StepRecord> │ 136 bytes/step, #[repr(C)] └──────────┬───────────┘ │ H2D (cached per shard in cache.rs) ┌──────────▼───────────────────────────────────┐ │ GPU Per-Instruction │ │ ┌─────────────┬──────────────┬────────────┐ │ │ │ F-1 Witness │ F-2 LK Count │ F-3 EC/Addr│ │ │ │ (col-major) │ (atomics) │ (shared buf)│ │ │ └──────┬──────┴──────┬───────┴─────┬──────┘ │ └─────────┼─────────────┼─────────────┼────────┘ │ │ │ GPU transpose D2H counters flush at shard end │ │ │ ┌─────────▼─────────────▼─────────────▼────────┐ │ CPU Merge │ │ RowMajorMatrix LkMultiplicity ShardContext │ └──────────────────────┬───────────────────────┘ │ ┌──────────────────────▼───────────────────────┐ │ ShardRamCircuit (GPU) │ │ Phase 1: per-row Poseidon2 (344 cols) │ │ Phase 2: binary EC tree (layer-by-layer) │ └──────────────────────┬───────────────────────┘ │ ▼ Proof Generation ``` ### Per-Shard Pipeline Within `generate_witness()` (e2e.rs), each shard executes: 1. **upload_shard_steps_cached** — H2D `Vec<StepRecord>` (cached, shared across all chips) 2. **ensure_shard_metadata_cached** — H2D shard scalars + allocate shared EC/addr buffers 3. **Per-chip dispatch** — `gpu_fill_witness` matches `GpuWitgenKind` → 22 kernel variants - Each kernel writes: witness columns (col-major), LK counters (atomics), EC records + addr (shared buffers) 4. **flush_shared_ec_buffers** — D2H shared EC records + addr_accessed into `ShardContext` 5. 
**invalidate_shard_steps_cache** — Free GPU shard_steps memory 6. **assign_shared_circuit** — ShardRamCircuit GPU pipeline (Poseidon2 + EC tree) ### GPU/CPU Decision (dispatch.rs) ``` try_gpu_assign_instances(): 1. is_gpu_witgen_enabled()? → CPU fallback if not set 2. is_force_cpu_path() thread-local? → CPU fallback (debug comparison) 3. I::GPU_LK_SHARDRAM == false? → CPU fallback 4. is_kind_disabled(kind)? → CPU fallback 5. Field != BabyBear? → CPU fallback 6. get_cuda_hal() unavailable? → CPU fallback 7. All pass → GPU path ``` ### Keccak Dispatch Keccak has a dedicated GPU dispatch path (`chips/keccak.rs::gpu_assign_keccak_instances`) separate from `try_gpu_assign_instances` because: 1. **Rotation**: each instance spans 32 rows (not 1), requiring `new_by_rotation` 2. **Structural witness**: 3 selectors (sel_first/sel_last/sel_all) vs the standard 1 3. **Input packing**: needs `packed_instances` with `syscall_witnesses` The LK/shardram collection logic is identical to the standard path. ### Lk and Shardram Collection After GPU computes the witness matrix, LK multiplicities and shard RAM records are collected through one of several paths (priority order): | Path | Witness | LK Multiplicity | Shard Records | When | |------|---------|-----------------|---------------|------| | **A** Shared buffer | GPU | GPU counters → D2H | Shared GPU buffer (deferred) | Default for all verified kinds | | **B** Compact EC | GPU | GPU counters → D2H | Compact EC D2H per-kernel | Older non-shared-buffer kinds | | **C** CPU shardram | GPU | GPU counters → D2H | CPU `cpu_collect_shardram` | GPU shard unverified | | **D** CPU full | GPU | CPU `cpu_collect_lk_and_shardram` | CPU full | GPU LK unverified | | **E** CPU only | CPU | CPU `assign_instance` | CPU `assign_instance` | GPU unavailable | Currently all non-Keccak kinds use **Path A**. Paths B-E are fallback/debug paths. 
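The ordered fallback checks listed under "GPU/CPU Decision" can be sketched as a chain of early returns. This is not the real `try_gpu_assign_instances`: the `DispatchInputs` struct, the `WitgenPath` enum, and the reason strings are hypothetical stand-ins, and the inputs are passed as plain booleans instead of being read from the environment or the CUDA HAL.

```rust
/// Outcome of the dispatch decision; names are illustrative assumptions.
#[derive(Debug, PartialEq)]
enum WitgenPath {
    Gpu,
    CpuFallback(&'static str),
}

/// Hypothetical snapshot of everything the decision depends on.
struct DispatchInputs<'a> {
    witgen_enabled: bool,           // is GPU witgen switched on?
    force_cpu_path: bool,           // thread-local debug override
    chip_has_gpu_lk_shardram: bool, // per-chip capability flag
    disabled_kinds: &'a str,        // comma-separated kind tags, e.g. "add,keccak,lw"
    field_is_babybear: bool,
    cuda_hal_available: bool,
}

/// Applies the checks in the documented order; the first failing check wins.
fn choose_path(inputs: &DispatchInputs<'_>, kind: &str) -> WitgenPath {
    if !inputs.witgen_enabled {
        return WitgenPath::CpuFallback("GPU witgen not enabled");
    }
    if inputs.force_cpu_path {
        return WitgenPath::CpuFallback("forced CPU path");
    }
    if !inputs.chip_has_gpu_lk_shardram {
        return WitgenPath::CpuFallback("chip lacks GPU LK/shardram support");
    }
    if inputs.disabled_kinds.split(',').any(|k| k.trim() == kind) {
        return WitgenPath::CpuFallback("kind disabled");
    }
    if !inputs.field_is_babybear {
        return WitgenPath::CpuFallback("unsupported field");
    }
    if !inputs.cuda_hal_available {
        return WitgenPath::CpuFallback("CUDA HAL unavailable");
    }
    WitgenPath::Gpu
}
```

Returning a reason with each fallback is a design choice for the sketch: it makes it easy to log why a chip stayed on the CPU path.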
## E2E Pipeline Modes (e2e.rs) ``` create_proofs_streaming() │ ├─ Default GPU backend (CENO_GPU_ENABLE_WITGEN unset): │ Overlap pipeline: │ Thread A (CPU): witgen(shard 0) → witgen(shard 1) → witgen(shard 2) → ... │ Thread B (GPU): ................prove(shard 0) → prove(shard 1) → ... │ crossbeam::bounded(0) rendezvous channel for back-pressure │ └─ CENO_GPU_ENABLE_WITGEN=1 (GPU witgen) or CPU-only build: Sequential pipeline: witgen(shard 0) → prove(shard 0) → witgen(shard 1) → prove(shard 1) → ... GPU shared between witgen and proving; no overlap possible. ``` ## Environment Variables | Variable | Default | Purpose | |----------|---------|---------| | `CENO_GPU_ENABLE_WITGEN` | unset (CPU witgen) | Set to enable GPU witness generation. Sequential witgen+prove pipeline. | | `CENO_GPU_DISABLE_WITGEN_KINDS` | none | Comma-separated kind tags to disable specific chips' GPU path. Example: `add,keccak,lw`. Falls back to CPU for those chips. | | `CENO_GPU_DEBUG_COMPARE_WITGEN` | unset | Enable GPU vs CPU comparison for all chips. Runs both paths and diffs results. | ### `CENO_GPU_DEBUG_COMPARE_WITGEN` Coverage When set, all failures are collected into a `DebugCompareReport` (thread-local). Detailed mismatches are logged via `tracing::error!` in real time; at pipeline end `assert_debug_compare_report()` prints a summary table and panics if any failures exist. 
**Per-chip (in dispatch.rs, for each opcode circuit):** - `debug_compare_final_lk` — GPU LK multiplicity vs CPU `assign_instance` baseline (all 8 lookup tables) - `debug_compare_witness` — GPU witness matrix vs CPU witness (element-by-element) - `debug_compare_shardram` — GPU shard records (read_records, write_records, addr_accessed) vs CPU - `debug_compare_shard_ec` — GPU compact EC records vs CPU-computed EC points (nonce, x[7], y[7]) **Per-chip, Keccak-specific (in chips/keccak.rs):** - `debug_compare_keccak` — Combined witness + LK + shard comparison for keccak's rotation-aware layout **ShardRamCircuit (in chips/shard_ram.rs):** - `debug_compare_shard_ram_witness` — GPU ShardRam witness vs CPU baseline (from ShardRamInput) - `debug_compare_shard_ram_witness_from_device` — GPU ShardRam witness vs CPU baseline (D2H device buffer → convert → CPU assign) **Per-shard, E2E level (in e2e.rs, all chips combined):** - `log_shard_ctx_diff` — Aggregated addr_accessed comparison (write/read_records skipped when GPU witgen enabled) - `log_combined_lk_diff` — Merged LK multiplicities after `finalize_lk_multiplicities()` (catches cross-chip merge issues) ## Tests **79 tests total** (`cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "gpu"`) | Category | Count | Location | What it tests | |----------|------:|----------|---------------| | Column map extraction | 33 | `chips/*.rs` (31 via `test_colmap!` macro + 2 manual) | Circuit config → column map: all IDs in-range and unique | | GPU witgen correctness | 23 | `chips/*.rs` | GPU kernel output vs CPU `assign_instance` (element-by-element witness comparison) | | LK+shardram match | 19 | `gpu/mod.rs` | `collect_lk_and_shardram` / `collect_shardram` vs `assign_instance` baseline | | LkOp encoding | 1 | `utils/mod.rs` | `LkOp::encode_all()` produces correct table/key pairs | | EC point match | 1 | `scheme/septic_curve.rs` | GPU Poseidon2+SepticCurve EC point vs CPU `to_ec_point` | | Poseidon2 sponge | 1 | 
`scheme/septic_curve.rs` | GPU Poseidon2 permutation vs CPU | | Septic from_x | 1 | `scheme/septic_curve.rs` | GPU `septic_point_from_x` vs CPU | ### Running Tests ```bash # All GPU tests (requires CUDA device) CENO_GPU_ENABLE_WITGEN=1 cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "gpu" # Column map tests only (no CUDA device needed) cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "test_extract_" # LK/shardram tests only (no CUDA device needed) cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "lk_shardram" # With debug comparison enabled CENO_GPU_ENABLE_WITGEN=1 CENO_GPU_DEBUG_COMPARE_WITGEN=1 cargo test --features gpu,u16limb_circuit -p ceno_host -- test_elf ``` ## Per-Chip Boilerplate Macros Three macros in `instructions.rs` reduce per-chip GPU integration to ~3 lines: ```rust impl Instruction<E> for MyChip { // Emit LK ops + shard RAM records (CPU companion for GPU witgen) impl_collect_lk_and_shardram!(r_insn, |sink, step, _config, _ctx| { emit_u16_limbs(sink, step.rd().unwrap().value.after); }); // Collect shard RAM records only (when GPU handles LK) impl_collect_shardram!(r_insn); // GPU dispatch: try GPU → fallback CPU impl_gpu_assign!(dispatch::GpuWitgenKind::Add); } ``` --------- Co-authored-by: Ming <hero78119@gmail.com> Co-authored-by: xkx <xiakunxian130@gmail.com> Co-authored-by: Ray Gao <qg2153@columbia.edu>
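The overlap pipeline from "E2E Pipeline Modes" relies on a zero-capacity rendezvous channel for back-pressure. The sketch below reproduces that shape with the standard library's `sync_channel(0)` as a stand-in for `crossbeam::bounded(0)`; the function `run_overlapped` and the string "proofs" are placeholders, not the real `create_proofs_streaming` logic.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Runs a producer ("witgen") thread and a consumer ("prove") loop joined by a
/// rendezvous channel: each send blocks until the consumer is ready to receive,
/// so the producer hands over at most one shard at a time.
fn run_overlapped(num_shards: usize) -> Vec<String> {
    let (tx, rx) = sync_channel::<usize>(0);
    let producer = thread::spawn(move || {
        for shard in 0..num_shards {
            // Stand-in for CPU witness generation of this shard.
            tx.send(shard).expect("prover side hung up");
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });
    let mut proved = Vec::new();
    for shard in rx {
        // Stand-in for GPU proving of the received shard.
        proved.push(format!("proof({shard})"));
    }
    producer.join().expect("witgen thread panicked");
    proved
}
```

With capacity 0 the producer hands each shard directly to the consumer and then blocks on the next `send` until the prover is free, which bounds how much witness data is in flight.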
Add docs follows #1223 ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com> Co-authored-by: xkx <xiakunxian130@gmail.com>
## Problem
A malformed proof could cause the verifier process to crash — any
`unwrap` / `expect` / unchecked indexing / `assert!` on proof-derived
data is a liveness / DoS risk (a crafted proof kills the process instead
of being cleanly rejected).
## Changes
- **Panic cleanup.** Convert every `assert!` / `assert_eq!` / `unwrap` /
`expect` / unchecked indexing on proof-derived data to `ZKVMError`
returns across `verify_proofs_halt`, `verify_proof_validity`,
`verify_chip_proof`, `TowerVerify::verify`, and
`EccVerifier::verify_ecc_proof`. A malformed proof is now rejected
cleanly in all paths.
- **Document the verifier's semantic contract.** New sections in
`CLAUDE.md` and `docs/src/technical-overview.md` ("What the verifier
guarantees") state the two program-level facts a valid Ceno proof
attests to: **execution starts at `vk.entry_pc`** and **the terminal
shard invokes the halt ecall**. The exit code is deliberately *not* a
verifier guarantee — `public_values.exit_code` is bound by the
halt-ecall chip to register `a0`, but the guest program defines its own
exit-code semantics, so a non-zero value may be a legitimate application
signal. Callers that want "exited successfully" compare `exit_code == 0`
themselves.
- `CLAUDE.md` additionally flags prefix proofs (`expect-halt = false`)
as a dev/bench affordance, not a production surface. This caveat is
contributor-facing and is kept out of the user-facing mdbook.
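As an illustration of the panic-cleanup pattern, the sketch below replaces unchecked indexing and an `assert_eq!` on proof-derived data with error returns. `VerifyError` and `check_eval` are hypothetical names used only for this example; the actual code returns `ZKVMError`, whose variants are not shown here.

```rust
/// Hypothetical error type standing in for the verifier's real error enum.
#[derive(Debug, PartialEq)]
enum VerifyError {
    MissingEval { index: usize, len: usize },
    Mismatch { expected: u64, got: u64 },
}

/// Checks one proof-supplied evaluation without ever panicking.
/// A panicking version would be `assert_eq!(evals[index], expected)`.
fn check_eval(evals: &[u64], index: usize, expected: u64) -> Result<(), VerifyError> {
    // `get` returns None for an out-of-range index instead of panicking.
    let got = *evals
        .get(index)
        .ok_or(VerifyError::MissingEval { index, len: evals.len() })?;
    if got != expected {
        return Err(VerifyError::Mismatch { expected, got });
    }
    Ok(())
}
```

A proof with too few evaluations now yields `Err(MissingEval { .. })`, which the caller can report as a clean rejection.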
## Test plan
- [x] `cargo check --workspace --all-targets`
- [x] `cargo make clippy` (workspace, `-D warnings`)
- [x] `cargo test -p ceno_zkvm --lib` (152 passed)
- [x] `cargo test -p ceno_zkvm --lib scheme::` (18 passed)
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Trigger the reth benchmark workflow in `scroll-tech/ceno-reth-benchmark` when the `regression-e2e-reth` label is added to a PR. - dispatches `run-benchmark-v2.yml` on repo B - passes the PR head SHA as `ceno_version` - uses default benchmark block `23817600` - reports benchmark results back to the PR as a comment
In the list above, "repo B" refers to `scroll-tech/ceno-reth-benchmark`.
…ter (#1319)

## Problem

This PR carries forward the Fiat-Shamir soundness fix for prover-supplied evaluations onto current `master`.

Historically, several prover-provided evaluations were included in proofs but not always absorbed into the transcript before later challenges were sampled. That leaves room for internally consistent forgeries if prover and verifier do not bind the same data in the same order.

This work builds on the original fix in [#1294](#1294) by @MavenRain, extended with a few refinements. Many thanks for identifying the issue clearly and putting together the first end-to-end patch.

## Design Rationale

The goal here is to land the same soundness principle on top of the newer codebase with minimal semantic drift:

- keep prover/verifier transcript ordering aligned
- preserve the original fix's intent while rebasing onto current `master`
- factor repeated transcript-binding logic into small helpers where that improves reviewability
- document a few subtle data-layout assumptions so future refactors are less likely to break transcript consistency

This PR is intentionally not a larger transcript-architecture refactor. It keeps the patch narrow and practical for the current code structure.

## Change Highlights

- `gkr_iop`
  - bind final sumcheck / zerocheck / rotation evaluations into the transcript in the verifier path
  - factor the binding step into a small helper to make the transcript rule explicit
- `ceno_zkvm`
  - keep tower verifier transcript binding aligned with the prover for active prod/logup rounds
  - document the `TowerProofs` layout so it is clear that only active rounds are stored
- `ceno_recursion`
  - mirror the same transcript-binding behavior in the recursion verifier DSL
  - factor repeated challenger-observe logic into local helpers for readability
- merge current `master`
  - resolve drift against the latest branch layout and verifier offsets

## Benchmark / Performance Impact

This is a soundness/correctness fix. No meaningful performance change is intended.

## Testing

```sh
cargo check -p gkr_iop -p ceno_zkvm -p ceno_recursion
```

## Risks and Rollout

Main risk is transcript-order mismatch across prover, verifier, and recursion verifier. This PR keeps those paths aligned and uses small helpers/comments to make the ordering easier to audit.

Rollback is straightforward: revert this PR if any transcript compatibility issue is discovered before merge.

## Follow-ups (optional)

- A more systematic long-term design would make transcript observation a more explicit streaming interface across prover and verifier boundaries.
- The tower-level GPU/mock path mentioned in [#1294](#1294) remains a useful follow-up area if it is still relevant in the surrounding repos.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.

---------

Co-authored-by: Onyeka Obi <Onyeka.Obi@gmail.com>
The 6 slow basefold-verifier tests in `ceno_recursion` pin each `cargo make tests` pass at ~10 min, ~20 min across both feature-set runs. Mark them `#[ignore]` so default CI skips them; run locally with `cargo test -p ceno_recursion --lib -- --ignored --skip aggregation`. No CI step runs them in `--ignored` mode yet — follow-up if we want merge-queue to still exercise them. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Ming <hero78119@gmail.com>
Reference: https://github.com/marketplace/actions/workflow-dispatch

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
Fan out the Tests job into a 2-leg matrix so default and goldilocks run on separate runners. Each leg gets its own cache key to avoid thrash.

Status-check names change to `Run Tests (default)` / `Run Tests (goldilocks)`, so branch-protection / merge-queue required checks need to be updated when this lands.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Ming <hero78119@gmail.com>
## Summary

This PR makes the GPU prover compact/default-aware for MLE inputs while keeping protocol-facing behavior unchanged. The verifier, PCS dimensions, transcript, and sumcheck domains still use the logical domain; only GPU-resident prover data can stay compact by occupied rows.

Companion CUDA kernel changes: scroll-tech/ceno-gpu#146.

## Compact vs logical domains

- Logical domain is the constraint/sumcheck/PCS domain: `num_vars`, power-of-two padding, and rotation expansion remain verifier-visible and unchanged.
- Compact domain is the occupied physical row range kept on GPU. Keccak is the main stress case: each logical syscall instance expands to 32 physical rows, and compact buffers avoid padding those rows to the full logical domain.
- Missing compact tail entries are represented by `tail_default`. Most witness tails are zero; logup shared numerator can use one.
- CPU-side structures remain logical for compatibility. GPU proving carries compact resident length plus logical metadata and materializes logical shape only at boundaries that require it.

```text
Logical domain, protocol view:
  rows used by constraints / transcript / PCS / sumcheck
  [ occupied physical rows ][ logical tail padding ........ ]
  <----------------------- 2^num_vars ---------------------->

Compact GPU resident view:
  [ occupied physical rows ] + metadata { logical num_vars, tail_default }

Kernel read rule:
  if index < occupied_len: read compact[index]
  else:                    read tail_default
```

## Flow modes

- CPU backend: unchanged logical host MLE/RMM behavior.
- GPU backend with `CENO_GPU_ENABLE_WITGEN=0`: CPU witgen still produces host traces; GPU proving extracts/copies occupied rows into compact GPU MLE specs while preserving logical `num_vars` for constraints and sumcheck.
- GPU backend with `CENO_GPU_ENABLE_WITGEN=1`: GPU witgen can feed compact device-backed traces or replay-materialized inputs directly into the same compact proving path.
- Replay-heavy Keccak/ShardRam: compact tower inputs are materialized for tower, released, then rematerialized for ECC/rotation/main constraints so peak VRAM is lower without changing proof semantics.

```text
CPU backend, no compact GPU semantics:

CPU witgen
  -> logical RowMajorMatrix / MLEs
  -> CPU prover stages
  -> PCS / transcript / verifier all see logical domain
```

```text
GPU backend, CENO_GPU_ENABLE_WITGEN=0:

CPU witgen
  -> logical host traces / committed PCS data
  -> per-chip GPU extraction
       host logical rows -> compact GPU MLE specs
       keep { occupied_len, logical num_vars, tail_default }
  -> shared GPU proving stages
       tower -> ECC -> rotation -> main constraints -> opening
  -> PCS / transcript / verifier still see logical domain
```

```text
GPU backend, CENO_GPU_ENABLE_WITGEN=1:

GPU witgen
  -> compact device-backed traces or replay sources
  -> deferred commit / replay materialization when needed
  -> shared GPU proving stages
       tower -> ECC -> rotation -> main constraints -> opening
  -> PCS / transcript / verifier still see logical domain

Keccak / ShardRam replay lifetime:
  materialize compact tower input
  -> prove tower
  -> drop tower input
  -> rematerialize for ECC / rotation / main
  -> open committed traces
```

## Proving semantics

- Sumcheck runs over logical domains. Compact metadata only changes how GPU kernels read resident buffers and defaults for omitted tail entries.
- Tower build/prove consumes compact product/logup inputs directly and avoids carrying full-domain padded tower inputs through the proof lifetime.
- Rotation/main GKR use the same logical constraint domains while accepting compact/default-aware GPU MLE inputs where the kernels support it.
- Scheduler-facing estimates and memtracking distinguish compact resident bytes from logical-domain temporary bytes to avoid both under-booking and double-counting.

```text
Shared GPU proving path:

compact/default-aware MLE specs
{ ptr, occupied_len, logical num_vars, tail_default }
  |
  +--> tower build/prove
  |      - compact product/logup inputs
  |      - logical sumcheck rounds
  |
  +--> rotation / main GKR
  |      - logical constraint domain
  |      - compact/default-aware reads
  |
  +--> PCS opening
         - verifier-visible logical dimensions unchanged
```

## Unified paths

- CPU witgen + GPU proving and GPU witgen + GPU proving now share the same compact chip proof stages: tower, ECC, rotation, main constraints, and PCS opening.
- Product/logup tower construction is centralized around compact specs, including the scalar-one logup numerator case.
- Sequential and concurrent chip proving use the same estimator model, with memtracking checks available to catch estimator drift.

## Reviewer focus

- Boundaries between compact resident length and logical `num_vars`.
- `tail_default` handling in sumcheck/tower, especially non-zero logup numerator defaults.
- Keccak rotated physical rows and ShardRam replay/materialization lifetime.
- Scheduler estimates for `CENO_GPU_ENABLE_WITGEN=0/1` and `CENO_CONCURRENT_CHIP_PROVING=0/1`.
- Verifier/protocol parity: this PR should not change proof format or transcript semantics.
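
The kernel read rule from the compact-vs-logical section can be modeled on the host as a minimal sketch. Everything here is illustrative: `CompactMle` and its methods are hypothetical names, not types from `ceno_zkvm` or `ceno-gpu`, and field elements are stood in for by plain `u64` values.

```rust
/// Hypothetical host-side model of a compact, default-aware MLE buffer.
/// Field elements are modeled as u64 purely for illustration.
struct CompactMle {
    /// Occupied physical rows that are actually resident.
    compact: Vec<u64>,
    /// log2 of the logical (protocol-visible) domain size.
    num_vars: usize,
    /// Value returned for every omitted tail entry.
    tail_default: u64,
}

impl CompactMle {
    /// Size of the logical domain, 2^num_vars.
    fn logical_len(&self) -> usize {
        1usize << self.num_vars
    }

    /// Read rule: resident rows come from the compact buffer,
    /// everything past occupied_len reads as tail_default.
    fn read(&self, index: usize) -> u64 {
        assert!(index < self.logical_len(), "index outside logical domain");
        if index < self.compact.len() {
            self.compact[index]
        } else {
            self.tail_default
        }
    }

    /// Expand to the full logical shape, as a boundary that needs it would.
    fn materialize(&self) -> Vec<u64> {
        (0..self.logical_len()).map(|i| self.read(i)).collect()
    }
}

fn main() {
    // Three occupied rows in a logical domain of 2^3 = 8, tail default of one
    // (the non-zero default case mentioned for the logup shared numerator).
    let mle = CompactMle { compact: vec![5, 7, 9], num_vars: 3, tail_default: 1 };
    println!("{:?}", mle.materialize());
}
```

The point of the model is that `num_vars` stays the logical value while only `compact.len()` rows are stored, which is the boundary the reviewer-focus list asks reviewers to check.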
## Benchmark Source runs: - Baseline: [ceno-reth-benchmark run 25004787999 attempt 1](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25004787999/attempts/1#summary-73224449314), result [mainnet23817600-20260427-234425](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/ceno/mainnet23817600-20260427-234425_summary.md) - This PR: [ceno-reth-benchmark run 25004860748 attempt 2](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25004860748/attempts/2#summary-73295559209), result [mainnet23817600-20260428-074320](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/prover_mle_zero_padding/mainnet23817600-20260428-074320_summary.md) Block: `23817600`. Per-operation app-prove rows are profile totals across overlapped shard work, so they can exceed wall time; E2E/app-prove rows are the wall-time comparison. | Metric | Baseline | This PR | Delta | Change | |--------|----------|---------|-------|--------| | E2E total time | 81.900s | 80.900s | -1.000s | -1.22% | | app_prove wall time | 67.200s | 66.300s | -0.900s | -1.34% | | emulator | 10.400s | 10.500s | +0.100s | +0.96% | | commit_traces | 8.075s | 8.049s | -0.026s | -0.32% | | extract_witness_mles | 27.569s | 28.837s | +1.268s | +4.60% | | transport_structural_witness | 3.475s | 3.058s | -0.417s | -12.00% | | build_tower_witness_gpu | 4.711s | 3.413s | -1.298s | -27.55% | | prove_tower_relation_gpu | 178.197s | 188.436s | +10.239s | +5.75% | | prove_main_constraints | 24.464s | 24.238s | -0.226s | -0.92% | | pcs_opening | 17.892s | 17.716s | -0.176s | -0.98% | | CPU/GPU overlap gap | 3.910s | 3.930s | +0.020s | +0.51% | Peak memory is extracted from concurrent benchmark job logs by taking the max of `[gpu device]` snapshots. `pool_booked` is scheduler reservation/estimate, not actual VRAM usage. 
| Memory metric | Baseline peak | This PR peak | Drop | Drop % | |---------------|--------------:|-------------:|-----:|-------:| | `cuda_used` | 23637.19 MB | 21557.19 MB | 2080.00 MB | 8.80% | | `pool_used` | 21792.58 MB | 19267.89 MB | 2524.69 MB | 11.59% | | `pool_reserved` | 23136.00 MB | 21056.00 MB | 2080.00 MB | 8.99% | | `pool_booked` | 23180.86 MB | 23180.87 MB | -0.01 MB | -0.00% | Summary: wall time is slightly faster in this run (`81.9s -> 80.9s`). Peak VRAM is lower (`cuda_used`: `23637.19 MB -> 21557.19 MB`, -`2080.00 MB` / -`8.80%`; `pool_reserved`: `23136.00 MB -> 21056.00 MB`, -`2080.00 MB` / -`8.99%`). Compact tower build is materially faster (`4.711s -> 3.413s`), while the overlapped tower proving profile total is higher (`178.197s -> 188.436s`); because chip proving overlaps across shards, the wall-time result is the primary performance signal. ## Validation commands ```sh cargo check --features gpu --package ceno_zkvm --bin e2e cargo make clippy CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall CENO_GPU_MEM_TRACKING=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall ```
## Problem

Main sumcheck was proved and verified per chip, which duplicated transcript work, selector/claim handling, and PCS opening plumbing across chips.

This PR includes Jagged PCS integration. For benchmark results, see PR #1336.

## Design Rationale

Use one global batched main sumcheck proof while keeping PCS openings in the existing suffix path. The verifier mirrors the prover transcript order, including ECC bridge sampling before the global `combine subset evals` challenge, and evaluates frontloaded expressions in the verifier.

## Change Highlights

- `ceno_zkvm`: batches main constraints into a single global proof path across chip proofs.
- `ceno_zkvm`: keeps witness/fixed PCS openings per chip after global main verification.
- `ceno_recursion`: mirrors native verifier changes for the batched main proof.
- `ceno-gpu`: supports the batched main proving flow.

## Benchmark / Performance Impact

### CPU Integration E2E

A local CPU sanity check compares the PR's CPU batched-main path against a local `master` baseline on `secp256r1_verify_prehash`.

| Case | Command Target | Shard Proof | vs Baseline | Result |
|---|---|---:|---:|---|
| Baseline `master` | `ceno_zkvm e2e --platform=ceno .../secp256r1_verify_prehash` | 37.378s | Baseline | Pass |
| PR + gkr-backend worker-bit merge optimization | same target | 41.974s | -1.12x | Pass |

CPU result: batched main is now `1.123x` slower than baseline (`+12.30%`) on this integration target, rather than the previous timeout-scale regression.

### GPU Reth Benchmark

The benchmark session compares the frontload baseline against successive `feat/batch_main_sumcheck` optimization runs on block `23817600`, GPU proving, `CENO_GPU_ENABLE_WITGEN=0`.

Comparison convention: lower time is better. Signed `x` values use `-Nx` for slower-than-baseline wall time and `+Nx` for faster/lower-time metrics; for example, taking twice as long is `-2.00x`.
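
The signed-ratio convention can be written down as a small sketch. The function names `signed_ratio` and `format_ratio` are hypothetical and not part of the benchmark tooling; the treatment of equal times as `+1.00x` is an assumption.

```rust
/// Signed ratio under the stated convention:
/// slower than baseline -> -(current / baseline),
/// faster (or equal, by assumption) -> +(baseline / current).
fn signed_ratio(baseline_s: f64, current_s: f64) -> f64 {
    assert!(baseline_s > 0.0 && current_s > 0.0, "times must be positive");
    if current_s > baseline_s {
        -(current_s / baseline_s)
    } else {
        baseline_s / current_s
    }
}

/// Render with an explicit sign and two decimals, e.g. "-1.12x".
fn format_ratio(baseline_s: f64, current_s: f64) -> String {
    format!("{:+.2}x", signed_ratio(baseline_s, current_s))
}

fn main() {
    // CPU integration row: 37.378s baseline vs 41.974s with the PR.
    println!("{}", format_ratio(37.378, 41.974));
    // Taking twice as long.
    println!("{}", format_ratio(1.0, 2.0));
}
```

With this reading, the CPU integration row (37.378s against 41.974s) renders as `-1.12x`, matching the table entry.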
### Timeline / Optimization Progress | Date | Run | Ceno / GPU Commit | E2E | vs Baseline | app_prove | vs Baseline | prove_batched_main_constraints | Short Highlight | |---|---|---|---:|---:|---:|---:|---:|---| | May 6 | [25419833788 / job 74559223217](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25419833788/job/74559223217) | Ceno `7a07649b`, GPU `1118dca8` | 75.600s | Baseline | 61.000s | Baseline | 0.000s | **Baseline**: frontload, per-chip main constraints | | May 9 AM | [25594090744 / job 75136918384](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25594090744/job/75136918384) | Ceno `dd229c00`, GPU `340651b4` | 103.000s | -1.36x | 87.400s | -1.43x | 0.000s | Batched branch after alpha.28 upgrade; tower/extract totals much lower but wall time regressed | | May 9 PM | [25603601935 / job 75161599043](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25603601935/job/75161599043) | Ceno `d5ae1b3a`, GPU `fbef26f3` | 104.000s | -1.38x | 88.300s | -1.45x | 26.925s | Batched main proof enabled; new batched-main critical path dominates | | May 11 | [25655529702 / job 75302942526](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25655529702/job/75302942526) | Ceno `c2c45cc9`, GPU `3dedbc78` | 91.800s | -1.21x | 76.500s | -1.25x | 15.457s | Latest optimization: direct batched-main construction + bucketed fold/eval GPU sumcheck | ### E2E / Layer | Metric | Baseline | Latest Optimization | Comparison | |---|---:|---:|---:| | E2E total | 75.600s | 91.800s | -1.21x | | emulator | 10.100s | 10.200s | -1.01x | | app_prove wall time | 61.000s | 76.500s | -1.25x | ### App Prove Breakdown Profiler module totals can overlap because chip proving is concurrent; use `app_prove wall time` above for critical-path impact. The latest run materially reduces the new batched-main cost, but total wall time is still slower than the frontload baseline. 
| Operation | Baseline | Batched May 9 AM | Batched May 9 PM | Latest May 11 | Latest vs Baseline | |---|---:|---:|---:|---:|---:| | prove_batched_main_constraints | 0.000s | 0.000s | 26.925s | 15.457s | New cost | | prove_main_constraints | 22.622s | 0.000s | 0.000s | 0.000s | Removed | | extract_witness_mles | 24.155s | 3.760s | 3.713s | 3.739s | +6.46x | | build_tower_witness_gpu | 3.491s | 0.323s | 0.316s | 0.323s | +10.81x | | prove_tower_relation_gpu | 176.090s | 24.008s | 24.417s | 24.857s | +7.08x | | pcs_opening | 15.246s | 15.207s | 15.164s | 15.175s | +1.00x | | commit_traces | 6.827s | 6.814s | 6.851s | 6.857s | -1.00x | | parsed rows total | 251.118s | 50.995s | 78.287s | 67.460s | +3.72x | ### Latest Improvement Against Previous Batched Run | Metric | May 9 PM Batched Main | May 11 Latest | Improvement | |---|---:|---:|---:| | E2E total | 104.000s | 91.800s | +1.13x | | app_prove wall time | 88.300s | 76.500s | +1.15x | | prove_batched_main_constraints | 26.925s | 15.457s | +1.74x | | parsed rows total | 78.287s | 67.460s | +1.16x | Benchmark command: ```sh CENO_GPU_ENABLE_WITGEN=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_CACHE_LEVEL=0 \ RUSTFLAGS="-C target-feature=+avx2" \ cargo run --features "jemalloc,gpu" --release --bin ceno-reth-benchmark-bin -- \ --mode prove-app --block-number 23817600 --rpc-url <redacted> \ --output-dir output --cache-dir rpc-cache ``` Environment: - GitHub self-hosted GPU runner, CUDA device `cc=8.9`, `24GB` GPU memory. - Rust `nightly-2025-11-20`, cargo `1.93.0-nightly`. - Baseline: [run 25419833788 / job 74559223217](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25419833788/job/74559223217), Ceno `7a07649b`, GPU `1118dca8`, [summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/frontload/mainnet23817600-20260506-142423_summary.md). 
- 2026-05-09 early batched branch: [run 25594090744 / job 75136918384](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25594090744/job/75136918384), Ceno `dd229c00`, GPU `340651b4`, [summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/batch_main_sumcheck/mainnet23817600-20260509-142948_summary.md). - 2026-05-09 batched-main critical path: [run 25603601935 / job 75161599043](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25603601935/job/75161599043), Ceno `d5ae1b3a`, GPU `fbef26f3`, [summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/batch_main_sumcheck/mainnet23817600-20260509-223459_summary.md). - Latest optimization: [run 25655529702 / job 75302942526](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25655529702/job/75302942526), Ceno `c2c45cc9`, GPU `3dedbc78`, [summary](https://github.com/scroll-tech/ceno-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/feat/batch_main_sumcheck/mainnet23817600-20260511-150859_summary.md). Summary: latest optimization improves `prove_batched_main_constraints` by `+1.74x` against the previous batched-main run (`26.925s -> 15.457s`) and improves E2E by `+1.13x` (`104.000s -> 91.800s`). It remains slower than the frontload baseline (`75.600s -> 91.800s`, `-1.21x`), with the remaining gap concentrated in the new batched-main critical path. ## Testing ```sh RUST_MIN_STACK=33554432 cargo check --package ceno_recursion --bin e2e_aggregate RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall ``` Also passed the linked GPU e2e benchmark run above. 
## Risks and Rollout - Soundness risk is concentrated in transcript ordering and verifier frontload evaluation; native and recursion verifiers now follow the same global proof flow. - Performance is not yet an E2E win in the linked benchmark despite removing per-chip main-constraint cost; further scheduling/host-overlap work is needed before rollout as a performance improvement. ## Follow-ups - Investigate reducing the new `prove_batched_main_constraints` critical-path cost. - Keep benchmark summaries explicit that parsed module totals overlap and are not a wall-time decomposition. ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly. --------- Co-authored-by: Velaciela <git.rover@outlook.com>
## Problem `GPU_WITGEN,CACHE=1` can produce witness traces on GPU, but the prover path still needs a clean device-resident commit flow. The goal is to keep GPU-generated witness data usable through commit without falling back to replay/deferred raw-cache logic or unnecessary host materialization. ## Design Rationale This PR treats GPU witness output as the source of truth for the commit path: traces are normalized into device-backed row-major metadata, committed through the GPU PCS path, and released once q'/commit no longer needs the raw backing. The post-commit proving flow stays aligned with the existing `CPU_WITGEN` path so correctness-sensitive transcript, opening, and proof assembly logic remain shared. The design avoids retaining replay plans as a second witness source. This keeps ownership simpler: GPU witness generation owns raw device buffers until q'/commit construction, then releases them before chip proving pressure grows. ## Change Highlights - `ceno_zkvm`: add GPU witness/device-backed trace commit path for `GPU_WITGEN,CACHE=1`. - `ceno_zkvm`: keep post-commit proving and opening flow shared with the existing GPU prover path. - `ceno_zkvm`: release shard GPU witness caches after proof construction. - `gkr_iop`: support GPU-side batched main-constraint proving integration. ## CI Benchmark Summary Compared CI benchmark runs: - `GPU_WITGEN`: original PR benchmark numbers, kept for context. 
- `CPU_WITGEN`: [`26067686212`](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/26067686212), branch `feat/witgen_gpu`, `CENO_GPU_ENABLE_WITGEN=0` - `CPU_WITGEN (baseline)`: [`26037135648`](https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/26037135648), branch `feat/update_dep`, `CENO_GPU_ENABLE_WITGEN=0` | Metric | GPU_WITGEN | CPU_WITGEN | CPU_WITGEN (baseline) | Notes | |---|---:|---:|---:|---| | reth-block E2E | 111s | 80.2s | 83.2s | CPU_WITGEN feature branch is fastest | | app.prove | 107s | 65.6s | 68.2s | CPU_WITGEN feature branch improves 2.6s vs baseline | | app_prove.inner | 96.6s | 65.6s | 68.2s | Same trend as app.prove | | Witness total | 35.43s | 40.85s | 39.84s | GPU_WITGEN remains faster raw witness gen | | Proof total | 60.70s | 62.15s | 64.78s | CPU_WITGEN feature branch improves proof total vs baseline | | commit_traces total | 12.35s | 17.410s | 17.450s | GPU_WITGEN commit path remains faster | | commit_traces avg/shard | 950ms | n/a | n/a | Original GPU_WITGEN per-shard metric kept | | prove_tower_relation_gpu total | n/a | 119.624s | 22.515s | Nested/overlapped span increased in feature run | | prove_batched_main_constraints total | n/a | 7.934s | 7.639s | Slight CPU_WITGEN regression | | pcs_opening total | 9.91s | 9.857s | 10.061s | Stable | | q commit total | 8.25s device_q | n/a | n/a | Original GPU_WITGEN q metric kept | | q commit avg/shard | 634ms | n/a | n/a | Original GPU_WITGEN q metric kept | | q inner commit avg | 449ms | n/a | n/a | Original GPU_WITGEN q metric kept | | CPU/GPU overlap gap | n/a | 3.170s | 3.200s | CPU_WITGEN overlap unchanged | | Overall result | 111s | 80.2s | 83.2s | CPU_WITGEN feature branch beats baseline; GPU_WITGEN still loses overall due to lost overlap | | Conclusion | Evidence | |---|---| | GPU_WITGEN still improves commit/witness subpaths | Original GPU_WITGEN has faster witness total and commit_traces than CPU_WITGEN | | GPU_WITGEN still loses overall | 111s E2E vs 80.2s 
CPU_WITGEN due to lost shard witness/proof overlap | | CPU_WITGEN feature branch is slightly faster than CPU_WITGEN baseline | reth-block improves by 3.0s; app.prove improves by 2.6s | | Commit/opening path is stable for CPU_WITGEN | commit_traces and pcs_opening are within ~0.2s across CPU runs | ## Benchmark / Performance Impact This is performance-sensitive. CI benchmark runs are used for comparable end-to-end numbers because local wall time depends heavily on runner scheduling and GPU availability. ### Operation | Operation | master (s) | this PR (s) | Improve (master -> this PR) | |-----------|------------|-------------|-----------------------------| | Reth proving benchmark | See benchmark CI | See benchmark CI | See benchmark CI | ### Layer | Layer | master (s) | this PR (s) | Improve (master -> this PR) | |-------|------------|-------------|-----------------------------| | Witness commit/q' path | Host/materialized path | Device-backed GPU path | Reduces host materialization and extra copies | | Post-commit proving | Existing GPU flow | Existing GPU flow | Intended to remain unchanged | Benchmark command(s): ```sh # ceno-reth-benchmark CI, GPU_WITGEN,CACHE=1 and CPU_WITGEN,CACHE=1 comparison runs ``` Environment (CPU/GPU, core count, rust toolchain, commit hash): CI benchmark runner metadata and commit hashes are recorded in the linked workflow runs. raw data: - master: benchmark CI artifacts - this PR: benchmark CI artifacts ## Testing ```sh cargo fmt --check cargo check -p ceno_zkvm --features 'gpu,u16limb_circuit' --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".cuda_hal.path="../ceno-gpu/cuda_hal"' ``` ## Risks and Rollout - Main risk is lifetime/ownership mistakes around device-backed witness buffers; the rollout keeps release points explicit and avoids replay cache ownership. - If regressions appear, disable `CENO_GPU_ENABLE_WITGEN` to return to the existing `CPU_WITGEN` GPU proving path. 
## Follow-ups (optional) - Continue profiling per-chip GPU witness generation and q' construction. - Add scheduler-level overlap once device memory booking is precise enough. ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
## Problem Ceno opens witness and fixed Jagged commitments in the same proof, but the previous integration still paid for separate inner Basefold openings. That duplicates inner query/opening proof bytes even though witness/fixed can share the same inner Basefold query set. ## Design Rationale Keep the outer Jagged protocol literally separate while sharing only the inner Basefold opening: - witness and fixed keep separate Jagged commitments and separate Merkle roots - witness and fixed keep separate Jagged sumcheck/assist rounds - each round keeps its own q' shape and reshape height from its existing commitment lifecycle - Ceno collects each Jagged round's inner opening claims and calls one Basefold `batch_open_with_trace_materializer` - the resulting `JaggedProof` contains all Jagged rounds plus one required shared `inner_proof` This is intentionally surgical: it does not refactor witness lifecycle, q' ownership, q' residency, or fixed/witness commit paths. The only integration change is moving inner Basefold opening from per-Jagged-round execution to one batched inner opening after all Jagged reductions. Soundness/correctness rationale: prover and verifier transcript order is aligned with gkr-backend: absorb all Jagged round reductions first, then absorb/verify one inner Basefold opening over all inner claims. Commitments remain independent, so sharing the inner proof does not merge witness/fixed roots. ## Change Highlights - `ceno_zkvm/src/scheme/gpu/mod.rs` - GPU Jagged opening now collects `(round_proof, rho_row, col_evals)` for each Jagged round. - Builds per-round inner opening claims without changing q' lifecycle. - Calls one shared GPU Basefold `batch_open_with_trace_materializer` for witness/fixed inner claims. - `ceno_zkvm/src/scheme.rs` - Extends existing proof-size display to include PCS-specific nested breakdowns. - `Cargo.toml` - Pins `gkr-backend` and `ceno-gpu` dependencies to `feat/jagged_single_commit`. 
## Benchmark / Performance Impact ### Operation | Operation | master (s) | this PR (s) | Improve (master -> this PR) | |-----------|------------|-------------|-----------------------------| | `reth-block` | 11.143 | 11.087 | +0.056s | | `app.prove` | 10.579 | 10.559 | +0.020s | | `create_proof_of_shard` | 6.315 | 6.509 | -0.194s | | `commit_traces` | 1.354 | 1.535 | -0.181s | | `pcs_opening` | 1.492 | 1.483 | +0.009s | | shard-0 proof size | 6.15 MiB | 5.48 MiB | +0.67 MiB smaller (-10.96%) | ### Layer | Layer | master (s) | this PR (s) | Improve (master -> this PR) | |-------|------------|-------------|-----------------------------| | Basefold witness commit/query | 0.423 / 0.270 | 0.427 / 0.269 | approximately flat | | Basefold fixed commit/query | 0.0147 / 0.00325 | folded into shared opening | removes separate inner proof | | Jagged outer rounds | separate witness/fixed | separate witness/fixed | unchanged by design | ### Proof-size breakdown, shard 0 Sizes are MiB, computed as bytes / 2^20. Percent is relative to the after proof file (`5.48 MiB`). 
| Component | Size | % of Proof | |---|---:|---:| | Proof file `app_proof.bitcode` | 5.48 MiB | 100.00% | | `mpcs_opening.total` | 1.09 MiB | 19.95% | | `mpcs_opening.rounds` | 0.008 MiB | 0.14% | | `round[0]` witness Jagged round | 0.005 MiB | 0.09% | | `round[1]` fixed Jagged round | 0.003 MiB | 0.05% | | shared `inner_proof` | 1.09 MiB | 19.81% | | `inner_proof.query_opening_proof` | 1.08 MiB | 19.78% | | `inner_proof.commits` | 0.001 MiB | 0.01% | | `inner_proof.sumcheck_proof` | 0.001 MiB | 0.02% | | `inner_proof.final_message` | 0.0003 MiB | 0.01% | Raw proof-size data: | Metric | Before | After | Delta | |---|---:|---:|---:| | Shard 0 proof file | 6,453,943 B | 5,746,603 B | -707,340 B (-10.96%) | | Single proof object | n/a | 5,746,602 B | n/a | | `mpcs_opening.total` | n/a | 1,146,525 B | n/a | | shared `inner_proof` | n/a | 1,138,405 B | n/a | ## Testing ```sh cargo fmt -p ceno_zkvm cargo check -p ceno_zkvm --features gpu ``` Additional dependency checks: ```sh # gkr-backend cargo fmt -p mpcs cargo check -p mpcs cargo check -p mpcs --all-targets # ceno-gpu cargo fmt -p cuda_hal cargo check -p cuda_hal --features bb31 ``` E2E validation: - `../ceno-reth-benchmark`, block `23587691`, shard `0`, GPU, `prove-app`, verifier passed. ## Risks and Rollout Risk is transcript/order mismatch because inner Basefold proof generation moved out of each Jagged round and into one shared call after all Jagged reductions. The verifier mirrors this order in gkr-backend, and shard-0 e2e verification passed. Performance risk is low: proof size improves materially while `pcs_opening` is flat on the measured shard. `commit_traces` varied upward in this run, but this PR does not change commit lifecycle or q' materialization. Rollback is localized: restore the previous per-Jagged-round inner opening path and the old dependency pins. ## Follow-ups (optional) None required for this PR. 
Broader cleanup can later remove temporary local benchmark patching once dependency branches are merged. ## Copilot Reviewer Directive (keep this section) When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
## Problem

Issue #1338 reproduces a soundness break on `master`. For the same RISC-V execution, the base verifier *and* the recursion verifier both accept two distinct proof batches whose public per-shard `shard_rw_sum` values differ on all 17 shards. The attacker takes an honest witness, replaces every cross-shard EC accumulator leaf `(x, y)` with its inverse `(x, -y)`, updates `shard_rw_sum`, and reproves.

Root cause: `ceno_zkvm/src/tables/shard_ram.rs:276-281` was a TODO. The host code in `ShardRamRecord::to_ec_point` encodes read vs write in the sign of `y[6]`, but the circuit only constrained the curve equation and the EC sum; it never tied the half of the field that `y[6]` lies in to `is_global_write`. Both `(x, y)` and `(x, -y)` satisfied every existing check, so the public summary of cross-shard RAM flow was unbound. The defect survives recursion (the reporter's PoC verifies through the recursion verifier program).

## Design Rationale

The approach borrows the **idea** from SP1's `crates/core/machine/src/operations/global_interaction.rs:210-236`, not its column layout. Three pieces:

1. **Offset by +1.** Express `y[6]` in terms of a fresh witness `y6_lo` so that `y[6] = 0` is never valid in either branch (zero is invariant under negation, so it cannot distinguish read from write).
2. **Safe band + prover retry.** Restrict `y6_lo` to `[0, (p-1)/2)`. For the rare exception `y[6] = 0` (probability `~1/p ≈ 2^-31` per record) the host rejects and retries with a new `nonce`.
3. **Byte-decomposition range check.** `y6_lo` is decomposed into four byte limbs `b0..b3` (`assert_byte` for `b0..b2`, `lookup_ltu_byte(b3, 60, 1)` for `b3`). For BabyBear, `(p-1)/2 = 60·2^24` exactly, so `b3 < 60` gives the tightest no-overlap band.
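
The three pieces can be checked numerically with a host-side sketch. This is not the circuit code: `split_y6` and `in_safe_band` are hypothetical helpers, the little-endian limb order (`b3` most significant) is an assumption implied by the `b3 < 60` bound, and the branch mapping assumed is read `y[6] = y6_lo + 1`, write `y[6] = p - 1 - y6_lo`.

```rust
/// BabyBear modulus, p = 0x78000001.
const P: u32 = 0x7800_0001;
/// (p - 1) / 2, which equals 60 * 2^24 for BabyBear.
const HALF: u32 = (P - 1) / 2;

/// Models the range check: b0..b2 are bytes by construction,
/// and the top limb b3 must be strictly below 60.
fn in_safe_band(y6_lo: u32) -> bool {
    let [_b0, _b1, _b2, b3] = y6_lo.to_le_bytes();
    b3 < 60
}

/// Hypothetical host helper: map a canonical y[6] to (is_write, y6_lo).
/// Returns None for y[6] == 0, where the host would retry with a new nonce.
fn split_y6(y6: u32) -> Option<(bool, u32)> {
    assert!(y6 < P, "y[6] must be canonical");
    if y6 == 0 {
        return None;
    }
    if y6 > P / 2 {
        Some((true, P - 1 - y6)) // write branch: upper half
    } else {
        Some((false, y6 - 1)) // read branch: lower half
    }
}

fn main() {
    assert_eq!(HALF, 60u32 << 24);
    // Negating y (y -> p - y) keeps y6_lo but flips the branch, so a circuit
    // that binds the branch to is_global_write rejects the flipped point.
    let y6 = 123_456u32;
    let (w, lo) = split_y6(y6).unwrap();
    let (w_neg, lo_neg) = split_y6(P - y6).unwrap();
    assert!(in_safe_band(lo) && lo == lo_neg && w != w_neg);
    println!("y6_lo = {lo}, read -> write under negation: {w} -> {w_neg}");
}
```

The boundary value `(p-1)/2` lands in the read branch with `y6_lo = (p-1)/2 - 1`, whose top byte is 59, so it still passes the `b3 < 60` check.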
In-circuit branch equality via `condition_require_equal`: - read (`is_global_write = 0`): `y[6] = y6_lo + 1` ⇒ `y[6] ∈ [1, (p-1)/2]` - write (`is_global_write = 1`): `y[6] = p - 1 - y6_lo` ⇒ `y[6] ∈ [(p+1)/2, p-1]` Union covers `[1, p-1]` with no overlap; `y[6] = 0` is excluded. **Why not a single `AssertLtConfig(y6_lo, (p-1)/2, max_bits=30)`?** On BabyBear (`p = 0x78000001`, 31-bit) the AssertLt gadget only constrains `lhs - rhs ≡ diff - 2^max_bits (mod p)` with `diff ∈ [0, 2^30)` — it does not pre-bound `lhs` to be canonical-small. A malicious `y6_lo ∈ [0x74000001, p-1]` (≈ 2^26 values) produces a *field-wrap* diff that still fits in 30 bits, so the constraint accepts upper-half values and the exploit survives. Byte-decomposing first kills the wrap. Ceno's `DynamicRangeTableCircuit<E, 18>` also does not carry 30-bit lookup entries, so a direct `assert_const_range(_, 30)` is not available anyway. **Why M = 60 (vs SP1's 63).** SP1 targets KoalaBear; its `(p-1)/2 = 0x3f800000`, so 63 leaves a small safety band. For BabyBear, `(p-1)/2 = 60·2^24` exactly — 63 would let `y[6]` straddle `p/2` and reintroduce the ambiguity. Also corrects the stale comment that previously had the convention reversed (claimed write ⇒ lower half, opposite of what the host code does). ## Change Highlights ### `ceno_zkvm/src/tables/shard_ram.rs` — chip-level y-sign binding - `ShardRamRecord::to_ec_point`: reject `y6 == 0` and try the next `nonce`. Classify with strict `y6 > prime / 2` so the boundary `(p-1)/2` correctly stays in the read region (a previous draft used `>=` which misclassified that single boundary value and would have produced an out-of-range `y6_lo` for both branches). - `ShardRamConfig`: new field `y6_lo_bytes: [WitIn; 4]`. - `ShardRamConfig::configure`: replace the TODO with the byte decomposition, byte-range / LTU lookups, and the `condition_require_equal` branch equality. 
- `ShardRamCircuit::assign_instance`: compute `y6_lo` from `y[6]` and `is_to_write_set` via a small `y6_lo_value` helper, assign byte limbs, register byte and LTU multiplicities.
- New test `test_shard_ram_y_sign_circuit_rejects_negation` drives `assign_instances_with_lk_multiplicities` + `MockProver` over one honest row and one sign-flipped row, asserting `lookup_Ltu` rejects the tampered witness. A concrete challenge is supplied so the no-challenge `run` path doesn't drop `structural_witin`.

### Lookup-multiplicity plumbing for ShardRam

ShardRam's per-row y6_lo byte / LTU lookups must reach `combined_lk_mlt` so the U8 / LTU table `mlt` columns balance. ShardRam runs after opcode + dummy circuits, before `finalize_lk_multiplicities`. To surface mlt without burdening every other table circuit:

- `ceno_zkvm/src/tables/mod.rs`: `TableCircuit` trait gains a second default-unimplemented method `assign_instances_with_lk_multiplicities` alongside the existing `assign_instances`. ShardRam overrides the former; every other table keeps overriding the latter.
- `ceno_zkvm/src/structs.rs`: `ZKVMWitnesses::assign_shared_circuit` threads a `LkMultiplicity::default()` through ShardRam's parallel-chunk witgen and inserts `lk_multiplicity.into_finalize_result()` into `lk_mlts["ShardRamCircuit"]` before finalize. Asserts swap from `combined_lk_mlt.is_some()` to `is_none()` to lock the ordering. `assign_table_circuit` tolerates `combined_lk_mlt = None` by passing an empty multiplicity slice, so `LocalFinalCircuit` (which ignores the argument anyway) can also run before finalize.
- `ceno_zkvm/src/e2e.rs`: move `MmuConfig::assign_continuation_circuit` (LocalFinal + ShardRam) to just before `finalize_lk_multiplicities`. Mirror the move inside the GPU debug-compare block so `combined_lk_mlt` diff stays meaningful.
- `ceno_zkvm/src/instructions/riscv/rv32im/mmu.rs`: docstring updated to describe the new ordering invariant.
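The read/write branch mapping from the Design Rationale, which the `y6_lo_value` helper computes on the host side, might look roughly like the sketch below. The signature and the `Option` return type are assumptions made for illustration; the actual helper in `shard_ram.rs` may differ.

```rust
/// BabyBear modulus as quoted in this PR description.
const P: u32 = 0x7800_0001;

/// Hypothetical stand-in for the `y6_lo_value` helper named in the PR.
/// Returns `None` when `y6` is 0 (the host would retry with a new nonce)
/// or when `y6` lies in the half that does not match the flag.
fn y6_lo_value(y6: u32, is_to_write_set: bool) -> Option<u32> {
    let half = (P - 1) / 2;
    if y6 == 0 || y6 >= P {
        return None;
    }
    if is_to_write_set {
        // write branch: y6 = p - 1 - y6_lo, so y6 must be in [(p+1)/2, p-1]
        (y6 > half).then(|| P - 1 - y6)
    } else {
        // read branch: y6 = y6_lo + 1, so y6 must be in [1, (p-1)/2]
        (y6 <= half).then(|| y6 - 1)
    }
}
```

In this model, negating `y6` while keeping the same flag moves it into the other half, so no in-band `y6_lo` exists and the tampered record has no valid witness.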
### Device-resident GPU shortcut for ShardRam (mlt mirror)

`ZKVMWitnesses::try_assign_shared_circuit_gpu` dispatches into `instructions::gpu::chips::shard_ram::try_gpu_assign_shared_circuit` to keep the continuation EC computation device-resident (`gpu_batch_continuation_ec_on_device` + `merge_and_partition_records`) when `is_gpu_witgen_enabled()`. The GPU kernels never enter the CPU `assign_instance` per-row push, so the y6_lo lookup multiplicity is derived host-side:

- After step 6 of `try_gpu_assign_shared_circuit` (merge+partition), D2H `partitioned_buf` once to `Vec<u32>` and walk it with stride `record_u32s = 26` (`GpuShardRamRecord` `#[repr(C)]` layout). Per record extract `is_to_write_set` (u32 offset 10) and `point_y[6]` (u32 offset 25), compute `y6_lo`, push the same 4 lookup queries the CPU path emits per row, then `into_finalize_result()` and return alongside the chunked `Vec<ChipInput<E>>`. `debug_assert_eq!(record_u32s, 26)` guards against `ceno_gpu` layout drift.
- `try_assign_shared_circuit_gpu` inserts both `ChipInput` and the derived multiplicity into `self.witnesses` / `self.lk_mlts["ShardRamCircuit"]` so finalize folds the GPU-path contribution into `combined_lk_mlt` the same way the CPU shortcut does.

### Verifier: account for `has_ecc_ops` row doubling

`ShardRamCircuit::has_ecc_ops()` adds an extra hypercube variable; the chip matrix has `2 * next_pow2(num_instance)` rows where the back half is EC-tree internal nodes with `selector_zero = 0`. Before this fix the chip had `num_lks = 0`, so the verifier's `dummy_table_item_multiplicity` correction never had to consider it. With the new byte/LTU queries the correction under-counted dummy lookups by a factor of 2 and shard verification failed with `logup_sum != 0`.

- `ceno_zkvm/src/scheme/verifier.rs`: multiply `next_pow2_instance` by 2 when `circuit_vk.get_cs().has_ecc_ops()`.
- `ceno_recursion/src/zkvm_verifier/verifier.rs`: mirror the same adjustment in the recursive verifier (lockstep per CLAUDE.md).

## Benchmark / Performance Impact

Per ShardRam row this PR adds **4 byte WitIn columns** plus 3 byte-range and 1 LTU lookup multiplicities. ShardRam rows scale with cross-shard RAM events, not with cycles, so the absolute cost is sub-percent on the prover. No full prover bench was rerun (no hot-loop arithmetic changed). Existing `test_shard_ram_circuit` (170k reads + 1420 writes, full chip proof) runtime is unchanged within noise:

```text
master  : ~5.0 s
this PR : ~5.0 s
```

## Testing

```sh
cargo fmt --all --check
cargo check --workspace --all-targets
cargo check --workspace --all-targets --release
cargo make clippy
cargo clippy --workspace --all-targets --release -- -D warnings
RUST_MIN_STACK=33554432 cargo test --workspace --lib --release
cargo run --release --package ceno_zkvm --features sanity-check --bin e2e -- \
  --platform=ceno --max-cycle-per-shard=20000 --hints=10 --public-io=4191 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/fibonacci
```

All pass locally on BabyBear. `test_shard_ram_circuit` and `test_shard_ram_y_sign_circuit_rejects_negation` are green. End-to-end multi-shard fibonacci verifies `ShardRamCircuit` and `LocalRAMTableFinal` on every shard with `exit code 0. Success.`

`cargo make tests` / `cargo make tests_goldilock` should be re-run by CI; the change is gated to BabyBear via a `debug_assert_eq!` on `MODULUS_U64` and goldilocks does not exercise shard_ram (per `integration.yml` commented-out lines and CLAUDE.md).

## Risks and Rollout

- **Soundness.** Closes #1338. The new constraint only adds local byte arithmetic and existing lookups, with no change to transcript, sumcheck, PCS, or EC accumulation. Recursive and native verifiers move in lockstep (the `has_ecc_ops` row-factor fix lands in both).
- **GPU.** The device-resident GPU shortcut now derives the y6_lo lookup multiplicity host-side from the merged partitioned device buffer (single D2H of ~26 u32 × records). Layout assumption is guarded by `debug_assert_eq!(record_u32s, 26)` against `ceno_gpu::GpuShardRamRecord`. CPU + GPU paths converge on the same `combined_lk_mlt` contribution; runtime verification with `CENO_GPU_ENABLE_WITGEN=1 --features gpu` on a CUDA host is recommended before tag.
- **Recursion.** The recursive verifier mirrors the native verifier's `has_ecc_ops` × 2 row adjustment; no separate constraint-system change is needed for the y-sign binding itself.
- **Field support.** Hardcodes the BabyBear constant `M = 60`. A `debug_assert_eq!(MODULUS_U64, 0x78000001, ...)` guards against accidental use on a different field; shard_ram is BabyBear-only today per CLAUDE.md.

## Follow-ups

- The remaining #1340 TODOs (`local read ⇄ global write` pairing on `shard_ram.rs:235-236`, `shard == shard_id` binding on line 244) are intentionally out of scope here.

Fixes #1338. Partially addresses #1340.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
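The field-wrap argument against a single 30-bit `AssertLtConfig` (see the Design Rationale of the commit above) can be reproduced numerically. The sketch below is only a model of the relation quoted there, `lhs - rhs ≡ diff - 2^max_bits (mod p)` with `diff` range-checked to 30 bits; it is not Ceno's gadget, and `required_diff` / `lt_gadget_accepts` are hypothetical names. Inputs are assumed to be canonical field elements (less than `p`).

```rust
/// BabyBear modulus as quoted in the PR description.
const P: u64 = 0x7800_0001;
/// Bit width of the range check on `diff` in the modelled gadget.
const MAX_BITS: u32 = 30;

/// The `diff` witness needed so that
/// lhs - rhs ≡ diff - 2^MAX_BITS (mod p), reduced into [0, p).
fn required_diff(lhs: u64, rhs: u64) -> u64 {
    (lhs + P - rhs + (1u64 << MAX_BITS)) % P
}

/// True when that `diff` passes a MAX_BITS-bit range check, i.e. when
/// the modelled gadget would accept the claim `lhs < rhs`.
fn lt_gadget_accepts(lhs: u64, rhs: u64) -> bool {
    required_diff(lhs, rhs) < (1u64 << MAX_BITS)
}
```

With `rhs = (p-1)/2`, this model accepts honest small values but also every `lhs` in `[0x74000001, p-1]`, which is `2^26` values, matching the count given in the description.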
## Problem

Left over from #923. CPU trace commit cloned large witness MLEs before proving, adding avoidable memory traffic on the prover hot path.

## Design Rationale

Keep committed witness MLEs behind `Arc` and drain/transport ownership where possible, avoiding deep clones without changing proof semantics.

## Change Highlights

- `ceno_zkvm`: return `Arc` witness MLEs from trace commit and consume structural MLEs during transport.
- `ceno_zkvm`: keep GPU trait shape aligned while preserving existing GPU behavior.

## Benchmark / Performance Impact

### Operation

| Operation | master (s) | this PR (s) | Improve (master -> this PR) |
|-----------|------------|-------------|-----------------------------|
| CPU proving, keccak e2e shard total | 6.942 | 6.596 | 4.98% faster |
| GPU proving, keccak e2e shard total | 1.191 | 1.186 | 0.44% faster |

### Layer

| Layer | master (s) | this PR (s) | Improve (master -> this PR) |
|-------|------------|-------------|-----------------------------|
| N/A: shard-level proving total measured | N/A | N/A | no regression observed |

Benchmark command(s):

```sh
cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

Environment: local x86_64 Linux, release build, local `../ceno-gpu/cuda_hal` patch for GPU validation.
raw data:

- master: CPU shards `3.272s + 3.670s`; GPU shards `0.624s + 0.568s`
- this PR: CPU shards `3.336s + 3.260s`; GPU shards `0.593s + 0.593s`

## Testing

```sh
cargo check --config net.git-fetch-with-cli=true --package ceno_zkvm --bin e2e
cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
cargo run --config net.git-fetch-with-cli=true --features gpu --release --package ceno_zkvm --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

## Risks and Rollout

Low risk: prover-side ownership change only. Rollback is reverting the `Arc` witness-MLE plumbing.

## Follow-ups (optional)

None.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
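The ownership idea in the Design Rationale of the commit above (share witness MLEs behind `Arc`, take ownership where possible) can be illustrated with a generic sketch. `Mle`, `share`, and `take_or_clone` are hypothetical names for this example only and do not correspond to Ceno's types or functions.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a committed witness MLE.
struct Mle {
    evals: Vec<u64>,
}

/// Hands out another handle to the same allocation; no evaluations are copied.
fn share(mle: &Arc<Mle>) -> Arc<Mle> {
    Arc::clone(mle)
}

/// Takes the evaluations without copying when this is the last handle,
/// and falls back to a deep clone only if the MLE is still shared.
fn take_or_clone(mle: Arc<Mle>) -> Vec<u64> {
    match Arc::try_unwrap(mle) {
        Ok(owned) => owned.evals,
        Err(shared) => shared.evals.clone(),
    }
}
```

Cloning an `Arc` only bumps a reference count, which is why returning `Arc`-wrapped MLEs from the commit step avoids the deep copies mentioned in the Problem section.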
## What

A self-hosted **GPU CI runner** for Ceno, plus a workflow that runs GPU integration tests on it.

## `ci/gpu-runner/`: the runner

- **Dockerfile**: `nvidia/cuda:12.8-devel-ubuntu24.04` + actions runner + pinned Rust toolchain. No secrets baked in.
- **entrypoint.sh**: mints a fresh registration token from a PAT each start (survives restarts); runs an **ephemeral** runner (clean container per job).
- **start-runner.sh**: builds + `docker run --gpus all`; persists cargo registry and `CARGO_TARGET_DIR` on volumes for warm rebuilds.
- **watchdog.sh**: cron-driven; restarts the container on stop/crash or GPU-unreachable. `flock`-guarded against overlap.
- **README.md**: host setup, secrets, cron registration.

## `.github/workflows/gpu-integration.yml`: the test

Proves an example (default `keccak_syscall`) with `--features gpu` on the `gpu`-labeled runner. Triggered manually (`workflow_dispatch`) or by a `gpu-ci` PR label. Release-only by design (debug + release would compile the workspace twice).

Steps: load `CENO_GPU_DEPLOY_KEY` → clone private `ceno-gpu` (`--recurse-submodules`) → activate its Cargo `[patch]` → build + prove.

## Secrets

- **`GITHUB_PAT`** (host `runner.env`, gitignored): registers the runner.
- **`CENO_GPU_DEPLOY_KEY`** (repo secret): read-only deploy key for the private `ceno-gpu` backend, loaded per-job via ssh-agent.

No existing workflows are modified.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
#1349)

Closes #356.

## Summary

- Add `ZKVMVerifyingKey::compute_digest` / `ZKVMProvingKey::compute_vk_digest`: one `bincode::serialize` pass over the VK preimage fed to a fresh `DefaultChallenger` (Poseidon sponge), squeezing `VK_DIGEST_LEN = 2` extension-field elements.
- Absorb those felts into the protocol transcript at the start of both prover and verifier (before public-input absorption).
- Mirror the same absorption in the recursion verifier with the digest baked in as in-circuit `builder.constant(...)` elements.
- Mark `ConstraintSystem.debug_map` (non-deterministic `HashMap`) and `ZKVMVerifyingKey.circuit_index_to_name` (recoverable from `circuit_vks` lex order, kept live for verifier error labels) `#[serde(skip)]` so neither contaminates the canonical preimage bytes.

A swapped or mutated VK now flips the very first Fiat-Shamir challenge, and verification rejects at the first sumcheck claim.

## Files touched

| File | Change |
| --- | --- |
| `ceno_zkvm/src/structs.rs` | digest helpers + `#[serde(skip)] circuit_index_to_name` |
| `ceno_zkvm/src/scheme/prover.rs` | absorb `vk_digest` into transcript |
| `ceno_zkvm/src/scheme/verifier.rs` | absorb `vk_digest` into transcript |
| `ceno_recursion/src/zkvm_verifier/verifier.rs` | mirror absorb with build-time constants |
| `gkr_iop/src/circuit_builder.rs` | `#[serde(skip)] debug_map` |
| `ceno_zkvm/Cargo.toml` | add `poseidon` direct dep |

## Test plan

- [x] `cargo fmt --all --check`
- [x] `cargo check --workspace --all-targets` (debug + release)
- [x] `cargo make clippy` + release clippy
- [x] `cargo make tests`: 0 failed
- [x] `cargo make tests_goldilock`: 0 failed
- [x] Local integration e2e (mirror of `.github/workflows/integration.yml`, 24 steps: fibonacci / ceno_rt_alloc / keccak / secp256k1 / bn254 / k256 / p256 / uint256 / sha / aggregation e2e): all 24 STEP OK; the aggregation e2e validates the recursion mirror end-to-end.
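The effect described in the Summary (a changed VK changes the first challenge) can be shown with a toy transcript. This sketch substitutes the standard library's `DefaultHasher` for the Poseidon sponge purely so it runs without dependencies; it is not cryptographically meaningful, and `ToyTranscript`, `vk_digest`, and `first_challenge` are hypothetical names, not Ceno APIs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy transcript: a running hash state standing in for a sponge.
struct ToyTranscript {
    state: DefaultHasher,
}

impl ToyTranscript {
    fn new() -> Self {
        Self { state: DefaultHasher::new() }
    }

    /// Absorbs bytes into the running state.
    fn absorb(&mut self, bytes: &[u8]) {
        self.state.write(bytes);
    }

    /// Derives a challenge from everything absorbed so far.
    fn challenge(&self) -> u64 {
        self.state.finish()
    }
}

/// Toy digest over serialized VK bytes (stand-in for the real VK digest).
fn vk_digest(vk_bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(vk_bytes);
    h.finish()
}

/// Absorbs the VK digest first, then the public input, and returns the
/// first challenge, mirroring the ordering described in the Summary.
fn first_challenge(vk_bytes: &[u8], public_input: &[u8]) -> u64 {
    let mut t = ToyTranscript::new();
    t.absorb(&vk_digest(vk_bytes).to_le_bytes());
    t.absorb(public_input);
    t.challenge()
}
```

Because the digest is absorbed before anything else, prover and verifier only derive the same challenges when they hold the same VK bytes.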
## Compatibility

Proof format changes; any cached proofs / `vk_bytes` artifacts must be regenerated. README marks the project pre-production, so a breaking proof-format change is acceptable.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Problem

Dynamic RAM init could write non-zero values into committed witness padding rows. That violates the expected zero-padding invariant for RMM witness commitments and causes CPU shard verification to fail with `InvalidPcsOpen`.

## Design Rationale

Keep the existing padded CPU proving flow and add the missing dynamic-init guard so only real instances populate committed witness rows. This preserves the RMM zero-padding contract without changing verifier logic.

Related: scroll-tech/gkr-backend#62

## Change Highlights

- `ceno_zkvm`: guard dynamic RAM init witness writes with `i < num_instances` so padded witness rows remain zero.
- No verifier changes.

## Benchmark / Performance Impact

### Operation

CPU benchmark: block `23587691`, shard `0`, `CENO_MAX_CELL_PER_SHARD=805306368`.

| Operation | baseline (s) | this PR (s) | Improve (before -> this PR) |
|-----------|--------------|-------------|-----------------------------|
| reth-block | 135 | 124 | 8.1% |
| app.prove | 134 | 123 | 8.2% |
| app.verify | 0.300 | 0.288 | 4.0% |

### Layer

| Layer | before: no guard + remote gkr (s) | this PR (s) | Improve (before -> this PR) |
|-------|-----------------------------------|-------------|-----------------------------|
| create_proof_of_shard | 130 | 119 | 8.5% |
| commit_traces | 17.2 | 10.3 | 40.1% |
| prove_batched_main_constraints | 32.4 | 32.0 | 1.2% |
| pcs_opening | 36.8 | 33.9 | 7.9% |

Benchmark command(s):

```sh
CENO_MAX_CELL_PER_SHARD=805306368 \
OUTPUT_PATH=metrics_23587691_shard0_cpu_dynamic_guard_no_gkrzero_maxcell805306368_20260608.json \
RUST_LOG=info \
target/release/ceno-reth-benchmark-bin \
  --block-number 23587691 \
  --chain-id 1 \
  --cache-dir block_data \
  --mode prove-app \
  --app-proofs ./app_proof.bitcode \
  --shard-id 0
```

Baseline used the same benchmark with Ceno patched to `902b3e3c^` (`7d8086c2`) and remote `gkr-backend` tag `v1.0.0-alpha.31`.
Environment (CPU/GPU, core count, rust toolchain, commit hash):

- CPU shard run, `rustc 1.93.0-nightly (07bdbaedc 2025-11-19)`
- before: Ceno `7d8086c2`, remote `gkr-backend v1.0.0-alpha.31`
- this PR: Ceno `902b3e3c`

raw data:

- before: `sanity_23587691_shard0_cpu_fresh_baseline_remote_gkr_no_guard_maxcell805306368_20260608.log`, failed verification with `InvalidPcsOpen`
- this PR: `sanity_23587691_shard0_cpu_dynamic_guard_no_gkrzero_maxcell805306368_20260608.log`, passed verification

## Testing

```sh
cargo check --package ceno_zkvm
CENO_MAX_CELL_PER_SHARD=805306368 target/release/ceno-reth-benchmark-bin --block-number 23587691 --chain-id 1 --cache-dir block_data --mode prove-app --app-proofs ./app_proof.bitcode --shard-id 0
```

## Risks and Rollout

Low risk: the change only skips dynamic RAM init writes for rows outside `num_instances`. Rollback is reverting this guard.

## Follow-ups (optional)

None.

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.
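The padding guard from the Change Highlights of the commit above can be pictured with a small sketch. It is a generic illustration of the `i < num_instances` check on a padded column; `fill_padded` and its parameters are hypothetical and do not mirror Ceno's witness-assignment code.

```rust
/// Fills a zero-initialized column padded to a power of two, writing only
/// rows that belong to real instances so the padding rows stay zero.
fn fill_padded(init_values: &[u64], num_instances: usize) -> Vec<u64> {
    let padded_len = num_instances.max(1).next_power_of_two();
    let mut rows = vec![0u64; padded_len];
    for (i, row) in rows.iter_mut().enumerate() {
        // The guard: rows at or beyond num_instances are never written,
        // even if the source slice happens to carry data for them.
        if i < num_instances {
            if let Some(v) = init_values.get(i) {
                *row = *v;
            }
        }
    }
    rows
}
```

For example, with four source values but only three real instances, the fourth row is left at zero rather than picking up the stray value.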
## Problem

`HintsTable` is the **only** RAM table with prover-witnessed (non-zero) init values; it holds the guest's private hint inputs. `DynVolatileRamTableInitConfig::construct_circuit` created those init-value limbs as raw `WitIn`s **with no range check**, so a malicious prover could supply non-canonical limbs (`>= 2^16`). The load path reads memory via `UInt::new_unchecked` (e.g. `LW` in `load_v2.rs`) and forwards the limbs to the destination register unconstrained, so a non-canonical hint word would propagate into the computation.

This is an **under-constraint soundness bug**: a crafted proof using out-of-range hint limbs would verify.

## Design Rationale

Bind every non-zero-init limb to `LIMB_BITS` (u16) in `construct_circuit`, making each reconstructed hint word a canonical u32, the same range discipline every other prover-supplied word in the system already follows. The fix is on the constraint side (the load-bearing surface for soundness); per-row addresses are already formula-bound structural witins and the table base (`hint_start_addr`) is already range-checked by the verifier's `validate_mem_state`, so this closes the one remaining unconstrained axis (the init **value** limbs).

A range check is a *looking* lookup, which the default `TableCircuit::build_gkr_iop_circuit` does not wire for table circuits, so the fix needs three coordinated touches (below). For zero-init tables (heap, stack) the looking lengths are 0, so all wiring stays **byte-for-byte identical** to the current behaviour.

## Change Highlights

- **`ceno_zkvm/tables/ram/ram_impl.rs`**: range-check every non-zero-init limb (`assert_ux::<LIMB_BITS>`) in `construct_circuit`; thread an optional `LkMultiplicity` through `assign_instances` / `assign_instances_dynamic` so the per-limb u16 lookups are recorded. Adds the #999 soundness regression test.
- **`ceno_zkvm/tables/ram/ram_circuit.rs`**: `DynVolatileRamCircuit` overrides `build_gkr_iop_circuit` to size the r/w/**lk**/zero out-eval groups with looking + table lengths; adds `assign_instances_with_lk_multiplicities`.
- **`ceno_zkvm/structs.rs`**: new `ZKVMWitnesses::assign_table_circuit_with_lk` (mirrors `assign_shared_circuit`) to fold a table circuit's own lookup multiplicity into `lk_mlts` before finalize.
- **`ceno_zkvm/instructions/riscv/rv32im/mmu.rs`**: route `HintsInitCircuit` through `assign_table_circuit_with_lk`; `HeapInitCircuit` stays on the plain path (zero-init, no lookups).
- **`ceno_zkvm/e2e.rs`**: move the dynamic-init-table assignment **before** `finalize_lk_multiplicities` so the hint range-check lookups land in `combined_lk_mlt`, matching the existing pre-finalize ShardRam ordering (main + GPU-debug-compare paths).

Rebased onto current `master`; merged with #1350's dynamic-init padding guard so the u16 multiplicity is recorded only for real rows (`i < num_instances && let Some(rec) = rec_opt`), consistent with the prefix selector that gates the range-check constraints.

## Benchmark / Performance Impact

Not benchmarked. The change adds only `O(hint_words)` u16 range-lookups during witness generation and leaves the verifier's structural shape unchanged (zero-init tables produce identical circuits). Prover impact is negligible relative to the per-shard logup it already performs.

## Testing

```sh
cargo check --workspace --all-targets
cargo make clippy                              # -D warnings
cargo test -p ceno_zkvm tables::ram::ram_impl  # incl. new #999 regression
```

- **New soundness regression** `test_hint_init_rejects_non_canonical_limb`: honest (canonical) limbs satisfy every range lookup; forcing any single init limb to `2^LIMB_BITS` makes the `init_v_limb_{i}_in_u16` lookup fall outside the u16 table and `MockProver` rejects the witness.
  **The test fails if the range check is removed.**
- Real (non-mock) prove + verify of **fibonacci with hints**, single shard and 61-shard: the global logup balances including the hint circuit's range-check records (a dropped lookup would unbalance the recorded multiplicity).
- Integration e2e suite (mirror of `.github/workflows/integration.yml`); the fibonacci `--hints` MOCK path is confirmed passing on this branch.

## Risks and Rollout

- **Soundness:** strictly tightening; adds a constraint, removes an under-constraint. No verifier semantic-contract change.
- **Compatibility:** zero-init tables (heap/stack) are wired identically; only `HintsTable` gains the lk group.
- **Rollback:** revert the commit; no migrations or persisted state.

## Follow-ups (optional)

- None required for soundness. (Optional defense-in-depth, separate from this PR: make the u32 arithmetic in the verifier's `validate_mem_state` hint/heap bound explicit via `checked_add`/`checked_mul`.)

## Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
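One way to see why un-range-checked hint limbs matter (the Problem section of the commit above) is that limb decompositions stop being unique. The sketch below uses plain integers and a two-limb word with `LIMB_BITS = 16`, matching the u16 limb width mentioned in the description; `reconstruct` and `limbs_are_canonical` are hypothetical helper names, and this is an illustration of aliasing rather than a reproduction of the actual exploit.

```rust
/// Limb width in bits (u16 limbs, as stated in the PR description).
const LIMB_BITS: u32 = 16;

/// Reconstructs a word from two limbs the way a linear expression would,
/// without assuming the limbs are canonical.
fn reconstruct(limbs: [u64; 2]) -> u64 {
    limbs[0] + (limbs[1] << LIMB_BITS)
}

/// The property the added range check enforces on every init limb.
fn limbs_are_canonical(limbs: [u64; 2]) -> bool {
    limbs.iter().all(|&l| l < (1u64 << LIMB_BITS))
}
```

Here `[2^16, 0]` and `[0, 1]` reconstruct to the same word, but only the second passes the canonicality check, so with the range check in place each word has a single valid limb representation.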
Resolve merge conflicts from merging origin/master (HintsTable range-check fix) into feat/recursion-v2 (forked transcript + bump-p3). Take HEAD's versions for prover/verifier architecture and p3 API renames (from_canonical_* -> from_*, FieldAlgebra -> PrimeCharacteristicRing). gkr-backend pinned to feat/bump-p3 branch, p3-field bumped to 0.4.3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

- Pin gkr-backend to the `feat/bump-p3` branch, bump p3-field to 0.4.3
- p3 API renames: `from_canonical_*` → `from_*`, `FieldAlgebra` → `PrimeCharacteristicRing`

## Note

Build has remaining structural API mismatches (`from_canonical_*` in non-conflicted files, missing struct fields) that need follow-up work.

🤖 Generated with Claude Code