
Table scan rejects current-schema column names after UpdateSchemaAction commit #2565

@nazq

Label: bug

Is your feature request related to a problem or challenge?

A default TableScanBuilder::build() validates caller-supplied column names against the snapshot's schema, not the table's current schema. After an UpdateSchemaAction commit changes the current schema (rename / add / delete column), pre-existing snapshots still point at the pre-evolution schema_id, so the scan rejects names that are valid against the post-evolution schema.

Reproducer

Setup: any Iceberg table with at least one snapshot. Apply a schema-evolution transaction using UpdateSchemaAction (the action shipped in #2120):

let tx = Transaction::new(&table);
let action = tx.update_schema()
    .add_column(AddColumn::optional("note", Type::Primitive(PrimitiveType::String)));
let tx = action.apply(tx)?;
let table = tx.commit(&catalog).await?;

The catalog now reports the post-evolution schema (verified via catalog.load_table().metadata().current_schema()). But a scan over the same Table:

table.scan().select(["note"]).build()

returns:

DataInvalid => Column note not found in table. Schema: table {
  1: id: optional long
  2: name: optional string
  3: tmp: optional double
}

The schema dump is the snapshot's schema — the column added a moment ago is missing.

Root cause

crates/iceberg/src/scan/mod.rs:221:

let schema = snapshot.schema(self.table.metadata())?;

snapshot.schema(metadata) resolves the snapshot's schema_id against metadata.schemas and returns the schema the snapshot was written under. For time-travel scans (.snapshot_id(...)) that's exactly right — the caller is asking for "the table as it existed at this snapshot." But for a default scan, the caller is asking for "the table as it is now," and the post-evolution columns are legitimately part of that vocabulary.

The downstream Parquet projection (crates/iceberg/src/arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids) already maps field IDs to on-disk column names via PARQUET:field_id metadata, so resolving names against the current schema is safe end-to-end — field IDs are stable across schema versions, and the file's original column names live in the parquet metadata until the file is rewritten. PyIceberg's reader (pyiceberg/io/pyarrow.py::_task_to_record_batches) implements exactly this pattern: project by field ID, rename the arrow batch on the way out.
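To illustrate why name resolution against the current schema is safe, here is a minimal std-only sketch of "project by field ID, emit the current name". It is a toy model, not the iceberg-rust or PyIceberg code: `Field`, `FileColumn`, `Projected`, and `project_by_field_id` are hypothetical names invented for this example, and a missing file column is modelled as `None` on the assumption that the reader null-fills columns added after the file was written.

```rust
use std::collections::HashMap;

/// Toy stand-in for a schema field: a stable field ID plus its current name.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub id: i32,
    pub name: String,
}

/// Toy stand-in for a data file column: the field ID recorded in the file's
/// metadata plus the name the column had when the file was written.
#[derive(Debug, Clone, PartialEq)]
pub struct FileColumn {
    pub field_id: i32,
    pub on_disk_name: String,
}

/// One projected output column: the name from the current schema and the
/// index of the matching file column, or `None` when the file predates the
/// column (assumed here to be null-filled by the reader).
#[derive(Debug, Clone, PartialEq)]
pub struct Projected {
    pub output_name: String,
    pub file_index: Option<usize>,
}

/// Resolve `selected` names against `current_schema`, then locate each
/// column in the file by field ID rather than by name.
pub fn project_by_field_id(
    current_schema: &[Field],
    file_columns: &[FileColumn],
    selected: &[&str],
) -> Result<Vec<Projected>, String> {
    // Index the file's columns by field ID once.
    let by_id: HashMap<i32, usize> = file_columns
        .iter()
        .enumerate()
        .map(|(idx, col)| (col.field_id, idx))
        .collect();

    let mut out = Vec::with_capacity(selected.len());
    for name in selected {
        // Names are validated against the current schema only.
        let field = current_schema
            .iter()
            .find(|f| f.name == *name)
            .ok_or_else(|| format!("Column {name} not found in table"))?;
        out.push(Projected {
            output_name: field.name.clone(),
            // The on-disk name is never consulted; only the field ID is.
            file_index: by_id.get(&field.id).copied(),
        });
    }
    Ok(out)
}
```

In this model a renamed column (same field ID, new name) still resolves to its original file column, a newly added column resolves to `None`, and the pre-rename name is rejected because it is no longer in the current schema.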

Why this wasn't caught upstream

UpdateSchemaAction (#2120) shipped with metadata-only tests in crates/catalog/loader/tests/schema_update_suite.rs; none of them call table.scan().select(...) after the schema commit. The pre-existing crates/integration_tests/tests/read_evolved_schema.rs only uses table.scan().build() with no column selection, which bypasses the column-name validation loop entirely (it falls through to column_names.unwrap_or_else(|| schema.as_struct().fields()...)).
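The bypass can be sketched in a few lines of std-only Rust. This is a simplified model of the behaviour described above, not the code in scan/mod.rs: `resolve_columns` and its signature are hypothetical, and the schema is reduced to a slice of column names.

```rust
/// Toy version of the column-resolution step. `requested` is what the caller
/// passed to the scan builder (if anything); `schema_fields` is the schema
/// that the scan validates against.
pub fn resolve_columns(
    schema_fields: &[&str],
    requested: Option<Vec<String>>,
) -> Result<Vec<String>, String> {
    match requested {
        // Explicit selection: every name is checked against the schema.
        Some(names) => {
            for name in &names {
                if !schema_fields.contains(&name.as_str()) {
                    return Err(format!("Column {name} not found in table"));
                }
            }
            Ok(names)
        }
        // No selection: the names come from the schema itself, so the
        // validation loop never runs and cannot fail.
        None => Ok(schema_fields.iter().map(|f| f.to_string()).collect()),
    }
}
```

With the snapshot's schema as `schema_fields`, the `None` path always succeeds, which is why a test that only calls build() without selecting columns cannot surface the mismatch.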

So a column-name lookup combined with a schema-evolved table is the gap. Both add_column and delete_column (already in main) trigger it; rename_column (#2563) trips it even more cleanly because the old name continues to exist on disk.

Describe the solution you'd like

Branch on whether the caller asked for a specific snapshot:

let schema = if self.snapshot_id.is_some() {
    snapshot.schema(self.table.metadata())?
} else {
    self.table.metadata().current_schema().clone()
};

  • Explicit snapshot_id (time-travel): keep the snapshot-time vocabulary. A caller asking "what existed at snapshot N" should see schema N's columns.
  • Default scan (no snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream projection still finds the right on-disk columns.

Both the column-name validation loop and the subsequent field_id_by_name lookup share the same schema variable, so the fix is one assignment.
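For readers who want to see the intended behaviour in isolation, here is a std-only toy model of the branch. It is only a sketch under simplifying assumptions: `Schema`, `Snapshot`, `TableMetadata`, and `schema_for_scan` are hypothetical stand-ins defined in the example, not the iceberg-rust types, and a schema is reduced to a list of column names.

```rust
use std::collections::HashMap;

/// Toy schema: an ID plus the column names it defines.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub schema_id: i32,
    pub field_names: Vec<String>,
}

/// Toy snapshot: remembers the schema it was written under.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub snapshot_id: i64,
    pub schema_id: i32,
}

/// Toy table metadata: all known schemas and snapshots, plus the ID of the
/// schema that is current right now.
pub struct TableMetadata {
    pub current_schema_id: i32,
    pub schemas: HashMap<i32, Schema>,
    pub snapshots: HashMap<i64, Snapshot>,
}

/// Pick the schema used to validate caller-supplied column names.
/// Returns `None` if the snapshot or schema ID is unknown.
pub fn schema_for_scan(
    meta: &TableMetadata,
    requested_snapshot_id: Option<i64>,
) -> Option<&Schema> {
    match requested_snapshot_id {
        // Time-travel: use the schema the requested snapshot was written under.
        Some(id) => {
            let snapshot = meta.snapshots.get(&id)?;
            meta.schemas.get(&snapshot.schema_id)
        }
        // Default scan: use the table's current schema.
        None => meta.schemas.get(&meta.current_schema_id),
    }
}
```

In this model, a table whose only snapshot was written under schema 0 and whose current schema is 1 (with an added "note" column) accepts "note" for a default scan and rejects it when the old snapshot is requested explicitly.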

Willingness to contribute

I can contribute this independently. I have a working branch with the fix + three regression tests (rename-then-read works, old-name-after-rename errors, time-travel still uses snapshot schema), all 1299 iceberg lib tests passing, clippy + rustfmt clean. PR ready to open once this issue is filed for reference.
