fix: use name-mapped field IDs for projection and predicate pushdown #2612

Open: viirya wants to merge 1 commit into apache:main from viirya:fix/2403-name-mapping-projection

viirya (Member) commented Jun 10, 2026

Which issue does this PR close?

Closes #2403.

What changes are included in this PR?

When a Parquet file lacks embedded field IDs but a name mapping (schema.name-mapping.default) is available, the reader applied the mapping to the Arrow schema but still planned column projection and predicate pushdown with the position-based fallback (field id N → column N-1). As a result:

  • Projected columns whose mapped field IDs don't line up with their physical positions were silently read as all-NULL.
  • Predicates were evaluated against the wrong physical columns, silently returning wrong rows. Row-group pruning and row selection were driven by the same wrong columns, so they pruned incorrectly as well.

This contradicts Java's ReadConf, which uses a three-branch strategy: embedded IDs → field-id projection; name mapping → apply mapping, then field-id projection; neither → position fallback. Our code applied the name mapping to the Arrow schema but then took the fallback branch anyway.
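The three-branch strategy can be sketched as a small decision function. This is an illustration only: ColumnResolution and choose_resolution are hypothetical names, not part of the crate or of Java's ReadConf; use_position_fallback mirrors the expression described under Changes.

```rust
/// How a reader resolves Iceberg field IDs to physical Parquet columns.
/// Hypothetical enum, used here only to name the three branches.
#[derive(Debug, PartialEq, Eq)]
enum ColumnResolution {
    /// The file carries field IDs: project by them directly.
    EmbeddedFieldIds,
    /// No embedded IDs, but a name mapping assigned IDs: project by those.
    NameMappedFieldIds,
    /// Neither is available: assume field id N lives in column N-1.
    PositionFallback,
}

/// Pick a branch from the two facts the reader knows about a file.
fn choose_resolution(missing_field_ids: bool, has_name_mapping: bool) -> ColumnResolution {
    if !missing_field_ids {
        ColumnResolution::EmbeddedFieldIds
    } else if has_name_mapping {
        ColumnResolution::NameMappedFieldIds
    } else {
        ColumnResolution::PositionFallback
    }
}

/// The flag from the PR description: fall back to positions only when
/// the file has no embedded field IDs AND no name mapping is available.
fn use_position_fallback(missing_field_ids: bool, has_name_mapping: bool) -> bool {
    missing_field_ids && !has_name_mapping
}
```

The bug amounted to treating missing_field_ids alone as the fallback condition, which collapses the second branch into the third.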

Changes:

  • pipeline.rs: compute use_position_fallback = missing_field_ids && task.name_mapping.is_none() and pass it (instead of missing_field_ids) to both get_arrow_projection_mask and build_field_id_set_and_map, so position fallback only applies when there are no embedded field IDs and no name mapping.
  • projection.rs: build_field_id_set_and_map now distinguishes the name-mapping case. When the Parquet descriptor has no embedded field IDs but a name mapping assigned IDs to the Arrow schema, it builds the field-id → leaf-column map from the Arrow schema's PARQUET:field_id metadata (new build_field_id_map_from_arrow_schema). This relies on Arrow leaf ordering matching Parquet leaf column ordering (both depth-first), the same invariant get_arrow_projection_mask already relies on for ProjectionMask::leaves.
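A minimal sketch of the depth-first leaf numbering that the projection.rs change depends on. It is not the code in this PR: Field is a simplified, hypothetical stand-in for an Arrow field (std only, no arrow-rs types), and field_id_to_leaf_index is an illustrative name. Only the PARQUET:field_id metadata key is taken from the description above.

```rust
use std::collections::HashMap;

/// Metadata key that carries the field ID assigned by the name mapping.
const FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Hypothetical, simplified stand-in for an Arrow field.
struct Field {
    metadata: HashMap<String, String>,
    /// Empty for a leaf (primitive) column, non-empty for a nested type.
    children: Vec<Field>,
}

impl Field {
    fn new(field_id: Option<i32>, children: Vec<Field>) -> Self {
        let mut metadata = HashMap::new();
        if let Some(id) = field_id {
            metadata.insert(FIELD_ID_KEY.to_string(), id.to_string());
        }
        Field { metadata, children }
    }
}

/// Walk the schema depth-first, number leaves in visit order, and record
/// field id -> leaf column index for every leaf that carries an ID.
fn field_id_to_leaf_index(fields: &[Field]) -> HashMap<i32, usize> {
    fn visit(field: &Field, next_leaf: &mut usize, out: &mut HashMap<i32, usize>) {
        if field.children.is_empty() {
            let id = field
                .metadata
                .get(FIELD_ID_KEY)
                .and_then(|v| v.parse::<i32>().ok());
            if let Some(id) = id {
                out.insert(id, *next_leaf);
            }
            // Every leaf consumes an index, with or without an ID.
            *next_leaf += 1;
        } else {
            for child in &field.children {
                visit(child, next_leaf, out);
            }
        }
    }

    let mut out = HashMap::new();
    let mut next_leaf = 0;
    for field in fields {
        visit(field, &mut next_leaf, &mut out);
    }
    out
}
```

Because the leaf counter advances for leaves without an ID, an unmapped column still occupies its physical slot. Without that, every later column would shift by one and the map would no longer line up with the Parquet leaf order.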

Are these changes tested?

Two new regression tests in projection.rs, covering both broken paths:

  • test_read_parquet_with_name_mapping_uses_mapped_field_ids: a file without field IDs whose columns [name, subdept] map to non-contiguous field IDs (2, 4). On main, projection NULL-fills both columns; with this fix, the values are read correctly.
  • test_predicate_on_name_mapped_file_uses_mapped_field_ids: a predicate name = "Alice" on the same file shape, with row-group filtering and row selection enabled. On main, the predicate is evaluated against the wrong physical column and returns 2 rows; with this fix, it returns the single correct row.
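For the file shape used by these tests (physical columns [name, subdept] mapped to field IDs 2 and 4), the two resolution rules give different column indices. The sketch below is plain arithmetic on that shape, not the reader's code: both function names are hypothetical, and returning None for an out-of-range position is an assumption about how a miss would be represented.

```rust
/// Position-based fallback as described in the PR: field id N -> column N-1.
/// Returns None when that column does not exist in the file (assumption).
fn resolve_by_position(field_id: i32, num_columns: usize) -> Option<usize> {
    if field_id < 1 {
        return None;
    }
    let idx = (field_id - 1) as usize;
    if idx < num_columns {
        Some(idx)
    } else {
        None
    }
}

/// Resolution through name-mapped IDs: `mapped_ids[i]` is the field ID the
/// name mapping assigned to physical column i.
fn resolve_by_mapped_id(field_id: i32, mapped_ids: &[i32]) -> Option<usize> {
    mapped_ids.iter().position(|&id| id == field_id)
}
```

With mapped_ids = [2, 4], resolve_by_position(2, 2) yields column 1, which is subdept rather than name, and resolve_by_position(4, 2) finds no column at all. resolve_by_mapped_id returns columns 0 and 1 for the same two IDs. A predicate on name resolved by position would therefore be evaluated against subdept.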

Both tests fail on main (verified by reverting the pipeline.rs strategy line) and pass with the fix. The full arrow:: unit-test suite (97 tests) passes.

When a Parquet file lacks embedded field IDs but a name mapping
(schema.name-mapping.default) is available, the reader applied the
mapping to the Arrow schema yet still planned column projection and
predicate pushdown with the position-based fallback (field id N ->
column N-1). Columns whose mapped field IDs do not line up with their
physical positions were read as NULL, and predicates were evaluated
against the wrong physical columns, silently returning wrong rows.

Position fallback now applies only when the file has no embedded field
IDs AND no name mapping is available, matching Java's three-branch
ReadConf strategy. With a name mapping, projection and predicate
planning resolve columns through the field IDs the mapping assigned to
the Arrow schema.

Closes apache#2403
Linked issue: BUG: Parquet name mapping still uses position fallback for projection and predicate pushdown