
[SEDONA-729] Add _metadata hidden column support for shapefile DataSource V2 reader #2653

Merged
jiayuasu merged 2 commits into master from fix/SEDONA-729-shapefile-metadata-columns
Feb 15, 2026

Conversation

@jiayuasu (Member) commented Feb 15, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

This PR fixes SEDONA-729.

What changes were proposed in this PR?

When reading shapefiles via the DataSource V2 API, the standard _metadata hidden column (containing file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) was missing from the DataFrame. This is because ShapefileTable did not implement Spark's SupportsMetadataColumns interface.
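The wiring is small in principle: the table implementation mixes in the interface and advertises the `_metadata` struct. A minimal sketch of that shape, using Spark's public connector API (the actual `ShapefileTable` has many more members; `ShapefileMetadataSupport` is an illustrative name, not one from the PR):

```scala
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types._

// Sketch only: shows the SupportsMetadataColumns wiring described in the PR.
trait ShapefileMetadataSupport extends SupportsMetadataColumns {
  // The standard six-field struct used by Spark's file-based sources.
  private val metadataStruct: StructType = StructType(Seq(
    StructField("file_path", StringType, nullable = false),
    StructField("file_name", StringType, nullable = false),
    StructField("file_size", LongType, nullable = false),
    StructField("file_block_start", LongType, nullable = false),
    StructField("file_block_length", LongType, nullable = false),
    StructField("file_modification_time", TimestampType, nullable = false)))

  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name: String = "_metadata"
      override def dataType: DataType = metadataStruct
      override def isNullable: Boolean = false
    })
}
```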

This PR implements _metadata support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

  1. ShapefileTable — Mixes in SupportsMetadataColumns and defines the _metadata MetadataColumn with the standard six-field struct type.
  2. ShapefileScanBuilder — Overrides pruneColumns() to capture the pruned metadata schema requested by Spark's column pruning optimizer.
  3. ShapefileScan — Accepts the metadataSchema parameter, overrides readSchema() to append metadata fields to the output schema, and passes the schema to the partition reader factory.
  4. ShapefilePartitionReaderFactory — Constructs metadata values (path, name, size, block offset/length, modification time) from the .shp PartitionedFile, and wraps the base reader in a PartitionReaderWithMetadata that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection. Correctly handles Spark's struct pruning by building only the requested sub-fields.
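The wrapper in item 4 can be sketched as follows. This is a simplified illustration of the technique (a constant per-file metadata row joined onto each data row, then flattened with an unsafe projection); parameter names and the projection construction are assumptions, not the exact code in the PR:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

// Sketch of the PartitionReaderWithMetadata wrapper described above.
class PartitionReaderWithMetadata(
    baseReader: PartitionReader[InternalRow],
    dataSchema: StructType,
    metadataSchema: StructType,
    metadataRow: InternalRow) // constant per file, built once from the PartitionedFile
  extends PartitionReader[InternalRow] {

  private val joinedRow = new JoinedRow()

  // Flatten (data ++ metadata) into a single UnsafeRow matching readSchema().
  private val projection = GenerateUnsafeProjection.generate(
    (dataSchema ++ metadataSchema).zipWithIndex.map { case (f, i) =>
      BoundReference(i, f.dataType, f.nullable)
    })

  override def next(): Boolean = baseReader.next()
  override def get(): InternalRow = projection(joinedRow(baseReader.get(), metadataRow))
  override def close(): Unit = baseReader.close()
}
```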

After this change, users can query _metadata on shapefile DataFrames just like Parquet/ORC/CSV:

val df = spark.read.format("shapefile").load("/path/to/shapefiles")
df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()
df.filter($"_metadata.file_name" === "specific.shp").show()

How was this patch tested?

11 new test cases added to ShapefileTests (per Spark version module) covering:

  • Schema validation: _metadata struct contains all 6 expected fields with correct types
  • Hidden column semantics: _metadata does not appear in select(*) but can be explicitly selected
  • Value correctness: file_path, file_name, file_size, file_block_start, file_block_length, and file_modification_time are verified against actual filesystem values using java.io.File APIs
  • Multi-file behavior: metadata values are correct per-file when reading a directory of shapefiles
  • Filtering: _metadata fields can be used in WHERE clauses
  • Projection: _metadata fields can be selected alongside data columns
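The hidden-column semantics check, for example, reduces to two assertions. A hedged sketch of what such a test might look like (the actual assertions in ShapefileTests may differ; `sparkSession` and `shapefileDir` are assumed fixtures):

```scala
val df = sparkSession.read.format("shapefile").load(shapefileDir)

// _metadata is hidden: it does not appear in the default schema / SELECT *
assert(!df.schema.fieldNames.contains("_metadata"))

// ...but it can be selected explicitly, exposing the six standard sub-fields
val meta = df.select("_metadata.*")
assert(meta.schema.fieldNames.sorted.sameElements(Array(
  "file_block_length", "file_block_start", "file_modification_time",
  "file_name", "file_path", "file_size")))
```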

All tests pass on all four Spark versions:

  • spark-3.4 (Scala 2.12): 53 tests passed
  • spark-3.5 (Scala 2.12): 33 tests passed
  • spark-4.0 (Scala 2.13): 33 tests passed
  • spark-4.1 (Scala 2.13): 33 tests passed

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation. The _metadata column is a standard Spark hidden column that is automatically available to users — no Sedona-specific API changes are introduced.

…urce V2 reader

Implement SupportsMetadataColumns on ShapefileTable so that reading
shapefiles into a DataFrame exposes the standard _metadata hidden struct
containing file_path, file_name, file_size, file_block_start,
file_block_length, and file_modification_time.

Changes across all four Spark version modules (3.4, 3.5, 4.0, 4.1):

- ShapefileTable: mix in SupportsMetadataColumns, define the _metadata
  MetadataColumn with the standard six-field struct type
- ShapefileScanBuilder: override pruneColumns() to capture the pruned
  metadata schema requested by Spark's column pruning optimizer
- ShapefileScan: accept metadataSchema, override readSchema() to append
  metadata fields, pass schema to partition reader factory
- ShapefilePartitionReaderFactory: construct metadata values from the
  .shp PartitionedFile, wrap the base reader in PartitionReaderWithMetadata
  that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection.
Copilot AI (Contributor) left a comment
Pull request overview

This PR implements support for Spark's standard _metadata hidden column in the Shapefile DataSource V2 reader. The _metadata column provides file-level information (path, name, size, block offsets, modification time) and is consistent with other Spark file-based data sources like Parquet and ORC.

Changes:

  • Implemented SupportsMetadataColumns interface across ShapefileTable classes to expose the _metadata column
  • Modified scan builder and scan classes to capture and propagate metadata schema through Spark's column pruning pipeline
  • Created PartitionReaderWithMetadata wrapper class to append metadata values to each row using JoinedRow and unsafe projection
  • Added 11 comprehensive test cases per Spark version validating schema, hidden column semantics, value correctness, multi-file behavior, filtering, and projection
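The scan-builder change in the second bullet amounts to splitting the schema Spark requests during column pruning into data fields and metadata fields. A sketch under assumed names (the real ShapefileScanBuilder takes more parameters and builds a concrete scan):

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Illustrative sketch of capturing the pruned metadata schema.
class ShapefileScanBuilderSketch(dataSchema: StructType)
  extends ScanBuilder with SupportsPushDownRequiredColumns {

  private var requiredDataSchema: StructType = dataSchema
  private var metadataSchema: StructType = new StructType()

  override def pruneColumns(requiredSchema: StructType): Unit = {
    // Spark appends the requested metadata fields (possibly a pruned
    // sub-struct of _metadata) to the required schema it pushes down.
    val (meta, data) = requiredSchema.fields.partition(_.name == "_metadata")
    requiredDataSchema = StructType(data)
    metadataSchema = StructType(meta)
  }

  override def build(): Scan = ??? // would pass both schemas on to the scan
}
```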

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.

Show a summary per file
  • spark/spark-{3.4,3.5,4.0,4.1}/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala — Added 11 test cases for _metadata column functionality; removed unrelated partition filter test; changed var to val; added unused Row import
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileTable.scala — Implemented SupportsMetadataColumns interface and defined _metadata column schema with 6 standard fields
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileScanBuilder.scala — Added pruneColumns override to capture metadata schema from Spark's column pruning optimizer
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileScan.scala — Added metadataSchema parameter and overrode readSchema to include metadata fields
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefilePartitionReaderFactory.scala — Implemented metadata value construction and PartitionReaderWithMetadata wrapper class


…on filter tests

- Remove unused org.apache.spark.sql.Row import from ShapefileTests in all 4
  Spark versions (3.4, 3.5, 4.0, 4.1)
- Restore accidentally removed partition filter test code in spark-3.5, 4.0,
  and 4.1 (use val filteredRows instead of var rows reassignment)
jiayuasu added this to the sedona-1.9.0 milestone Feb 15, 2026
jiayuasu merged commit 0e52724 into master Feb 15, 2026
40 checks passed