
[SEDONA-729] Add _metadata hidden column support for shapefile DataSource V2 reader #2653

Merged
jiayuasu merged 2 commits into master from fix/SEDONA-729-shapefile-metadata-columns
Feb 15, 2026

Conversation

@jiayuasu (Member) commented Feb 15, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

This PR fixes SEDONA-729.

What changes were proposed in this PR?

When reading shapefiles via the DataSource V2 API, the standard _metadata hidden column (containing file_path, file_name, file_size, file_block_start, file_block_length, file_modification_time) was missing from the DataFrame. This is because ShapefileTable did not implement Spark's SupportsMetadataColumns interface.
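The wiring is small in principle: the table implementation mixes in the interface and advertises the `_metadata` struct. A minimal sketch of that shape, using Spark's public connector API (the actual `ShapefileTable` has many more members; `ShapefileMetadataSupport` is an illustrative name, not one from the PR):

```scala
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types._

// Sketch only: shows the SupportsMetadataColumns wiring described in the PR.
trait ShapefileMetadataSupport extends SupportsMetadataColumns {
  // The standard six-field struct used by Spark's file-based sources.
  private val metadataStruct: StructType = StructType(Seq(
    StructField("file_path", StringType, nullable = false),
    StructField("file_name", StringType, nullable = false),
    StructField("file_size", LongType, nullable = false),
    StructField("file_block_start", LongType, nullable = false),
    StructField("file_block_length", LongType, nullable = false),
    StructField("file_modification_time", TimestampType, nullable = false)))

  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name: String = "_metadata"
      override def dataType: DataType = metadataStruct
      override def isNullable: Boolean = false
    })
}
```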

This PR implements _metadata support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

  1. ShapefileTable — Mixes in SupportsMetadataColumns and defines the _metadata MetadataColumn with the standard six-field struct type.
  2. ShapefileScanBuilder — Overrides pruneColumns() to capture the pruned metadata schema requested by Spark's column pruning optimizer.
  3. ShapefileScan — Accepts the metadataSchema parameter, overrides readSchema() to append metadata fields to the output schema, and passes the schema to the partition reader factory.
  4. ShapefilePartitionReaderFactory — Constructs metadata values (path, name, size, block offset/length, modification time) from the .shp PartitionedFile, and wraps the base reader in a PartitionReaderWithMetadata that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection. Correctly handles Spark's struct pruning by building only the requested sub-fields.
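The wrapper in item 4 can be sketched as follows. This is a simplified illustration of the technique (a constant per-file metadata row joined onto each data row, then flattened with an unsafe projection); parameter names and the projection construction are assumptions, not the exact code in the PR:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

// Sketch of the PartitionReaderWithMetadata wrapper described above.
class PartitionReaderWithMetadata(
    baseReader: PartitionReader[InternalRow],
    dataSchema: StructType,
    metadataSchema: StructType,
    metadataRow: InternalRow) // constant per file, built once from the PartitionedFile
  extends PartitionReader[InternalRow] {

  private val joinedRow = new JoinedRow()

  // Flatten (data ++ metadata) into a single UnsafeRow matching readSchema().
  private val projection = GenerateUnsafeProjection.generate(
    (dataSchema ++ metadataSchema).zipWithIndex.map { case (f, i) =>
      BoundReference(i, f.dataType, f.nullable)
    })

  override def next(): Boolean = baseReader.next()
  override def get(): InternalRow = projection(joinedRow(baseReader.get(), metadataRow))
  override def close(): Unit = baseReader.close()
}
```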

After this change, users can query _metadata on shapefile DataFrames just like Parquet/ORC/CSV:

val df = spark.read.format("shapefile").load("/path/to/shapefiles")
df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()
df.filter($"_metadata.file_name" === "specific.shp").show()

How was this patch tested?

11 new test cases added to ShapefileTests (per Spark version module) covering:

  • Schema validation: _metadata struct contains all 6 expected fields with correct types
  • Hidden column semantics: _metadata does not appear in select(*) but can be explicitly selected
  • Value correctness: file_path, file_name, file_size, file_block_start, file_block_length, and file_modification_time are verified against actual filesystem values using java.io.File APIs
  • Multi-file behavior: metadata values are correct per-file when reading a directory of shapefiles
  • Filtering: _metadata fields can be used in WHERE clauses
  • Projection: _metadata fields can be selected alongside data columns
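The hidden-column semantics check, for example, reduces to two assertions. A hedged sketch of what such a test might look like (the actual assertions in ShapefileTests may differ; `sparkSession` and `shapefileDir` are assumed fixtures):

```scala
val df = sparkSession.read.format("shapefile").load(shapefileDir)

// _metadata is hidden: it does not appear in the default schema / SELECT *
assert(!df.schema.fieldNames.contains("_metadata"))

// ...but it can be selected explicitly, exposing the six standard sub-fields
val meta = df.select("_metadata.*")
assert(meta.schema.fieldNames.sorted.sameElements(Array(
  "file_block_length", "file_block_start", "file_modification_time",
  "file_name", "file_path", "file_size")))
```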

All tests pass on all four Spark versions:

  • spark-3.4 (Scala 2.12): 53 tests passed
  • spark-3.5 (Scala 2.12): 33 tests passed
  • spark-4.0 (Scala 2.13): 33 tests passed
  • spark-4.1 (Scala 2.13): 33 tests passed

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation. The _metadata column is a standard Spark hidden column that is automatically available to users — no Sedona-specific API changes are introduced.

…urce V2 reader

Implement SupportsMetadataColumns on ShapefileTable so that reading
shapefiles into a DataFrame exposes the standard _metadata hidden struct
containing file_path, file_name, file_size, file_block_start,
file_block_length, and file_modification_time.

Changes across all four Spark version modules (3.4, 3.5, 4.0, 4.1):

- ShapefileTable: mix in SupportsMetadataColumns, define the _metadata
  MetadataColumn with the standard six-field struct type
- ShapefileScanBuilder: override pruneColumns() to capture the pruned
  metadata schema requested by Spark's column pruning optimizer
- ShapefileScan: accept metadataSchema, override readSchema() to append
  metadata fields, pass schema to partition reader factory
- ShapefilePartitionReaderFactory: construct metadata values from the
  .shp PartitionedFile, wrap the base reader in PartitionReaderWithMetadata
  that joins data rows with metadata using JoinedRow + GenerateUnsafeProjection.
Copilot AI (Contributor) left a comment
Pull request overview

This PR implements support for Spark's standard _metadata hidden column in the Shapefile DataSource V2 reader. The _metadata column provides file-level information (path, name, size, block offsets, modification time) and is consistent with other Spark file-based data sources like Parquet and ORC.

Changes:

  • Implemented SupportsMetadataColumns interface across ShapefileTable classes to expose the _metadata column
  • Modified scan builder and scan classes to capture and propagate metadata schema through Spark's column pruning pipeline
  • Created PartitionReaderWithMetadata wrapper class to append metadata values to each row using JoinedRow and unsafe projection
  • Added 11 comprehensive test cases per Spark version validating schema, hidden column semantics, value correctness, multi-file behavior, filtering, and projection
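The scan-builder change in the second bullet amounts to splitting the schema Spark requests during column pruning into data fields and metadata fields. A sketch under assumed names (the real ShapefileScanBuilder takes more parameters and builds a concrete scan):

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Illustrative sketch of capturing the pruned metadata schema.
class ShapefileScanBuilderSketch(dataSchema: StructType)
  extends ScanBuilder with SupportsPushDownRequiredColumns {

  private var requiredDataSchema: StructType = dataSchema
  private var metadataSchema: StructType = new StructType()

  override def pruneColumns(requiredSchema: StructType): Unit = {
    // Spark appends the requested metadata fields (possibly a pruned
    // sub-struct of _metadata) to the required schema it pushes down.
    val (meta, data) = requiredSchema.fields.partition(_.name == "_metadata")
    requiredDataSchema = StructType(data)
    metadataSchema = StructType(meta)
  }

  override def build(): Scan = ??? // would pass both schemas on to the scan
}
```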

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.

Show a summary per file
  • spark/spark-{3.4,3.5,4.0,4.1}/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala — Added 11 test cases for _metadata column functionality; removed unrelated partition filter test; changed var to val; added unused Row import
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileTable.scala — Implemented SupportsMetadataColumns interface and defined _metadata column schema with 6 standard fields
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileScanBuilder.scala — Added pruneColumns override to capture metadata schema from Spark's column pruning optimizer
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileScan.scala — Added metadataSchema parameter and overrode readSchema to include metadata fields
  • spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefilePartitionReaderFactory.scala — Implemented metadata value construction and PartitionReaderWithMetadata wrapper class


…on filter tests

- Remove unused org.apache.spark.sql.Row import from ShapefileTests in all 4
  Spark versions (3.4, 3.5, 4.0, 4.1)
- Restore accidentally removed partition filter test code in spark-3.5, 4.0,
  and 4.1 (use val filteredRows instead of var rows reassignment)
jiayuasu added this to the sedona-1.9.0 milestone Feb 15, 2026
jiayuasu merged commit 0e52724 into master Feb 15, 2026
40 checks passed