Conversation
…urce V2 reader

Implement SupportsMetadataColumns on ShapefileTable so that reading shapefiles into a DataFrame exposes the standard _metadata hidden struct containing file_path, file_name, file_size, file_block_start, file_block_length, and file_modification_time.

Changes across all four Spark version modules (3.4, 3.5, 4.0, 4.1):

- ShapefileTable: mix in SupportsMetadataColumns, define the _metadata MetadataColumn with the standard six-field struct type
- ShapefileScanBuilder: override pruneColumns() to capture the pruned metadata schema requested by Spark's column pruning optimizer
- ShapefileScan: accept metadataSchema, override readSchema() to append metadata fields, pass schema to partition reader factory
- ShapefilePartitionReaderFactory: construct metadata values from the .shp PartitionedFile, wrap the base reader in PartitionReaderWithMetadata that joins data rows with metadata using JoinedRow and unsafe projection
Contributor
Pull request overview
This PR implements support for Spark's standard _metadata hidden column in the Shapefile DataSource V2 reader. The _metadata column provides file-level information (path, name, size, block offsets, modification time) and is consistent with other Spark file-based data sources like Parquet and ORC.
Changes:

- Implemented the `SupportsMetadataColumns` interface across `ShapefileTable` classes to expose the `_metadata` column
- Modified scan builder and scan classes to capture and propagate metadata schema through Spark's column pruning pipeline
- Created a `PartitionReaderWithMetadata` wrapper class to append metadata values to each row using `JoinedRow` and unsafe projection (see the sketch after this list)
- Added 11 comprehensive test cases per Spark version validating schema, hidden column semantics, value correctness, multi-file behavior, filtering, and projection
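For readers unfamiliar with the `JoinedRow` + unsafe-projection pattern mentioned above, a minimal sketch of such a wrapper is shown below. It is illustrative only, not the code from this PR: the class and parameter names (`PartitionReaderWithMetadataSketch`, `metadataRow`, etc.) are placeholders, and the actual implementation also handles pruning of the requested `_metadata` sub-fields.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

// Sketch: wraps a base reader and appends a fixed per-file metadata row to every data row.
class PartitionReaderWithMetadataSketch(
    base: PartitionReader[InternalRow],
    dataSchema: StructType,
    metadataSchema: StructType,
    metadataRow: InternalRow)
  extends PartitionReader[InternalRow] {

  private val joined = new JoinedRow()

  // Flatten (data ++ metadata) into a single UnsafeRow via a generated projection.
  private val fullSchema = StructType(dataSchema.fields ++ metadataSchema.fields)
  private val projection = GenerateUnsafeProjection.generate(
    fullSchema.fields.toSeq.zipWithIndex.map { case (f, i) =>
      BoundReference(i, f.dataType, f.nullable)
    })

  override def next(): Boolean = base.next()

  override def get(): InternalRow = projection(joined(base.get(), metadataRow))

  override def close(): Unit = base.close()
}
```

Because the projection flattens the joined row into one `UnsafeRow`, downstream operators see an ordinary row containing the data columns followed by the requested metadata fields.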
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| spark/spark-{3.4,3.5,4.0,4.1}/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Added 11 test cases for _metadata column functionality; removed unrelated partition filter test; changed var to val; added unused Row import |
| spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileTable.scala | Implemented SupportsMetadataColumns interface and defined _metadata column schema with 6 standard fields |
| spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileScanBuilder.scala | Added pruneColumns override to capture metadata schema from Spark's column pruning optimizer |
| spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefileScan.scala | Added metadataSchema parameter and overrode readSchema to include metadata fields |
| spark/spark-{3.4,3.5,4.0,4.1}/src/main/scala/.../ShapefilePartitionReaderFactory.scala | Implemented metadata value construction and PartitionReaderWithMetadata wrapper class |
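To make the `ShapefileTable` row above concrete: a DataSource V2 table exposes metadata columns by implementing `SupportsMetadataColumns` and returning a `MetadataColumn` whose type is the standard six-field struct. The sketch below is a hand-written illustration of that shape; the trait name and nullability choices are assumptions, not the PR's exact code.

```scala
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types._

// Sketch: the standard six-field _metadata struct exposed through SupportsMetadataColumns.
trait ShapefileMetadataColumnsSketch extends SupportsMetadataColumns {

  private val metadataStruct: StructType = new StructType()
    .add("file_path", StringType, nullable = false)
    .add("file_name", StringType, nullable = false)
    .add("file_size", LongType, nullable = false)
    .add("file_block_start", LongType, nullable = false)
    .add("file_block_length", LongType, nullable = false)
    .add("file_modification_time", TimestampType, nullable = false)

  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name(): String = "_metadata"
      override def dataType(): DataType = metadataStruct
    })
}
```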
…on filter tests

- Remove unused org.apache.spark.sql.Row import from ShapefileTests in all 4 Spark versions (3.4, 3.5, 4.0, 4.1)
- Restore accidentally removed partition filter test code in spark-3.5, 4.0, and 4.1 (use val filteredRows instead of var rows reassignment)
Did you read the Contributor Guide?
Is this PR related to a ticket?
Yes, this PR fixes SEDONA-729.
What changes were proposed in this PR?
When reading shapefiles via the DataSource V2 API, the standard `_metadata` hidden column (containing `file_path`, `file_name`, `file_size`, `file_block_start`, `file_block_length`, `file_modification_time`) was missing from the DataFrame. This is because `ShapefileTable` did not implement Spark's `SupportsMetadataColumns` interface.

This PR implements `_metadata` support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

- `ShapefileTable` mixes in `SupportsMetadataColumns` and defines the `_metadata` `MetadataColumn` with the standard six-field struct type.
- `ShapefileScanBuilder` overrides `pruneColumns()` to capture the pruned metadata schema requested by Spark's column pruning optimizer (a rough sketch of this and the scan side follows this list).
- `ShapefileScan` accepts a `metadataSchema` parameter, overrides `readSchema()` to append metadata fields to the output schema, and passes the schema to the partition reader factory.
- `ShapefilePartitionReaderFactory` constructs metadata values from the `.shp` `PartitionedFile`, and wraps the base reader in a `PartitionReaderWithMetadata` that joins data rows with metadata using `JoinedRow` + `GenerateUnsafeProjection`. It correctly handles Spark's struct pruning by building only the requested sub-fields.
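The scan-builder and scan pieces can be illustrated roughly as follows. This is a simplified sketch under the assumption that the pruned schema handed to `pruneColumns()` includes the requested `_metadata` struct alongside the data columns; the names `requiredDataSchema`/`requiredMetadataSchema` and the abstract-class framing are placeholders rather than the PR's actual code.

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Sketch: split the pruned schema into data fields and the requested _metadata fields.
abstract class ShapefileScanBuilderSketch(fullDataSchema: StructType)
  extends SupportsPushDownRequiredColumns {

  protected var requiredDataSchema: StructType = fullDataSchema
  protected var requiredMetadataSchema: StructType = new StructType()

  override def pruneColumns(requiredSchema: StructType): Unit = {
    // Spark may request only a subset of _metadata's sub-fields (struct pruning).
    val (meta, data) = requiredSchema.fields.partition(_.name == "_metadata")
    requiredDataSchema = StructType(data)
    requiredMetadataSchema = StructType(meta)
  }
}

// Sketch of the scan side: the read schema is data columns plus any requested metadata fields.
abstract class ShapefileScanSketch(
    dataSchema: StructType,
    metadataSchema: StructType) extends Scan {

  override def readSchema(): StructType =
    StructType(dataSchema.fields ++ metadataSchema.fields)
}
```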
After this change, users can query `_metadata` on shapefile DataFrames just like Parquet/ORC/CSV:
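For example, usage might look like the following. This snippet is illustrative only: the `geometry` column, the directory path, and the file name are assumptions, not values from this PR.

```scala
// Read a directory of shapefiles with the Sedona shapefile reader.
val df = sparkSession.read.format("shapefile").load("/data/shapefiles")

// _metadata is hidden from SELECT * ...
df.printSchema()  // no _metadata field listed

// ... but available when selected explicitly:
df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()

// ... and usable in filters:
df.filter("_metadata.file_name = 'cities.shp'").count()
```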
How was this patch tested?

11 new test cases added to `ShapefileTests` (per Spark version module) covering:

- the `_metadata` struct contains all 6 expected fields with correct types
- `_metadata` does not appear in `select(*)` but can be explicitly selected
- `file_path`, `file_name`, `file_size`, `file_block_start`, `file_block_length`, and `file_modification_time` are verified against actual filesystem values using `java.io.File` APIs (see the sketch after this list)
- `_metadata` fields can be used in `WHERE` clauses
- `_metadata` fields can be selected alongside data columns

All tests pass on all four Spark versions.
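One of the filesystem-value checks might look roughly like this. It is a hypothetical fragment, not the actual test code: `sparkSession` and a local directory `dir` containing a single shapefile are assumed to exist, and plain `assert` stands in for the suite's matchers.

```scala
import java.io.File

// Read the shapefile and compare _metadata values against java.io.File attributes.
val df = sparkSession.read.format("shapefile").load(dir)
assert(!df.columns.contains("_metadata"))  // hidden from select(*)

val row = df.select("_metadata.file_name", "_metadata.file_size").head()
val shp = new File(dir).listFiles().filter(_.getName.endsWith(".shp")).head
assert(row.getString(0) == shp.getName)
assert(row.getLong(1) == shp.length())
```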
Did this PR include necessary documentation updates?
No. The `_metadata` column is a standard Spark hidden column that is automatically available to users; no Sedona-specific API changes are introduced.