PARQUET-2471: Add geometry logical type #240
base: master
Conversation
@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while and many geospatial tools have started to support reading and writing it. Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?
Geo Iceberg does not need to conform to GeoParquet because people should not directly use a Parquet reader to read Iceberg Parquet files anyway. So that's a separate story.
@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here could meet the requirements of both Iceberg and GeoParquet, or share the common parts to make conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make that possible.
From the Iceberg side, I am excited about this. I think defining the type formally in parquet-format will make geospatial interop easier in the long run, and it also unlocks row group filtering — for example, Iceberg's add_file for Parquet files. Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.
I'm new on the Parquet side, so I had some questions.
src/main/thrift/parquet.thrift (outdated)
 * For now only OGC:CRS84 is supported.
 */
3: optional string crs;
4: required Edges edges;
If we support Edges=spherical, looks like we need to support 'orientation' in order to correctly interpret polygons. See: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6zbI
Let me take a look.
in order to correctly interpret polygons
Technically that only applies to polygons that cover more than half the earth. As long as your polygons fit within a hemisphere (as most polygons do), you don't need to know if they are oriented (although it is faster to import them if you do know this in advance).
Let me add an orientation enum field with two values: counterclockwise and clockwise.
We had a very extensive discussion in GeoParquet to get to our choice of 'If present must be "counterclockwise"; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order.' See this comment: opengeospatial/geoparquet#46 (comment) from the main author of GDAL, and indeed the entire discussion on that issue.
Basically there aren't any existing specs that give two options, so we didn't want to be the first, but we structured it so that 'clockwise' could be added later.
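The winding-order discussion above can be made concrete with the shoelace formula: a planar ring's signed area is positive when its vertices run counterclockwise. A minimal sketch (hypothetical helper names; planar coordinates only, so it says nothing about spherical edges):

```python
def signed_area(ring):
    """Shoelace formula over a closed ring given as (x, y) tuples
    (without a repeated closing vertex). Positive => counterclockwise."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2.0

def is_counterclockwise(ring):
    return signed_area(ring) > 0

# A unit square traversed counterclockwise
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Under the GeoParquet rule quoted above, a writer asserting orientation would ensure exterior rings satisfy `is_counterclockwise` and interior rings the opposite.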
src/main/thrift/parquet.thrift (outdated)
 * The dimension of the geometry.
 * For now only 2D geometry is supported and the value must be 2 if set.
 */
2: optional byte dimension;
Not sure if it's too ambitious, as a bounding box would not work for 3D.
Perhaps we can define a different ColumnOrder for 3D? Any good zone-map candidate to use for 3D?
Yea maybe it is possible here, just min (x,y,z) and max (x,y,z) coordinates. cc @jiayuasu to check my understanding
dimension (in the singular) typically refers to the point (0), line (1) or polygon (2)-ness of a feature. I would recommend using an enum here to cover all the options (xy, xyz, xym, xyzm).
+1 to enum.
There is xy, xyz, xym, xyzm: https://postgis.net/docs/ST_NDims.html
And ST_Dimension usually refers to point, line, polygon: https://postgis.net/docs/ST_Dimension.html
@szehon-ho what do you mean by bbox would not work for 3D?
If the 3D you mean is xyz, then a 3D bbox should work.
If the 3D you mean is spherical, then the coordinates are still 2D and a 2D bbox will not work.
Yea sorry, hence my second comment pulling you in for your expertise :)
If the dimension is xyz, how to represent the 3D bbox? Is it still enough to use the (x_min, y_min, z_min) and (x_max, y_max, z_max) pair? For xym and xyzm, do we actually need m_min, m_max in the bbox?
I would probably expect the bounding box to include M if this dimension value were set to XYM or XYZM (although it is probably rarely if ever used). There is some precedent for a bounding box having a dimensions value independent of the data (http://www.geopackage.org/spec/#gpb_format ... see "envelope contents indicator").
Also probably worth mentioning in the comment here that if this value is set, the values in the column must have those dimensions (I assume that was the intention of making it optional).
src/main/thrift/parquet.thrift (outdated)
 * arrays do not include a length prefix.
 * Values are encoded using PLAIN encoding, except that:
 * 1) variable-length byte arrays do not include a length prefix.
 * 2) geometry logical type with BoundingBoxOrder uses max_value/min_value pair
This is great to have as part of statistics, but I'm trying to wrap my head around how writer implementations in the WKB case can get min_value/max_value; I'm not sure they have the bounding box easily unless they deserialize. I think that's where the advanced geospatial encoding can be more beneficial.
You're right. As you can see, max_value and min_value here are of binary type, which is the serialized form and cannot be consumed directly. AFAIK, the C++ and Java Parquet implementations provide functions to access the deserialized values from the Statistics class. We could also define an easy-to-use bounding box class and get it from the Statistics implementation, but that's not the topic of the specs.
trying to wrap my head around how writer implementations in WKB case can get min_value/max_value
We have two options here:
- If the input data is a well-defined geometry object, collecting the bounding box should be easy, but the interface of the parquet library would be complicated.
- The parquet writer only accepts WKB-encoded binary data; then the writer is required to deserialize values to get the coordinates, which degrades performance.
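To illustrate the deserialization cost of option 2, here is a toy decoder for the simplest WKB case (hypothetical helper names; it handles only 2D Points, ignoring Z/M flags and multi-part geometries, so it is a sketch of the idea rather than a conforming WKB reader):

```python
import struct

def point_xy_from_wkb(buf):
    """Read the (x, y) of a single WKB Point (geometry type code 1).
    Byte 0 is the byte order: 0 = big-endian, 1 = little-endian."""
    fmt = "<" if buf[0] == 1 else ">"
    (geom_type,) = struct.unpack(fmt + "I", buf[1:5])
    if geom_type != 1:
        raise ValueError("this sketch only handles WKB Point")
    return struct.unpack(fmt + "dd", buf[5:21])

def bbox_of_points(wkb_values):
    """What a geometry-aware writer would do per page to fill statistics."""
    xs, ys = zip(*(point_xy_from_wkb(v) for v in wkb_values))
    return min(xs), min(ys), max(xs), max(ys)

def wkb_point(x, y):
    """Little-endian WKB Point encoder, for illustration only."""
    return struct.pack("<BIdd", 1, 1, x, y)
```

A real writer would need to walk every coordinate of every geometry type, which is the per-value parsing overhead the comment above is worried about.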
Yea, I guess option 1 makes sense here. I need to think it through, but in this use case we may not even need advanced 'native' encoding options as long as we can save the bounding box in Parquet stats? cc @jiayuasu
I guess the Parquet Java writer/reader may need to depend on JTS Geometry though, which is the normal in-memory representation, unless we come up with something else here.
Yes, option 1 is more efficient but option 2 might be easier for different Parquet implementations. parquet-mr (the Java implementation of Parquet) will depend on JTS, and it is pretty natural to accept JTS Geometry as input data. However, for other implementations (e.g. parquet-cpp from Arrow C++, or the Rust parquet crate from arrow-rs), perhaps we need to leverage GeoArrow?
cc @pitrou @mapleFU @tustvold @zeroshade @etseidl to get awareness of this.
I think the native encoding is no longer really needed if the writer can compute the bbox itself.
Both options 1 and 2 sound fine to me. And if the performance is not degraded too much, maybe option 2 is even a better choice in terms of adoption?
src/main/thrift/parquet.thrift (outdated)
 * x_max, y_min, and y_max of all coordinates from all geometry values. For simplicity,
 * min_value field in the Statistics/ColumnIndex is encoded as the concatenation of
 * PLAIN-encoded DOUBLE-typed x_min and y_min values. Similarly, max_value field is
 * encoded as the concatenation of PLAIN-encoded DOUBLE-typed x_max and y_max values.
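Since PLAIN encoding of a DOUBLE is its little-endian IEEE 754 bytes, the concatenation described above is just two packed doubles per field. A sketch with hypothetical helper names:

```python
import struct

def encode_bbox_stats(x_min, y_min, x_max, y_max):
    """min_value = PLAIN DOUBLE x_min followed by PLAIN DOUBLE y_min;
    max_value is the same for x_max/y_max, per the quoted comment."""
    return struct.pack("<dd", x_min, y_min), struct.pack("<dd", x_max, y_max)

def decode_bbox_stats(min_value, max_value):
    """Recover (x_min, y_min, x_max, y_max) from the two 16-byte blobs."""
    return struct.unpack("<dd", min_value) + struct.unpack("<dd", max_value)
```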
Are there possibly other encodings that might be more useful here? (I seem to recall that interleaved encoding (i.e. z-order) might be useful for searching, but I'd have to go back and do some reading to confirm.)
CMIW, z-order or Hilbert curves apply to points only. We might need to derive a point (e.g. centroid) from geometries to apply a space-filling curve, but I'm skeptical of its usability compared to bounding boxes.
+1. z-order or Hilbert curves apply to points only, unless we use something like XZ2.
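For reference, the interleaved (z-order/Morton) encoding mentioned above maps a quantized point to a single integer by alternating the bits of its coordinates, so nearby points tend to get nearby codes. A naive sketch (hypothetical helper; production code uses bit-spreading tricks rather than a loop):

```python
def morton_code(x, y, bits=16):
    """Interleave the low `bits` bits of two unsigned ints:
    x occupies the even bit positions, y the odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code
```

As the comments note, this only orders points; lines and polygons would first need to be reduced to a representative point (or use an approach like XZ2).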
An S2 or H3 covering would be useful as a column statistic here...it provides a much better description of a "bounding" on the sphere that readers can leverage to skip reads. I'm not familiar with the limitations here, but can we optionally provide all of these? (S2 and H3 are both a variable-sized list of int64.)
Here the column statistics are an aggregate index for a group of geometry values. For example, column statistics are usually collected from a data page of 10,000 values, where a bbox is more suitable. CMIW, an S2/H3 covering might be more efficient in spatial joins or record-level filtering.
I think the suitability of a bbox vs. a covering is orthogonal to the number of values...a single bounding box is not able to represent things like Alaska or Fiji (which straddle the antimeridian) or Antarctica (which covers a pole). There are workarounds, of course: you can use two bounding boxes in some cases, use some conventions for bounding boxes that cover a pole, specify that if min_lat > max_lat that the bounding box straddles the antimeridian, etc.
The GeoJSON spec uses "southwest" and "northeast" corners to remove the antimeridian bbox ambiguity (https://datatracker.ietf.org/doc/html/rfc7946#section-5).
A somewhat future-proof way to support discrete global grid coverings (e.g., S2, H3) would be to add an optional WKB polygon here (i.e., quite literally a polygon that covers all the features in the row group).
@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.
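The RFC 7946 convention mentioned above (a box whose min_lon exceeds its max_lon crosses the antimeridian) makes containment tests slightly non-obvious; a sketch with a hypothetical helper name:

```python
def bbox_contains_lon(min_lon, max_lon, lon):
    """Longitude-only containment under the RFC 7946 convention:
    min_lon > max_lon means the box straddles the antimeridian."""
    if min_lon <= max_lon:
        return min_lon <= lon <= max_lon
    return lon >= min_lon or lon <= max_lon
```

For example, a box around Fiji running from 177E across the antimeridian to 178W would be stored as (177, -178), and longitude 179 falls inside it.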
I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string. Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information. Edit: this seems to be the approach adopted by GeoParquet as well.
I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.
In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).
There are a few things that this PR solves in a way that might not be possible using EXTENSION, most notably column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree)...I don't think there's any good place for that at the moment.
It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata. The things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the ways that schema metadata are typically propagated through projections and concatenations.
@pitrou Yes, that might be an option. Then we can perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT? EDIT: I think we can remove those informative attributes like
I thought this could be something similar to the page index or bloom filter in Parquet, which are stored somewhere between row groups or before the footer. It could be at the row group level or the file level as well.
I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu
The main reason the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing, which gives reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that. I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic, and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/versions of the Parquet format.
@paleolimbot I'm happy to see the fast evolution of the GeoParquet specs. I don't think the addition of a geometry type aims to replace or deprecate anything from GeoParquet. Instead, GeoParquet can simply ignore the new type for now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in Parquet column metadata, then perhaps we can work together to make that happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping conversion to GeoParquet lightweight.
The bounding-box based sort order defined for the geometry logical type is already good enough for performing row-level and page-level data skipping. Spatial indexes such as R-trees may not be suitable for Parquet. I am aware that flatgeobuf has an optional static packed Hilbert R-tree index, but for that index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to randomly access records within pages.
I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet; it is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view). Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details, which evolve at a rate orthogonal to that of the format (e.g., GeoParquet), and that generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve, it could provide some of that benefit.
I also agree here (though it did come up a lot in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.
I opened up opengeospatial/geoparquet#222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.
Just to ensure my understanding is correct:
To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story.
@jorisvandenbossche I think your concern makes sense. It would be a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have a concern about a new
No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586
Yes, there is now a concrete proposal, apache/iceberg#10260, and the plan currently is to bring it up in the next community sync.
Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.
That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard-core geospatial people need, while also being accessible to everyone else. If Parquet had had geospatial types from the start, we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding) and having the right options for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format, e.g. that a 'geo-ORC' could use the same fields and options that we added.
Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet, that should be possible.
I think I have collected sufficient comments and suggestions on this proposal and have modified it with the following changes:
As the discussion is maturing, I'd like to seek opinions from the Parquet community. Sorry to bother you, but we need to be serious about format changes. cc @emkornfield @gszadovszky @julienledem @pitrou @shangxinli @tustvold @xhochy
src/main/thrift/parquet.thrift (outdated)
/** Covering of geometries as a list of Google S2 cell ids */
2: list<i64> s2_cell_ids;
/** Covering of geometries as a list of Uber H3 indices */
3: list<i64> h3_indices;
Is it really forward-looking to hardcode company-specific statistics in the schema?
S2 and H3 are maybe similar to Snappy compression here (there are widely used open source implementations, each supported by one company, but no "standard" in the ISO or OGC sense).
That said, I think a more future-proof/vendor-agnostic thing to put here would be the bytes of a WKB-encoded polygon (which can encode an S2 or H3 covering, albeit less compactly). It would also need an edges field that is independent of the edges of the logical type (S2 and H3 coverings both have spherical edges, but the data may not).
I am not super familiar with Thrift's compatibility rules, but is it possible to leave this out for now and add additional fields later without breaking backward compatibility? If so, I would prefer leaving these out for now.
I also think these are a little too new for an initial change to the Parquet spec (but I will defer to @jiayuasu, who may have a more pressing use case for them).
I agree that S2 and H3 are probably too new to add at this time. The main idea is to show that we can add more specific types of geospatial statistics in this place later.
BTW, S2 and H3 are widely used and licensed under Apache 2.0; we can remove the specific company names when we decide to add them.
I am fine either way (adding S2/H3 or leaving them out). But if we want to add them, don't we also need to provide the precision/level of these IDs? Or can one tell the precision from the cell id itself?
I agree that I have missed precision / level of these IDs...
Thank you for these changes! Some comments from the spatial end of things 🙂
src/main/thrift/parquet.thrift (outdated)
 * WARNING: GeometryStatistics cannot be enabled for these encodings because
 * only leaf columns can have column statistics and page index.
In this case, the statistics for the leaf columns contain equivalent information to the bounding box (which might be worth mentioning here).
I think WKB = 0 is pretty clear, but the GeoArrow ones are not so clear, because they dictate that the entire column must be one geometry subtype (i.e. all points, or all lines, etc.), which can be limiting for users.
Maybe you already saw this, but we were chatting with @jiayuasu on the Iceberg proposal, and they had experimented with encodings like GeoLake native, which is a native encoding that reuses the same schema to represent all geometry subtypes.
As it may be a longer debate, maybe we can delay adding the non-WKB ones in this first pass?
The current PR tries to add maximum support as far as possible (including the native encoding type and various statistics) in order to seek opinions. I agree that we need to remove controversial/unclear things in the final PR.
Yes, as we discussed before, the current native encoding in GeoParquet (borrowed from GeoArrow) does not allow mixed types of geometries in the same column. In addition, its current design makes it hard for Iceberg to adopt. But I chatted with @paleolimbot last week, and the GeoArrow community is willing to make the necessary changes to make it work.
However, the main reason for having the native encoding in GeoParquet is to compute min/max statistics. If Parquet allows bbox / S2 / H3 as native statistics, the native encoding does not seem necessary. This would greatly increase the adoption of Parquet / GeoParquet.
So, given the uncertainty around this topic, maybe we can remove it in the final PR.
src/main/thrift/parquet.thrift
Outdated
/** | ||
* Additional informative metadata. | ||
* It can be used by GeoParquet to offload some of the column metadata. | ||
*/ | ||
2: optional string metadata; |
I know I was supportive of doing this initially; however, I think this would better accomplish its goal of interoperability to explicitly spell out crs and edges in Thrift language (as you did in a previous version of this PR). (As long as the Parquet community will be receptive to the open source geospatial community's feedback when current best practice changes, and so far it seems as though they are 🙂)
One absolutely critical reason to do this is that one cannot calculate a bounding box in the normal way if a geometry has spherical edges. Similarly, for geographic coordinates, if an edge crosses the antimeridian, the bbox will be invalid. If we include those details as part of the logical type, geo-naive writers can safely fill in column statistics with a fairly basic WKB parser.
That makes sense. I'll add them back with explanation above.
Another GeoParquet member checking in: My main question/concern about this is the degree to which it forces generic Parquet implementations to be aware of and care about geospatial data types. One of the best parts of GeoParquet is that it works with any existing Parquet implementation, as long as you can access the file metadata, because it can exist as a separate layer on top of Parquet. If existing Parquet implementations are updated such that they don't crash on unknown logical types, but don't have geospatial-specific support, will it be possible for users to access this geospatial logical type information directly? Or will this require that all Parquet implementations have some baseline level of geospatial support? In the latter case, I would worry significantly about ecosystem fragmentation.
That is a great point!
I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader. In Arrow C++ or pyarrow I believe there is access to a Parquet logical type already (more difficult than file-level metadata, which was already propagated to the Arrow schema, albeit incorrectly sometimes if there were multiple files or a renamed column involved). The second level might be the ability to write column statistics, which would require a WKB parser. The flip side of this argument is that embedding geospatial details in the format allows Parquet to be a more effective geospatial file format for the readers/writers that do care about these details.
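For a sense of scale, the "fairly basic WKB parser" a statistics writer needs is small. The sketch below (an illustration, not the PR's implementation) parses only a 2D WKB POINT with Python's standard `struct` module; a real writer would extend this to the other geometry types.

```python
import struct

def parse_wkb_point(buf):
    """Parse a 2D WKB POINT and return (x, y)."""
    byte_order = buf[0]                       # 1 = little-endian, 0 = big-endian
    endian = "<" if byte_order == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)
    if geom_type % 1000 != 1:                 # ISO WKB: geometry type 1 is POINT
        raise ValueError("not a POINT: type id %d" % geom_type)
    x, y = struct.unpack_from(endian + "dd", buf, 5)
    return x, y

# Build a little-endian WKB POINT(30 10) and round-trip it.
wkb = struct.pack("<BIdd", 1, 1, 30.0, 10.0)
print(parse_wkb_point(wkb))  # (30.0, 10.0)
```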
This seems to be another argument in favor of an "extension type" facility in Parquet.
Started a discussion about extension types here: https://lists.apache.org/thread/9xo3mp4n23p8psmrhso5t9q899vxwfjt |
A Parquet extension mechanism seems natural to me based on my previous work with Arrow extension types (which I think are great!), but it is also worth mentioning that the implementation work and choices are the same either way: implementations can choose to ignore GEOMETRY types, treat them as their storage type, add some infrastructure to inject support for reading/writing/pushing down filters into statistics, or build basic support for those things into the implementation. The ability to treat unknown logical types (extension or not) and the ability to interact with Parquet files at a lower level are good things for an implementation to do regardless of the extension capability of the format.
- remove file-level geo stats
- add custom wkb-encoded geometry stats
- comment out controversial items
I have just updated the PR with the following changes:
Thank you!
There is room for more precisely documenting some of the parameters (e.g., using language from the GeoParquet specification); however, I think others may have some comments regarding exactly which parameters are here and so I will leave those comments for if/when there is an agreement in principle 🙂
src/main/thrift/parquet.thrift
Outdated
/** S2 spatial index: http://s2geometry.io/ */
struct S2Index {
  /** Level of S2 cell ids. valid range is [0, 30] */
  1: required i32 level;
It may be worth removing this since it's no longer in the statistics?
Also: To my knowledge, level is embedded in the S2/H3 cell ID, and not all cell IDs in a covering need to have the same level (this is a useful property).
src/main/thrift/parquet.thrift
Outdated
union Envelope {
  1: BoundingBox bbox   // A bounding box of geometries if it can be built.
  2: Geometry covering  // A covering polygon of geometries if bbox is unavailable.
}
To avoid overloading various terms that are used elsewhere in the geospatial world, you probably want:

struct Covering {
  optional BoundingBox bbox   // A bounding box of geometries if it can be built.
  optional Geometry covering  // A covering polygon of geometries if bbox is unavailable.
}

Unless there's some technical limitation requiring that only one of these be provided, maybe use a struct?

(The term Envelope is defined somewhere in the simple features/SQL spec as very specifically the POLYGON representation of the bounding box.)
Also, if the optional Geometry covering is controversial we can drop it 🙂
Thanks @paleolimbot for the detailed feedback! I have changed the name and removed the controversial items. I will start a PoC implementation in parquet-java later this week if there is no strong objection.
src/main/thrift/parquet.thrift
Outdated
 * Edges of the geometry if it is a polygon. It may be different to the
 * edges attribute from the GEOMETRY logical type.
 */
2: optional Edges edges;
might be worth specifying whether this is logically required or if a default is assumed.
I'm inclined to make it required because the geometry is a polygon anyway, which has this attribute. Though the writer who generates this geometry statistics is required to fill the Edges field, it makes the reader's job easier in that it does not need to derive this attribute from the logical type. Sometimes the values in the geometry column do not have any polygon value, in which case the logical type may not have the Edges field. cc @paleolimbot
I think having this be required and explicitly marked as planar is appropriate (even if the column contains all points, this is a general-purpose geometry type). Explicitly marking it is better than having two ways to specify planar edges (by omission and by explicitly setting the value).
Converting between planar and spherical edges should always be an explicit operation (that happens to be very easy for points).
Is it better to use bool is_spherical instead of an enum Edges edges given the fact that it has only two options? @paleolimbot
That's a great question...a boolean leaves no room for a new edge interpolation mechanism to exist (and there is one, ellipsoidal edges, that nobody uses in practice but does exist).
src/main/thrift/parquet.thrift
Outdated
}

struct Covering {
  optional BoundingBox bbox // A bounding box of geometries if it can be built.
Please clarify whether these are mutually exclusive on a covering.
No, they can exist at the same time, especially when an implementation wants to generate different coverings together with the bbox. Actually I believe it is better to avoid this level of nesting to eliminate the confusion.
 * coordinates from each axis. Values of Z and M are omitted for 2D geometries.
 */
struct BoundingBox {
  1: required double xmin;
Is a flat structure used here to avoid the thrift nested serialization overhead of defining a MinMax struct or something similar?
Yes, just want to avoid another level of nesting.
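As a small hypothetical sketch (class and field names mirror the thrift proposal but are otherwise illustrative), a writer can maintain the flat BoundingBox fields with one comparison per field while scanning coordinates, with no nested min/max structs:

```python
import math

class BoundingBox:
    """Flat 2D bounding box accumulator, mirroring the proposed thrift layout."""
    def __init__(self):
        # Start inverted so the first update() sets all four fields.
        self.xmin = math.inf
        self.ymin = math.inf
        self.xmax = -math.inf
        self.ymax = -math.inf

    def update(self, x, y):
        # One comparison per flat field.
        self.xmin = min(self.xmin, x)
        self.ymin = min(self.ymin, y)
        self.xmax = max(self.xmax, x)
        self.ymax = max(self.ymax, y)

bbox = BoundingBox()
for x, y in [(30.0, 10.0), (40.0, 40.0), (20.0, 40.0)]:
    bbox.update(x, y)
print(bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax)  # 20.0 10.0 40.0 40.0
```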
1: optional Covering covering;

/**
 * The geometry types of all geometries, or an empty array if they are not
Does an empty array, unpopulated geometry types, or both indicate unknown?
I copied it from the GeoParquet specs: "An empty array explicitly signals that the geometry types are not known."
I think that you can make the geometry types required here, since the statistics themselves are optional. It is not possible to calculate a WKB bounding box without inspecting the uint32_t that defines both the geometry type and dimensions.

In the GeoParquet spec the bbox is optional in the metadata, and so there needed to be a route to allow for the situation where a writer didn't want to make a pass over the input. The canonical route to that here would be omitting the statistics.
geometry_types in GeoParquet is a list of strings. Should we use an enum instead? However, we would need to define items for different dimensions, e.g. "POINT" and "POINT Z".
An integer might make more sense if text is not appropriate (text is nice for JSON metadata but perhaps not for Parquet implementations). Since this is a WKB type, the integer used to encode the geometry type and dimensions is very easy to extract: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary

The integers used in WKB (the ISO variety, not EWKB) have the property that type_id / 1000 is either 0, 1, 2, or 3 and type_id % 1000 is the geometry_type.
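That arithmetic can be sketched in a few lines (function and table names here are illustrative, not part of the proposal):

```python
# ISO WKB packs the dimension group and geometry type into one integer:
# thousands digit selects XY/XYZ/XYM/XYZM, remainder selects the type.
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}
GEOMETRY_TYPES = {
    1: "POINT", 2: "LINESTRING", 3: "POLYGON", 4: "MULTIPOINT",
    5: "MULTILINESTRING", 6: "MULTIPOLYGON", 7: "GEOMETRYCOLLECTION",
}

def decode_wkb_type(type_id):
    """Split an ISO WKB type id into (geometry type name, dimension group)."""
    return GEOMETRY_TYPES[type_id % 1000], DIMENSIONS[type_id // 1000]

print(decode_wkb_type(1))     # ('POINT', 'XY')
print(decode_wkb_type(3003))  # ('POLYGON', 'XYZM')
```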
It sounds like we can directly define an enum like the one below:
enum GeometrySubType {
POINT = 1;
LINESTRING = 2;
POLYGON = 3;
MULTIPOINT = 4;
MULTILINESTRING = 5;
MULTIPOLYGON = 6;
GEOMETRYCOLLECTION = 7;
POINT_Z = 1001;
LINESTRING_Z = 1002;
POLYGON_Z = 1003;
MULTIPOINT_Z = 1004;
MULTILINESTRING_Z = 1005;
MULTIPOLYGON_Z = 1006;
GEOMETRYCOLLECTION_Z = 1007;
POINT_M = 2001;
LINESTRING_M = 2002;
POLYGON_M = 2003;
MULTIPOINT_M = 2004;
MULTILINESTRING_M = 2005;
MULTIPOLYGON_M = 2006;
GEOMETRYCOLLECTION_M = 2007;
POINT_ZM = 3001;
LINESTRING_ZM = 3002;
POLYGON_ZM = 3003;
MULTIPOINT_ZM = 3004;
MULTILINESTRING_ZM = 3005;
MULTIPOLYGON_ZM = 3006;
GEOMETRYCOLLECTION_ZM = 3007;
}
If the above list is too verbose and hard to maintain, we can simply use list<i32> to hold the integer codes.
src/main/thrift/parquet.thrift
Outdated
 * Additional informative metadata.
 * It can be used by GeoParquet to offload some of the column metadata.
 */
4: optional string metadata;
should this be string or binary?
Good question; binary sounds better since it allows custom encoding.
@wgtmac sorry for the late review, mostly comments on being more explicit on semantics in some places.
Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.