
PARQUET-2471: Add geometry logical type #240

Open · wants to merge 8 commits into base: master
Conversation

@wgtmac (Member) commented May 10, 2024:

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

@jiayuasu (Member):
@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while and a lot of geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu (Member):
Geo Iceberg does not need to conform to GeoParquet because people should not directly use a parquet reader to read iceberg parquet files anyway. So that's a separate story.

@wgtmac (Member Author) commented May 11, 2024:

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here could meet the requirements of both Iceberg and GeoParquet, or share enough common ground to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make that possible.

@szehon-ho left a comment:

From the Iceberg side, I am excited about this. I think it will make geospatial interop easier in the long run to define the type formally in parquet-format, and it also unlocks row group filtering (for example, for Iceberg's add_file for parquet files). Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.

I'm new on the parquet side, so I had some questions.

* For now only OGC:CRS84 is supported.
*/
3: optional string crs;
4: required Edges edges;

Comment:

If we support Edges=spherical, it looks like we need to support 'orientation' in order to correctly interpret polygons. See: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6zbI

Member Author:

Let me take a look.

Member:

in order to correctly interpret polygons

Technically that only applies to polygons that cover more than half the earth. As long as your polygons fit within a hemisphere (as most polygons do), you don't need to know if they are oriented (although it is faster to import them if you do know this in advance).

Member Author:

Let me add an orientation enum field with two values: counterclockwise and clockwise.

Comment:

We had a very extensive discussion in GeoParquet to get to our choice of 'If present must be "counterclockwise"; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order.' See this comment from the main author of GDAL: opengeospatial/geoparquet#46 (comment), and indeed the entire discussion on that issue.

Basically there aren't any existing specs that offer two options, so we didn't want to be the first, but we structured it so that 'clockwise' could be added later.

* The dimension of the geometry.
* For now only 2D geometry is supported and the value must be 2 if set.
*/
2: optional byte dimension;

Comment:

Not sure if it's too ambitious, as a bounding box would not work for 3D.

Member Author:

Perhaps we can define a different ColumnOrder for 3D? Any good zone-map candidate to use for 3D?

Comment:

Yeah, maybe it is possible here: just min (x, y, z) and max (x, y, z) coordinates. cc @jiayuasu to check my understanding.

Member:

dimension (in the singular) is typically referring to the point (0), line (1) or polygon (2)-ness of a feature. I would recommend using an enum here to cover all the options (xy, xyz, xym, xyzm).

Member:

+1 to enum.

There is xy, xyz, xym, xyzm: https://postgis.net/docs/ST_NDims.html

And ST_Dimension usually refers to point, line, polygon: https://postgis.net/docs/ST_Dimension.html

Member:

@szehon-ho what do you mean by bbox would not work for 3D?

If the 3D you mean is xyz, then a 3D bbox should work.
If the 3D you mean is spherical, then the coordinates are still 2D and a 2D bbox will not work.

Comment:

Yeah, sorry; hence my second comment pulling you in for your expertise :)

@wgtmac (Member Author) commented May 17, 2024:

If the dimension is xyz, how do we represent the 3D bbox? Is it still enough to use the (x_min, y_min, z_min) and (x_max, y_max, z_max) pair? For xym and xyzm, do we actually need m_min and m_max in the bbox?

Member:

I would probably expect the bounding box to include M if this dimensions value were set to XYM or XYZM (although it is probably rarely if ever used). There is some precedent for a bounding box having an independent dimensions value from the data ( http://www.geopackage.org/spec/#gpb_format ... see "envelope contents indicator").

Member:

Also probably worth mentioning in the comment here that if this value is set, the values in the column must have those dimensions (I assume that was the intention of making it optional).

* arrays do not include a length prefix.
* Values are encoded using PLAIN encoding, except that:
* 1) variable-length byte arrays do not include a length prefix.
* 2) geometry logical type with BoundingBoxOrder uses max_value/min_value pair
@szehon-ho commented May 11, 2024:

This is great to have as part of statistics, but I'm trying to wrap my head around how writer implementations can get min_value/max_value in the WKB case; I'm not sure they have the bounding box easily unless they deserialize. I think that's where the advanced geospatial encoding can be more beneficial.

Member Author:

You're right. As you can see, max_value and min_value here are of binary type, which is the serialized form and cannot be consumed directly. AFAIK, the C++ and Java Parquet implementations provide functions to access the deserialized values from the Statistics class. We could also define an easy-to-use bounding box class and expose it from the Statistics implementation, but that's not a topic for the specs.

Member Author:

trying to wrap my head around how writer implementations can get min_value/max_value in the WKB case

We have two options here:

  1. If the input data is a well-defined geometry object, collecting the bounding box should be easy, but the interface of the parquet library would be complicated.
  2. The parquet writer only accepts WKB-encoded binary data; the writer is then required to deserialize values to get the coordinates, which degrades performance (see the sketch below).
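To illustrate option 2, here is a minimal Java sketch, assuming JTS (org.locationtech.jts) as the geometry library; the class name BboxAccumulator is hypothetical:

```java
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

// Hypothetical helper: maintains a running bounding box over WKB values,
// paying the full deserialization cost per value (the drawback of option 2).
class BboxAccumulator {
  private final WKBReader reader = new WKBReader();
  private final Envelope bbox = new Envelope(); // starts out "null" (empty)

  void update(byte[] wkb) throws ParseException {
    // Deserialize the WKB blob just to extract its envelope.
    bbox.expandToInclude(reader.read(wkb).getEnvelopeInternal());
  }

  double xMin() { return bbox.getMinX(); }
  double yMin() { return bbox.getMinY(); }
  double xMax() { return bbox.getMaxX(); }
  double yMax() { return bbox.getMaxY(); }
}
```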

Comment:

Yeah, I guess option 1 makes sense here. I need to think it through, but in this use case we may not even need advanced 'native' encoding options as long as we can save the bounding box in Parquet stats? cc @jiayuasu

I guess the Parquet Java writer/reader may need to depend on JTS Geometry though, which is the normal in-memory representation, unless we come up with something else here.

Member Author:

Yes, option 1 is more efficient but option 2 might be easier for different parquet impls. parquet-mr (the Java impl of parquet) will depend on JTS, and it is pretty natural for it to accept JTS Geometry as input data. However, for other parquet impls (e.g. parquet-cpp from Arrow C++, or parquet rust from arrow-rs), perhaps we need to leverage GeoArrow?

cc @pitrou @mapleFU @tustvold @zeroshade @etseidl to get awareness of this.

Member:

@szehon-ho

I think the native encoding is no longer really needed if the writer can compute the bbox itself.

Both option 1 and option 2 sound fine to me. And if the performance is NOT degraded too much, maybe option 2 is even a better choice in terms of adoption?

CC @zhangfengcdt @Kontinuation

@wgtmac marked this pull request as ready for review on May 11, 2024 16:13
@wgtmac changed the title from "WIP: Add geometry logical type" to "PARQUET-2471: Add geometry logical type" on May 11, 2024
* x_max, y_min, and y_max of all coordinates from all geometry values. For simplicity,
* min_value field in the Statistics/ColumnIndex is encoded as the concatenation of
* PLAIN-encoded DOUBLE-typed x_min and y_min values. Similarly, max_value field is
* encoded as the concatenation of PLAIN-encoded DOUBLE-typed x_max and y_max values.
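As a concrete illustration of the byte layout described above, here is a minimal Java sketch (hypothetical helper; assumes Parquet's PLAIN encoding of DOUBLE, which is little-endian IEEE 754):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for the min_value/max_value layout sketched above:
// two PLAIN-encoded DOUBLEs concatenated, e.g. (x_min, y_min) for min_value.
final class BboxStatsCodec {
  static byte[] encodePair(double a, double b) {
    return ByteBuffer.allocate(16)
        .order(ByteOrder.LITTLE_ENDIAN) // PLAIN doubles are little-endian
        .putDouble(a)
        .putDouble(b)
        .array();
  }

  static double[] decodePair(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    return new double[] { buf.getDouble(), buf.getDouble() };
  }
}
```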
Contributor:

Are there possibly other encodings that might be more useful here? I seem to recall that interleaved encoding (i.e. z-order) might be useful for searching, but I'd have to go back and do some reading to confirm.

Member Author:

CMIIW, z-order and Hilbert curves apply to points only. We might need to obtain a point (e.g. the centroid) from each geometry to apply a space-filling curve, but I doubt its usability compared to a bounding box.

Member:

+1. Z-order and Hilbert curves apply to points only, unless we use something like XZ2.

Member:

An S2 or H3 covering would be useful as a column statistic here: it provides a much better description of a "bounding" on the sphere that readers can leverage to skip reads. I'm not familiar with the limitations here, but can we optionally provide all of these? (S2 and H3 coverings are both variable-sized lists of int64.)

Member Author:

Here the column statistics are the aggregate index for a group of geometry values. For example, column statistics are usually collected from a data page of 10,000 values, where a bbox is more suitable. CMIIW, S2/H3 coverings might be more efficient for spatial joins or record-level filtering.

Member:

I think the suitability of a bbox vs. a covering is orthogonal to the number of values...a single bounding box is not able to represent things like Alaska or Fiji (which straddle the antimeridian) or Antarctica (which covers a pole). There are workarounds, of course: you can use two bounding boxes in some cases, use some conventions for bounding boxes that cover a pole, specify that if min_lat > max_lat that the bounding box straddles the antimeridian, etc.

Member:

The GeoJSON spec uses "southwest" and "northeast" to remove the antimeridian bbox ambiguity (https://datatracker.ietf.org/doc/html/rfc7946#section-5).

A somewhat future-proof way to support discrete global grid coverings (e.g., S2, H3) would be to add an optional WKB-polygon to be specified here (i.e., quite literally a polygon that covers all the features in the row group).

@pitrou (Member) commented May 15, 2024:

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

@pitrou (Member) commented May 15, 2024:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@paleolimbot (Member) left a comment:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, chiefly column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree); I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata. The things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the ways schema metadata is typically propagated through projections and concatenations.


@wgtmac (Member Author) commented May 17, 2024:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove informative attributes like subtype, orientation, and edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something we must be aware of, because we need to build the bbox, which depends on whether coordinates are represented as xy, xyz, xym, or xyzm.

@wgtmac (Member Author) commented May 17, 2024:

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this could be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It could be at row group level or file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

@paleolimbot (Member):

If you were to rethink the design of GeoParquet, how could it do better if the parquet format had some geospatial knowledge?

The main reason the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing; they give reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

@wgtmac (Member Author) commented May 18, 2024:

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping the conversion to GeoParquet lightweight.

@Kontinuation (Member):

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this could be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It could be at row group level or file level as well.

The bounding-box based sort order defined for the geometry logical type is already good enough for performing row-level and page-level data skipping. A spatial index such as an R-tree may not be suitable for Parquet. I am aware that flatgeobuf has an optional static packed Hilbert R-tree index, but for that index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to randomly access records within pages.

@paleolimbot (Member):

I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet, not something well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

A spatial index such as an R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

@paleolimbot (Member):

I think we really need your advise here. If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge?

I opened opengeospatial/geoparquet#222 to collect some thoughts on this. We discussed it at our community call, and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

@jorisvandenbossche (Member):

Just to ensure my understanding is correct:

  • This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), is that compatible if the column starts having a logical type as well? (Although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type.)
  • For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) is the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless when those binary blobs represent WKB geometries. So again, that should probably not give any compatibility issues?

@jorisvandenbossche (Member):

although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type

To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story...

@wgtmac (Member Author) commented May 21, 2024:

@jorisvandenbossche I think your concern makes sense. It would be a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have concerns about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

@jornfranke commented May 21, 2024:

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

@szehon-ho:

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal (apache/iceberg#10260), and the current plan is to bring it up in the next community sync.

@cholmes commented May 23, 2024:

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping the conversion to GeoParquet lightweight.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

@wgtmac (Member Author) commented May 27, 2024:

I think I have collected sufficient comments and suggestions on this proposal and have modified it with the following changes:

  1. Removed explicit geo-specific metadata as much as possible and added an optional metadata field of string type. This makes it easy for Apache Iceberg to adopt by simply enabling the new logical type without extra setup. GeoParquet can leverage the metadata field to offload its JSON-style column metadata.
  2. Added a GeometryStatistics struct to store various statistics including bbox, S2/H3 covering, geometry types, etc. They can be put at page level, row-group level, and even file level (in the logical type metadata).
  3. I tried to add native encoding (same as GeoParquet/GeoArrow) as well. But at the moment the parquet spec does not allow statistics on non-leaf columns, meaning it might be hacky to attach GeometryStatistics to native encoding types. I'm open to suggestions on this.

@wgtmac (Member Author) commented May 27, 2024:

As the discussion is getting mature, I'd like to seek opinions from the parquet community. Sorry to bother, but we need to be serious about format changes. cc @emkornfield @gszadovszky @julienledem @pitrou @shangxinli @tustvold @xhochy

Comment on lines 259 to 262
/** Covering of geometries as a list of Google S2 cell ids */
2: list<i64> s2_cell_ids;
/** Covering of geometries as a list of Uber H3 indices */
3: list<i64> h3_indices;
Member:

Is it really forward-looking to hardcode company-specific statistics in the schema?

Member:

S2 and H3 are maybe similar to Snappy compression here (there are widely used open source implementations each supported by one company but no "standard" in the ISO or OGC sense).

That said, I think that a more future-proof/vendor-agnostic thing to put here would be the bytes of a WKB-encoded polygon (which can encode an s2 or h3 covering, albeit less compactly). It would also need an edges field that is independent of the edges of the logical type (s2 and h3 coverings both have spherical edges but the data may not).

Member:

I am not super familiar with thrift's compatibility, but is it possible to leave this out for now and add additional fields later without breaking backward compatibility?

If so, I think I would prefer leaving these out for now.

Member:

I also think that these are a little new for an initial change to the Parquet spec (but I will defer to @jiayuasu, who may have a more pressing use case for them).

Member Author:

I agree that S2 and H3 are probably too new to add at this time. The main idea is to show that we can extend to more specific types of geospatial statistics here.

BTW, S2 and H3 are widely used and licensed under Apache 2.0; we can remove the specific company names if we decide to add them.

Member:

I am fine either way (adding s2/h3 or leaving them out). But if we want to add them, don't we also need to provide the precision / level of these IDs? Or can one tell the precision from the cell id itself?

Member Author:

I agree that I have missed precision / level of these IDs...

@paleolimbot (Member) left a comment:

Thank you for these changes! Some comments from the spatial end of things 🙂

Comment on lines 436 to 437
* WARNING: GeometryStatistics cannot be enabled for these encodings because
* only leaf columns can have column statistics and page index.
Member:

In this case, the statistics for the leaf columns contain equivalent information to the bounding box (which might be worth mentioning here).

Comment:

I think WKB = 0 is pretty clear, but the GeoArrow ones are not so clear, because they dictate that the entire column must be one geometry subtype (i.e. all Points, or all Lines, etc.)? And so it can be limiting for users.

Maybe you already saw this, but we were chatting with @jiayuasu on the Iceberg proposal, and they had experimented with encodings like GeoLake native, which is a native encoding that re-uses the same schema to represent all geometry subtypes.

As it may be a longer debate, maybe we can delay adding the non-WKB ones in this first pass?

Member Author:

The current PR tries to add maximum support as far as I can see (including the native encoding type and various statistics) to seek opinions. I agree that we need to remove controversial/unclear things in the final PR.

Member:

Yes, as we discussed before, the current native encoding in GeoParquet (borrowed from GeoArrow) does not allow mixed types of geometries in the same column. In addition, its current design makes it hard for Iceberg to adopt. But I chatted with @paleolimbot last week, and the GeoArrow community is willing to make the necessary changes to make it work.

However, the main reason for having the native encoding in GeoParquet is to compute min/max statistics. If Parquet allows bbox / s2 / h3 as native statistics, the native encoding does not seem to be necessary. This will greatly increase the adoption of Parquet / GeoParquet.

So, given the uncertainty of this topic, maybe we can remove it in the final PR.

Comment on lines 470 to 474
/**
* Additional informative metadata.
* It can be used by GeoParquet to offload some of the column metadata.
*/
2: optional string metadata;
Member:

I know I was supportive of doing this initially; however, I now think it would better accomplish its goal of interoperability to explicitly spell out crs and edges in the thrift definition (as you did in a previous version of this PR). (As long as the Parquet community will be receptive to the open source geospatial community's feedback when current best practice changes, and so far it seems as though they are 🙂)

Member:

One absolutely critical reason to do this is that one cannot calculate a bounding box in the normal way if a geometry has spherical edges. Similarly, for geographic coordinates, if an edge crosses the antimeridian, the bbox will be invalid. If we include those details as part of the logical type, geo-naive writers can safely fill in column statistics with a fairly basic WKB parser.
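As a toy illustration of the antimeridian point (hypothetical values, plain Java):

```java
// Two points 1 degree apart across the antimeridian (e.g. near Fiji).
// Naive min/max produces a "bounding box" spanning 359 degrees of longitude
// instead of the actual 1-degree extent, making the statistic useless.
final class AntimeridianDemo {
  public static void main(String[] args) {
    double lonA = 179.5, lonB = -179.5;
    double min = Math.min(lonA, lonB); // -179.5
    double max = Math.max(lonA, lonB); //  179.5
    System.out.printf("naive bbox spans %.1f degrees of longitude%n", max - min);
    // A convention such as min > max (per the discussion above) could instead
    // encode the small box that actually crosses the antimeridian.
  }
}
```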

Member Author:

That makes sense. I'll add them back with an explanation.

@kylebarron:
Another GeoParquet member checking in: My main question/concern about this is the degree to which it forces generic Parquet implementations to be aware of and care about geospatial data types.

One of the best parts of GeoParquet is that it works with any existing Parquet implementation, as long as you can access the file metadata, because it can exist as a separate layer on top of Parquet. If existing Parquet implementations are updated such that they don't crash on unknown logical types, but don't have geospatial-specific support, will it be possible for users to access this geospatial logical type information directly? Or will this require that all Parquet implementations have some baseline level of geospatial support? In the latter case, I would worry significantly about ecosystem fragmentation.

@paleolimbot (Member) commented May 28, 2024:

That is a great point!

Or will this require that all Parquet implementations have some baseline level of geospatial support?

I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader. In Arrow C++ or pyarrow I believe there is access to a Parquet logical type already (more difficult than file-level metadata, which was already propagated to the Arrow schema, albeit sometimes incorrectly if there were multiple files or a renamed column involved). The second level might be the ability to write column statistics, which would require a WKB parser.

The flip side of this argument is that embedding geospatial details in the format allows Parquet to be a more effective geospatial file format for the readers/writers that do care about these details.

@pitrou (Member) commented May 28, 2024:

Or will this require that all Parquet implementations have some baseline level of geospatial support?

I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader.

This seems to be another argument in favor of an "extension type" facility in Parquet.

@pitrou (Member) commented May 28, 2024:

Started a discussion about extension types here: https://lists.apache.org/thread/9xo3mp4n23p8psmrhso5t9q899vxwfjt

@paleolimbot (Member):

This seems to be another argument in favor of an "extension type" facility in Parquet.

A Parquet extension mechanism seems natural to me based on my previous work with Arrow extension types (which I think are great!), but it is also worth mentioning that the implementation choices are the same either way: implementations can choose to ignore GEOMETRY types, treat them as their storage type, add some infrastructure to inject support for reading/writing/pushing down filters into statistics, or build basic support for those things into the implementation. The ability to treat unknown logical types (extension or not) as their storage type, and the ability to interact with Parquet files at a lower level, are good things for an implementation to have regardless of the extension capability of the format.

- remove file-level geo stats
- add custom wkb-encoded geometry stats
- comment out controversial items
@wgtmac (Member Author) commented May 31, 2024:

I have just updated the PR with the following changes:

  • Commented out S2Index, H3Index and the native single-geometry encodings. They are left here to demonstrate possible future extensions.
  • Added a custom WKB-encoded geometry struct (as suggested by @paleolimbot) which could be used to represent coverings including bbox, S2, H3, etc.
  • Removed file-level geo stats, as parquet generally does not support file-level stats for other types.
  • Added explicit crs and edges fields to the new logical type.

@paleolimbot (Member) left a comment:

Thank you!

There is room for more precisely documenting some of the parameters (e.g., using language from the GeoParquet specification); however, I think others may have some comments regarding exactly which parameters are here and so I will leave those comments for if/when there is an agreement in principle 🙂

Comment on lines 287 to 290
/** S2 spatial index: http://s2geometry.io/ */
struct S2Index {
/** Level of S2 cell ids. valid range is [0, 30] */
1: required i32 level;
Member:

It may be worth removing this since it's no longer in the statistics?

Also: To my knowledge, level is embedded in the S2/H3 cell ID, and not all cell IDs in a covering need to have the same level (this is a useful property).
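A small sketch of that property for S2 (hypothetical helper; assumes a valid, non-zero cell id): the level is recoverable from the position of the id's lowest set bit, so a covering needs no separate level field.

```java
// S2 cell ids encode a 3-bit face and 2 bits per level, terminated by a
// sentinel lowest set bit; the maximum level is 30.
final class S2CellLevel {
  static int level(long cellId) {
    return 30 - (Long.numberOfTrailingZeros(cellId) >> 1);
  }

  public static void main(String[] args) {
    long leafCell = 1L;                   // sentinel in bit 0: a leaf cell
    System.out.println(level(leafCell)); // prints 30
  }
}
```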

Comment on lines 282 to 285
union Envelope {
1: BoundingBox bbox // A bounding box of geometries if it can be built.
2: Geometry covering // A covering polygon of geometries if bbox is unavailable.
}
Member:

To avoid overloading various terms that are used elsewhere in the geospatial world, you probably want:

struct Covering {
  1: optional BoundingBox bbox;  // A bounding box of geometries if it can be built.
  2: optional Geometry covering; // A covering polygon of geometries if bbox is unavailable.
}

Unless there's some technical limitation that requires providing only one of these, maybe use a struct?

(The term Envelope is defined somewhere in the simple features/SQL spec as very specifically the POLYGON representation of the bounding box.)

Member:

Also, if the optional Geometry covering is controversial we can drop it 🙂

@wgtmac (Member Author) commented Jun 13, 2024:

Thanks @paleolimbot for the detailed feedback! I have changed the name and removed the controversial items. I will start a PoC implementation in parquet-java later this week if there is no strong objection.

* Edges of the geometry if it is a polygon. It may be different from the
* edges attribute of the GEOMETRY logical type.
*/
2: optional Edges edges;
Contributor:

might be worth specifying whether this is logically required or if a default is assumed.

@wgtmac (Member Author) commented Jun 16, 2024:

I'm inclined to make it required, because the geometry here is a polygon anyway, which has this attribute. Though the writer that generates these geometry statistics is required to fill the Edges field, it makes things easier for the reader in that it does not need to derive this attribute from the logical type. Sometimes the values in the geometry column do not contain any polygon value, in which case the logical type may not have the Edges field. cc @paleolimbot

Member:

I think having this be required and explicitly marked as planar is appropriate (even if the column contains all points, this is a general-purpose geometry type). Explicitly marking it is better than having two ways to specify planar edges (by omission and by explicitly setting the value).

Converting between planar and spherical edges should always be an explicit operation (that happens to be very easy for points).

Member Author:

Is it better to use a bool is_spherical instead of an enum Edges edges, given that it has only two options? @paleolimbot

Member:

That's a great question...a boolean leaves no room for a new edge interpolation mechanism to exist (and there is one, ellipsoidal edges, that nobody uses in practice but does exist).

}

struct Covering {
optional BoundingBox bbox // A bounding box of geometries if it can be built.
Contributor:

Please clarify whether these are mutually exclusive on a covering.

Member Author:

No, they can exist at the same time, especially when an implementation wants to generate different coverings together with the bbox. Actually, I believe it is better to avoid this level of nesting to eliminate the confusion.

* coordinates from each axis. Values of Z and M are omitted for 2D geometries.
*/
struct BoundingBox {
1: required double xmin;
Contributor:

Is a flat structure used here to avoid the thrift serialization overhead of defining a nested MinMax struct or something similar?

Member Author:

Yes, just want to avoid another level of nesting.

1: optional Covering covering;

/**
* The geometry types of all geometries, or an empty array if they are not
Contributor:

Does an empty array, an unpopulated geometry_types field, or both indicate unknown?

Member Author:

I copied it from the GeoParquet specs: An empty array explicitly signals that the geometry types are not known.

@paleolimbot (Member) commented Jun 17, 2024:

I think that you can make the geometry types required here, since the statistics themselves are optional. It is not possible to calculate a WKB bounding box without inspecting the uint32_t that defines both the geometry type and dimensions.

In the GeoParquet spec the bbox is optional in the metadata, and so there needed to be a route to allow for the situation where a writer didn't want to make a pass over the input. The canonical route to that here would be omitting the statistics.

Member Author:

geometry_types in GeoParquet is a list of strings. Should we use an enum instead? However, we would need to define items for different dimensions, e.g. "POINT" and "POINT Z".

Member:

An integer might make more sense if text is not appropriate (text is nice for JSON metadata but perhaps not for Parquet implementations). Since this is a WKB type, the integer used to encode the geometry type and dimensions is very easy to extract: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary .

The integers used in WKB (the ISO variety, not EWKB) have the property that type_id / 1000 is either 0, 1, 2, or 3 and type_id % 1000 is the geometry_type.
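A minimal Java sketch of that extraction (hypothetical helper; assumes ISO WKB, where byte 0 selects the byte order and bytes 1-4 hold the type id):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Reads the 5-byte WKB header: 1 byte-order byte, then a uint32 type id.
final class WkbHeader {
  static int typeId(byte[] wkb) {
    ByteOrder order = (wkb[0] == 1) ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
    return ByteBuffer.wrap(wkb, 1, 4).order(order).getInt();
  }

  static int geometryType(int typeId) { return typeId % 1000; } // 1=POINT ... 7=GEOMETRYCOLLECTION
  static int dimensions(int typeId)   { return typeId / 1000; } // 0=XY, 1=XYZ, 2=XYM, 3=XYZM
}
```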

Member Author:

It sounds like we can directly define an enum like the one below:

enum GeometrySubType {
  POINT = 1;
  LINESTRING = 2;
  POLYGON = 3;
  MULTIPOINT = 4;
  MULTILINESTRING = 5;
  MULTIPOLYGON = 6;
  GEOMETRYCOLLECTION = 7;

  POINT_Z = 1001;
  LINESTRING_Z = 1002;
  POLYGON_Z = 1003;
  MULTIPOINT_Z = 1004;
  MULTILINESTRING_Z = 1005;
  MULTIPOLYGON_Z = 1006;
  GEOMETRYCOLLECTION_Z = 1007;

  POINT_M = 2001;
  LINESTRING_M = 2002;
  POLYGON_M = 2003;
  MULTIPOINT_M = 2004;
  MULTILINESTRING_M = 2005;
  MULTIPOLYGON_M = 2006;
  GEOMETRYCOLLECTION_M = 2007;

  POINT_ZM = 3001;
  LINESTRING_ZM = 3002;
  POLYGON_ZM = 3003;
  MULTIPOINT_ZM = 3004;
  MULTILINESTRING_ZM = 3005;
  MULTIPOLYGON_ZM = 3006;
  GEOMETRYCOLLECTION_ZM = 3007;
}

If the above list is too verbose and hard to maintain, we can simply use list<i32> to hold the integer codes.

* Additional informative metadata.
* It can be used by GeoParquet to offload some of the column metadata.
*/
4: optional string metadata;
@emkornfield (Contributor) commented Jun 14, 2024:

should this be string or binary?

Member Author:

Good question; binary sounds better, as it allows custom encodings.

@emkornfield (Contributor):

@wgtmac sorry for the late review; mostly comments on being more explicit about semantics in some places.
