
PARQUET-2471: Add geometry logical type #240

Open · wants to merge 8 commits into base: master
Conversation

@wgtmac (Member) commented May 10, 2024:

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

@jiayuasu (Member):
@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while and a lot of geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu (Member):
Geo Iceberg does not need to conform to GeoParquet because people should not directly use a parquet reader to read iceberg parquet files anyway. So that's a separate story.

@wgtmac (Member Author) commented May 11, 2024:

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here could meet the requirements of both Iceberg and GeoParquet, or share enough common ground to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make that possible.

@szehon-ho left a comment:

From the Iceberg side, I am excited about this. I think it will make geospatial interop easier in the long run to define the type formally in parquet-format, and it also unlocks row group filtering (for example, for Iceberg's add_file for parquet files). Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.

I'm new on the parquet side, so I had some questions.

* For now only OGC:CRS84 is supported.
*/
3: optional string crs;
4: required Edges edges;

Comment:

If we support Edges=spherical, it looks like we need to support 'orientation' in order to correctly interpret polygons. See: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit?disco=AAABL-z6zbI

Member Author:

Let me take a look.

Member:

in order to correctly interpret polygons

Technically that only applies to polygons that cover more than half the earth. As long as your polygons fit within a hemisphere (as most polygons do), you don't need to know if they are oriented (although it is faster to import them if you do know this in advance).

Member Author:

Let me add an orientation enum field with two values: counterclockwise and clockwise.

Comment:

We had a very extensive discussion in GeoParquet to get to our choice of 'If present must be "counterclockwise"; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order.' See this comment from the main author of GDAL: opengeospatial/geoparquet#46 (comment), and indeed the entire discussion on that issue.

Basically there aren't any existing specs that offer two options, so we didn't want to be the first, but we structured it so that 'clockwise' could be added later.

* The dimension of the geometry.
* For now only 2D geometry is supported and the value must be 2 if set.
*/
2: optional byte dimension;

Comment:

Not sure if it's too ambitious, as a bounding box would not work for 3D.

Member Author:

Perhaps we can define a different ColumnOrder for 3D? Any good zone-map candidate to use for 3D?

Comment:

Yeah, maybe it is possible here: just min (x, y, z) and max (x, y, z) coordinates. cc @jiayuasu to check my understanding.

Member:

dimension (in the singular) is typically referring to the point (0), line (1) or polygon (2)-ness of a feature. I would recommend using an enum here to cover all the options (xy, xyz, xym, xyzm).

Member:

+1 to enum.

There is xy, xyz, xym, xyzm: https://postgis.net/docs/ST_NDims.html

And ST_Dimension usually refers to point, line, polygon: https://postgis.net/docs/ST_Dimension.html

Member:

@szehon-ho what do you mean by bbox would not work for 3D?

If the 3D you mean is xyz, then a 3D bbox should work.
If the 3D you mean is spherical, then the coordinates are still 2D and a 2D bbox will not work.

Comment:

Yeah, sorry; hence my second comment pulling you in for your expertise :)

@wgtmac (Member Author) commented May 17, 2024:

If the dimension is xyz, how do we represent the 3D bbox? Is it still enough to use the (x_min, y_min, z_min) and (x_max, y_max, z_max) pair? For xym and xyzm, do we actually need m_min and m_max in the bbox?

Member:

I would probably expect the bounding box to include M if this dimensions value were set to XYM or XYZM (although it is probably rarely if ever used). There is some precedent for a bounding box having an independent dimensions value from the data ( http://www.geopackage.org/spec/#gpb_format ... see "envelope contents indicator").

Member:

Also probably worth mentioning in the comment here that if this value is set, the values in the column must have those dimensions (I assume that was the intention of making it optional).

* arrays do not include a length prefix.
* Values are encoded using PLAIN encoding, except that:
* 1) variable-length byte arrays do not include a length prefix.
* 2) geometry logical type with BoundingBoxOrder uses max_value/min_value pair
@szehon-ho commented May 11, 2024:

This is great to have as part of statistics, but I'm trying to wrap my head around how writer implementations can get min_value/max_value in the WKB case; I'm not sure they have the bounding box easily unless they deserialize. I think that's where the advanced geospatial encoding can be more beneficial.

Member Author:

You're right. As you can see, max_value and min_value here are of binary type, which is the serialized form and cannot be consumed directly. AFAIK, the C++ and Java Parquet implementations provide functions to access the deserialized values from the Statistics class. We could also define an easy-to-use bounding box class and expose it from the Statistics implementation, but that's not a topic for the specs.

Member Author:

trying to wrap my head around how writer implementations can get min_value/max_value in the WKB case

We have two options here:

  1. If the input data is a well-defined geometry object, collecting the bounding box should be easy, but the interface of the parquet library would be complicated.
  2. The parquet writer only accepts WKB-encoded binary data; the writer is then required to deserialize values to get the coordinates, which degrades performance (see the sketch below).
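To illustrate option 2, here is a minimal Java sketch, assuming JTS (org.locationtech.jts) as the geometry library; the class name BboxAccumulator is hypothetical:

```java
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

// Hypothetical helper: maintains a running bounding box over WKB values,
// paying the full deserialization cost per value (the drawback of option 2).
class BboxAccumulator {
  private final WKBReader reader = new WKBReader();
  private final Envelope bbox = new Envelope(); // starts out "null" (empty)

  void update(byte[] wkb) throws ParseException {
    // Deserialize the WKB blob just to extract its envelope.
    bbox.expandToInclude(reader.read(wkb).getEnvelopeInternal());
  }

  double xMin() { return bbox.getMinX(); }
  double yMin() { return bbox.getMinY(); }
  double xMax() { return bbox.getMaxX(); }
  double yMax() { return bbox.getMaxY(); }
}
```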

Comment:

Yeah, I guess option 1 makes sense here. I need to think it through, but in this use case we may not even need advanced 'native' encoding options as long as we can save the bounding box in Parquet stats? cc @jiayuasu

I guess the Parquet Java writer/reader may need to depend on JTS Geometry though, which is the normal in-memory representation, unless we come up with something else here.

Member Author:

Yes, option 1 is more efficient but option 2 might be easier for different parquet impls. parquet-mr (the Java impl of parquet) will depend on JTS, and it is pretty natural for it to accept JTS Geometry as input data. However, for other parquet impls (e.g. parquet-cpp from Arrow C++, or parquet rust from arrow-rs), perhaps we need to leverage GeoArrow?

cc @pitrou @mapleFU @tustvold @zeroshade @etseidl to get awareness of this.

Member:

@szehon-ho

I think the native encoding is no longer really needed if the writer can compute the bbox itself.

Both option 1 and option 2 sound fine to me. And if the performance is NOT degraded too much, maybe option 2 is even a better choice in terms of adoption?

CC @zhangfengcdt @Kontinuation

@wgtmac marked this pull request as ready for review on May 11, 2024 16:13
@wgtmac changed the title from "WIP: Add geometry logical type" to "PARQUET-2471: Add geometry logical type" on May 11, 2024
* x_max, y_min, and y_max of all coordinates from all geometry values. For simplicity,
* min_value field in the Statistics/ColumnIndex is encoded as the concatenation of
* PLAIN-encoded DOUBLE-typed x_min and y_min values. Similarly, max_value field is
* encoded as the concatenation of PLAIN-encoded DOUBLE-typed x_max and y_max values.
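As a concrete illustration of the byte layout described above, here is a minimal Java sketch (hypothetical helper; assumes Parquet's PLAIN encoding of DOUBLE, which is little-endian IEEE 754):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for the min_value/max_value layout sketched above:
// two PLAIN-encoded DOUBLEs concatenated, e.g. (x_min, y_min) for min_value.
final class BboxStatsCodec {
  static byte[] encodePair(double a, double b) {
    return ByteBuffer.allocate(16)
        .order(ByteOrder.LITTLE_ENDIAN) // PLAIN doubles are little-endian
        .putDouble(a)
        .putDouble(b)
        .array();
  }

  static double[] decodePair(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    return new double[] { buf.getDouble(), buf.getDouble() };
  }
}
```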
Contributor:

Are there possibly other encodings that might be more useful here? I seem to recall that interleaved encoding (i.e. z-order) might be useful for searching, but I'd have to go back and do some reading to confirm.

Member Author:

CMIIW, z-order and Hilbert curves apply to points only. We might need to obtain a point (e.g. the centroid) from each geometry to apply a space-filling curve, but I doubt its usability compared to a bounding box.

Member:

+1. Z-order and Hilbert curves apply to points only, unless we use something like XZ2.

Member:

An S2 or H3 covering would be useful as a column statistic here: it provides a much better description of a "bounding" on the sphere that readers can leverage to skip reads. I'm not familiar with the limitations here, but can we optionally provide all of these? (S2 and H3 coverings are both variable-sized lists of int64.)

Member Author:

Here the column statistics are the aggregate index for a group of geometry values. For example, column statistics are usually collected from a data page of 10,000 values, where a bbox is more suitable. CMIIW, S2/H3 coverings might be more efficient for spatial joins or record-level filtering.

Member:

I think the suitability of a bbox vs. a covering is orthogonal to the number of values...a single bounding box is not able to represent things like Alaska or Fiji (which straddle the antimeridian) or Antarctica (which covers a pole). There are workarounds, of course: you can use two bounding boxes in some cases, use some conventions for bounding boxes that cover a pole, specify that if min_lat > max_lat that the bounding box straddles the antimeridian, etc.

Member:

The GeoJSON spec uses "southwest" and "northeast" to remove the antimeridian bbox ambiguity (https://datatracker.ietf.org/doc/html/rfc7946#section-5).

A somewhat future-proof way to support discrete global grid coverings (e.g., S2, H3) would be to add an optional WKB-polygon to be specified here (i.e., quite literally a polygon that covers all the features in the row group).

@pitrou (Member) commented May 15, 2024:

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

@pitrou (Member) commented May 15, 2024:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@paleolimbot (Member) left a comment:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, chiefly column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree); I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata. The things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale with the ways schema metadata is typically propagated through projections and concatenations.


@wgtmac (Member Author) commented May 17, 2024:

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same JSON string defined in the Iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove informative attributes like subtype, orientation, and edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something we must be aware of, because we need to build the bbox, which depends on whether coordinates are represented as xy, xyz, xym, or xyzm.

@wgtmac (Member Author) commented May 17, 2024:

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this could be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It could be at row group level or file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

@paleolimbot (Member):

If you were to rethink the design of GeoParquet, how could it do better if the parquet format had some geospatial knowledge?

The main reason the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing; they give reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

@wgtmac (Member Author) commented May 18, 2024:

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping the conversion to GeoParquet lightweight.

@Kontinuation (Member):

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this could be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It could be at row group level or file level as well.

The bounding-box based sort order defined for the geometry logical type is already good enough for performing row-level and page-level data skipping. A spatial index such as an R-tree may not be suitable for Parquet. I am aware that flatgeobuf has an optional static packed Hilbert R-tree index, but for that index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to randomly access records within pages.

@paleolimbot (Member):

I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet, not something well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

A spatial index such as an R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

@paleolimbot (Member):

I think we really need your advise here. If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge?

I opened opengeospatial/geoparquet#222 to collect some thoughts on this. We discussed it at our community call, and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

@jorisvandenbossche (Member):

Just to ensure my understanding is correct:

  • This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), is that compatible if the column starts having a logical type as well? (Although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type.)
  • For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) is the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless when those binary blobs represent WKB geometries. So again, that should probably not give any compatibility issues?

@jorisvandenbossche (Member):

although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error out or automatically fall back to the physical type

To answer this part myself: at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story...

@wgtmac (Member Author) commented May 21, 2024:

@jorisvandenbossche I think your concern makes sense. It would be a bug if parquet-cpp fails due to an unknown logical type, and we need to fix that. I also have concerns about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

@jornfranke commented May 21, 2024:

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

@szehon-ho:

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal (apache/iceberg#10260), and the current plan is to bring it up in the next community sync.

@cholmes commented May 23, 2024:

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and it makes sense to store them natively in parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems while keeping the conversion to GeoParquet lightweight.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

@wgtmac (Member Author) commented May 27, 2024:

I think I have collected sufficient comments and suggestions on this proposal and have modified it with the following changes:

  1. Removed explicit geo-specific metadata as much as possible and added an optional metadata field of string type. This makes it easy for Apache Iceberg to adopt by simply enabling the new logical type without extra setup. GeoParquet can leverage the metadata field to offload its JSON-style column metadata.
  2. Added a GeometryStatistics struct to store various statistics including bbox, S2/H3 covering, geometry types, etc. They can be put at page level, row-group level, and even file level (in the logical type metadata).
  3. I tried to add native encoding (same as GeoParquet/GeoArrow) as well. But at the moment the parquet spec does not allow statistics on non-leaf columns, meaning it might be hacky to attach GeometryStatistics to native encoding types. I'm open to suggestions on this.

@wgtmac (Member Author) commented May 27, 2024:

As the discussion is getting mature, I'd like to seek opinions from the parquet community. Sorry to bother, but we need to be serious about format changes. cc @emkornfield @gszadovszky @julienledem @pitrou @shangxinli @tustvold @xhochy

Comment on lines 259 to 262
/** Covering of geometries as a list of Google S2 cell ids */
2: list<i64> s2_cell_ids;
/** Covering of geometries as a list of Uber H3 indices */
3: list<i64> h3_indices;
Member:

Is it really forward-looking to hardcode company-specific statistics in the schema?

Member:

S2 and H3 are maybe similar to Snappy compression here (there are widely used open source implementations each supported by one company but no "standard" in the ISO or OGC sense).

That said, I think that a more future-proof/vendor-agnostic thing to put here would be the bytes of a WKB-encoded polygon (which can encode an s2 or h3 covering, albeit less compactly). It would also need an edges field that is independent of the edges of the logical type (s2 and h3 coverings both have spherical edges but the data may not).

Member:

I am not super familiar with thrift's compatibility, but is it possible to leave this out for now and add additional fields later without breaking backward compatibility?

If so, I think I would prefer leaving these out for now.

Member:

I also think that these are a little new for an initial change to the Parquet spec (but I will defer to @jiayuasu, who may have a more pressing use case for them).

Member Author:

I agree that S2 and H3 are probably too new to add at this time. The main idea is to show that we can extend to more specific types of geospatial statistics here.

BTW, S2 and H3 are widely used and licensed under Apache 2.0; we can remove the specific company names if we decide to add them.

Member:

I am fine either way (adding s2/h3 or leaving them out). But if we want to add them, don't we also need to provide the precision / level of these IDs? Or can one tell the precision from the cell id itself?

Member Author:

I agree that I have missed precision / level of these IDs...

@paleolimbot (Member) left a comment:

Thank you for these changes! Some comments from the spatial end of things 🙂

Comment on lines 436 to 437
* WARNING: GeometryStatistics cannot be enabled for these encodings because
* only leaf columns can have column statistics and page index.
Member:

In this case, the statistics for the leaf columns contain equivalent information to the bounding box (which might be worth mentioning here).

Comment:

I think WKB = 0 is pretty clear, but the GeoArrow ones are not so clear, because they dictate that the entire column must be one geometry subtype (i.e. all Points, or all Lines, etc.)? And so it can be limiting for users.

Maybe you already saw this, but we were chatting with @jiayuasu on the Iceberg proposal, and they had experimented with encodings like GeoLake native, which is a native encoding that re-uses the same schema to represent all geometry subtypes.

As it may be a longer debate, maybe we can delay adding the non-WKB ones in this first pass?

Member Author:

The current PR tries to add maximum support as far as I can see (including the native encoding type and various statistics) to seek opinions. I agree that we need to remove controversial/unclear things in the final PR.

Member:

Yes, as we discussed before, the current native encoding in GeoParquet (borrowed from GeoArrow) does not allow mixed types of geometries in the same column. In addition, its current design makes it hard for Iceberg to adopt. But I chatted with @paleolimbot last week, and the GeoArrow community is willing to make the necessary changes to make it work.

However, the main reason for having the native encoding in GeoParquet is to compute min/max statistics. If Parquet allows bbox / s2 / h3 as native statistics, the native encoding does not seem to be necessary. This will greatly increase the adoption of Parquet / GeoParquet.

So, given the uncertainty of this topic, maybe we can remove it in the final PR.

Comment on lines 470 to 474
/**
* Additional informative metadata.
* It can be used by GeoParquet to offload some of the column metadata.
*/
2: optional string metadata;
Member:

I know I was supportive of doing this initially; however, I now think it would better accomplish its goal of interoperability to explicitly spell out crs and edges in the thrift definition (as you did in a previous version of this PR). (As long as the Parquet community will be receptive to the open source geospatial community's feedback when current best practice changes, and so far it seems as though they are 🙂)

Member:

One absolutely critical reason to do this is that one cannot calculate a bounding box in the normal way if a geometry has spherical edges. Similarly, for geographic coordinates, if an edge crosses the antimeridian, the bbox will be invalid. If we include those details as part of the logical type, geo-naive writers can safely fill in column statistics with a fairly basic WKB parser.
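As a toy illustration of the antimeridian point (hypothetical values, plain Java):

```java
// Two points 1 degree apart across the antimeridian (e.g. near Fiji).
// Naive min/max produces a "bounding box" spanning 359 degrees of longitude
// instead of the actual 1-degree extent, making the statistic useless.
final class AntimeridianDemo {
  public static void main(String[] args) {
    double lonA = 179.5, lonB = -179.5;
    double min = Math.min(lonA, lonB); // -179.5
    double max = Math.max(lonA, lonB); //  179.5
    System.out.printf("naive bbox spans %.1f degrees of longitude%n", max - min);
    // A convention such as min > max (per the discussion above) could instead
    // encode the small box that actually crosses the antimeridian.
  }
}
```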

Member Author:

That makes sense. I'll add them back with an explanation.

@kylebarron:
Another GeoParquet member checking in: My main question/concern about this is the degree to which it forces generic Parquet implementations to be aware of and care about geospatial data types.

One of the best parts of GeoParquet is that it works with any existing Parquet implementation, as long as you can access the file metadata, because it can exist as a separate layer on top of Parquet. If existing Parquet implementations are updated such that they don't crash on unknown logical types, but don't have geospatial-specific support, will it be possible for users to access this geospatial logical type information directly? Or will this require that all Parquet implementations have some baseline level of geospatial support? In the latter case, I would worry significantly about ecosystem fragmentation.

@paleolimbot (Member) commented May 28, 2024:

That is a great point!

Or will this require that all Parquet implementations have some baseline level of geospatial support?

I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader. In Arrow C++ or pyarrow I believe there is access to a Parquet logical type already (more difficult than file-level metadata, which was already propagated to the Arrow schema, albeit sometimes incorrectly if there were multiple files or a renamed column involved). The second level might be the ability to write column statistics, which would require a WKB parser.

The flip side of this argument is that embedding geospatial details in the format allows Parquet to be a more effective geospatial file format for the readers/writers that do care about these details.

@pitrou (Member) commented May 28, 2024:

Or will this require that all Parquet implementations have some baseline level of geospatial support?

I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader.

This seems to be another argument in favor of an "extension type" facility in Parquet.

@pitrou (Member) commented May 28, 2024:

Started a discussion about extension types here: https://lists.apache.org/thread/9xo3mp4n23p8psmrhso5t9q899vxwfjt

@paleolimbot (Member):

This seems to be another argument in favor of an "extension type" facility in Parquet.

A Parquet extension mechanism seems natural to me based on my previous work with Arrow extension types (which I think are great!), but it is also worth mentioning that the implementation choices are the same either way: implementations can choose to ignore GEOMETRY types, treat them as their storage type, add some infrastructure to inject support for reading/writing/pushing down filters into statistics, or build basic support for those things into the implementation. The ability to treat unknown logical types (extension or not) as their storage type, and the ability to interact with Parquet files at a lower level, are good things for an implementation to have regardless of the extension capability of the format.

- remove file-level geo stats
- add custom wkb-encoded geometry stats
- comment out controversial items
@wgtmac (Member Author) commented May 31, 2024:

I have just updated the PR with the following changes:

  • Commented out S2Index, H3Index and the native single-geometry encodings. They are left here to demonstrate possible future extensions.
  • Added a custom WKB-encoded geometry struct (as suggested by @paleolimbot) which could be used to represent coverings including bbox, S2, H3, etc.
  • Removed file-level geo stats, as parquet generally does not support file-level stats for other types.
  • Added explicit crs and edges fields to the new logical type.

@paleolimbot (Member) left a comment:

Thank you!

There is room for more precisely documenting some of the parameters (e.g., using language from the GeoParquet specification); however, I think others may have some comments regarding exactly which parameters are here and so I will leave those comments for if/when there is an agreement in principle 🙂

Comment on lines 287 to 290
/** S2 spatial index: http://s2geometry.io/ */
struct S2Index {
/** Level of S2 cell ids. valid range is [0, 30] */
1: required i32 level;
Member:

It may be worth removing this since it's no longer in the statistics?

Also: To my knowledge, level is embedded in the S2/H3 cell ID, and not all cell IDs in a covering need to have the same level (this is a useful property).
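A small sketch of that property for S2 (hypothetical helper; assumes a valid, non-zero cell id): the level is recoverable from the position of the id's lowest set bit, so a covering needs no separate level field.

```java
// S2 cell ids encode a 3-bit face and 2 bits per level, terminated by a
// sentinel lowest set bit; the maximum level is 30.
final class S2CellLevel {
  static int level(long cellId) {
    return 30 - (Long.numberOfTrailingZeros(cellId) >> 1);
  }

  public static void main(String[] args) {
    long leafCell = 1L;                   // sentinel in bit 0: a leaf cell
    System.out.println(level(leafCell)); // prints 30
  }
}
```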

Comment on lines 282 to 285
union Envelope {
1: BoundingBox bbox // A bounding box of geometries if it can be built.
2: Geometry covering // A covering polygon of geometries if bbox is unavailable.
}
Member:

To avoid overloading various terms that are used elsewhere in the geospatial world, you probably want:

struct Covering {
  1: optional BoundingBox bbox;  // A bounding box of geometries if it can be built.
  2: optional Geometry covering; // A covering polygon of geometries if bbox is unavailable.
}

Unless there's some technical limitation that requires providing only one of these, maybe use a struct?

(The term Envelope is defined somewhere in the simple features/SQL spec as very specifically the POLYGON representation of the bounding box.)

Member:

Also, if the optional Geometry covering is controversial we can drop it 🙂

@wgtmac (Member Author) commented Jun 13, 2024:

Thanks @paleolimbot for the detailed feedback! I have changed the name and removed the controversial items. I will start a PoC implementation in parquet-java later this week if there is no strong objection.

* Edges of the geometry if it is a polygon. It may be different from the
* edges attribute of the GEOMETRY logical type.
*/
2: optional Edges edges;
Contributor:

might be worth specifying whether this is logically required or if a default is assumed.

@wgtmac (Member Author) commented Jun 16, 2024:

I'm inclined to make it required, because the geometry here is a polygon anyway, which has this attribute. Though the writer that generates these geometry statistics is required to fill the Edges field, it makes things easier for the reader in that it does not need to derive this attribute from the logical type. Sometimes the values in the geometry column do not contain any polygon value, in which case the logical type may not have the Edges field. cc @paleolimbot

Member:

I think having this be required and explicitly marked as planar is appropriate (even if the column contains all points, this is a general-purpose geometry type). Explicitly marking it is better than having two ways to specify planar edges (by omission and by explicitly setting the value).

Converting between planar and spherical edges should always be an explicit operation (that happens to be very easy for points).

Member Author:

Is it better to use a bool is_spherical instead of an enum Edges edges, given that it has only two options? @paleolimbot

Member:

That's a great question...a boolean leaves no room for a new edge interpolation mechanism to exist (and there is one, ellipsoidal edges, that nobody uses in practice but does exist).

}

struct Covering {
optional BoundingBox bbox // A bounding box of geometries if it can be built.
Contributor:

Please clarify whether these are mutually exclusive on a covering.

Member Author:

No, they can exist at the same time, especially when an implementation wants to generate different coverings together with the bbox. Actually, I believe it is better to avoid this level of nesting to eliminate the confusion.

* coordinates from each axis. Values of Z and M are omitted for 2D geometries.
*/
struct BoundingBox {
1: required double xmin;
Contributor:

Is a flat structure used here to avoid the thrift serialization overhead of defining a nested MinMax struct or something similar?

Member Author:

Yes, just want to avoid another level of nesting.

1: optional Covering covering;

/**
* The geometry types of all geometries, or an empty array if they are not
Contributor:

Does an empty array, an unpopulated geometry_types field, or both indicate unknown?

Member Author:

I copied it from the GeoParquet specs: An empty array explicitly signals that the geometry types are not known.

@paleolimbot (Member) commented Jun 17, 2024:

I think that you can make the geometry types required here, since the statistics themselves are optional. It is not possible to calculate a WKB bounding box without inspecting the uint32_t that defines both the geometry type and dimensions.

In the GeoParquet spec the bbox is optional in the metadata, and so there needed to be a route to allow for the situation where a writer didn't want to make a pass over the input. The canonical route to that here would be omitting the statistics.

Member Author:

geometry_types in GeoParquet is a list of strings. Should we use an enum instead? However, we would need to define items for different dimensions, e.g. "POINT" and "POINT Z".

Member:

An integer might make more sense if text is not appropriate (text is nice for JSON metadata but perhaps not for Parquet implementations). Since this is a WKB type, the integer used to encode the geometry type and dimensions is very easy to extract: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary .

The integers used in WKB (the ISO variety, not EWKB) have the property that type_id / 1000 is either 0, 1, 2, or 3 and type_id % 1000 is the geometry_type.
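A minimal Java sketch of that extraction (hypothetical helper; assumes ISO WKB, where byte 0 selects the byte order and bytes 1-4 hold the type id):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Reads the 5-byte WKB header: 1 byte-order byte, then a uint32 type id.
final class WkbHeader {
  static int typeId(byte[] wkb) {
    ByteOrder order = (wkb[0] == 1) ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
    return ByteBuffer.wrap(wkb, 1, 4).order(order).getInt();
  }

  static int geometryType(int typeId) { return typeId % 1000; } // 1=POINT ... 7=GEOMETRYCOLLECTION
  static int dimensions(int typeId)   { return typeId / 1000; } // 0=XY, 1=XYZ, 2=XYM, 3=XYZM
}
```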

Member Author:

It sounds like we can directly define an enum like the one below:

enum GeometrySubType {
  POINT = 1;
  LINESTRING = 2;
  POLYGON = 3;
  MULTIPOINT = 4;
  MULTILINESTRING = 5;
  MULTIPOLYGON = 6;
  GEOMETRYCOLLECTION = 7;

  POINT_Z = 1001;
  LINESTRING_Z = 1002;
  POLYGON_Z = 1003;
  MULTIPOINT_Z = 1004;
  MULTILINESTRING_Z = 1005;
  MULTIPOLYGON_Z = 1006;
  GEOMETRYCOLLECTION_Z = 1007;

  POINT_M = 2001;
  LINESTRING_M = 2002;
  POLYGON_M = 2003;
  MULTIPOINT_M = 2004;
  MULTILINESTRING_M = 2005;
  MULTIPOLYGON_M = 2006;
  GEOMETRYCOLLECTION_M = 2007;

  POINT_ZM = 3001;
  LINESTRING_ZM = 3002;
  POLYGON_ZM = 3003;
  MULTIPOINT_ZM = 3004;
  MULTILINESTRING_ZM = 3005;
  MULTIPOLYGON_ZM = 3006;
  GEOMETRYCOLLECTION_ZM = 3007;
}

If the above list is too verbose and hard to maintain, we can simply use list<i32> to hold the integer codes.

* Additional informative metadata.
* It can be used by GeoParquet to offload some of the column metadata.
*/
4: optional string metadata;
@emkornfield (Contributor) commented Jun 14, 2024:

should this be string or binary?

Member Author:

Good question; binary sounds better, as it allows custom encodings.

@emkornfield (Contributor):

@wgtmac sorry for the late review; mostly comments on being more explicit about semantics in some places.
