Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2471: Add support for geometry logical type #1379

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Jun 19, 2024

This is the PoC of apache/parquet-format#240

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Assert.assertNotNull(geometryStatistics);
System.out.println("GeometryStatistics");
System.out.println(geometryStatistics);
System.out.println();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just finished the PoC to read/write geometry values in the form of WKB and verify the GeometryStatistics have been collected as expected. This is the output of this minimal test.

Footer
ParquetMetaData{FileMetaData{schema: message msg {
  required binary col_geom (GEOMETRY(WKB,PLANAR,EPSG:4326));
}
, metadata: {writer.model.name=example}}, blocks: [BlockMetaData{2, 73, rowIndexOffset = 0 [ColumnMetaData{UNCOMPRESSED [col_geom] required binary col_geom (GEOMETRY(WKB,PLANAR,EPSG:4326))  [BIT_PACKED, PLAIN], 4}]}]}

Statistics
min: POINT (1 1), max: POINT (2 2), num_nulls: 0

GeometryStatistics
GeometryStatistics{boundingBox=BoundingBox{xMin=1.0, xMax=2.0, yMin=1.0, yMax=2.0, zMin=NaN, zMax=NaN, mMin=NaN, mMax=NaN}, covering=Covering{geometry=POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1)), edges=PLANAR}, geometryTypes=GeometryTypes{types=[1]}}

@szehon-ho @jiayuasu @paleolimbot @emkornfield

Please let me know what do you think. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help take an initial review to see if this looks promising? @gszadovszky @julienledem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, I am not really familiar with this new geometry logical type. So, my comments are more general.

I can see a TODO for the comparator but I cannot see any specification in the format PR about the total ordering. (BTW, I cannot see a proper spec in LogicalTypes.md. It is for the other PR, though.)

Could you please double-check the licensing of org.locationtech.jts whether we are able to distribute parquet-java as is? I am not expert of this, though.

I think, it is not a good idea to include external objects (org.locationtech.jts.geom.Geometry) into our public API. In our public API we should stick to the primitive type (Binary).

Otherwise, it looks good to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gszadovszky for a quick review! Please see my inline comments below.

I can see a TODO for the comparator but I cannot see any specification in the format PR about the total ordering.

I thought we'd better make the ColumnOrder of geometry logical type as undefined and do not populate min/max values. Here is the related line in the format PR: https://github.com/apache/parquet-format/pull/240/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1083

Could you please double-check the licensing of org.locationtech.jts whether we are able to distribute parquet-java as is?

It has dual licenses:

I have also checked Apache Sedona which includes org.locationtech.jts/jts-core in its dependency: https://github.com/apache/sedona/blob/119682321f5301e6ef776315a8f2cd53a8ff63b0/pom.xml#L168-L188. So I assume we are safe here.

@jiayuasu may provide more context as the domain expert.

I think, it is not a good idea to include external objects (org.locationtech.jts.geom.Geometry) into our public API. In our public API we should stick to the primitive type (Binary).

I agree that it is not a good idea to expose org.locationtech.jts.geom.Geometry as a public API. Let me switch it to internal method for now. If downstream projects like Apache Iceberg (which has customized column writer implementation) could benefit from directly passing a org.locationtech.jts.geom.Geometry to build stats without unnecessary conversion, we can revisit this later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the answers, @wgtmac!
The TODO comment about the comparator can be removed, then. With the moving of the Geometry references into non-public scopes I'm OK with this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please let me know what do you think

This looks good to me! (I left some geometry-specific notes on the implementation)

I have also checked Apache Sedona which includes org.locationtech.jts/jts-core in its dependency

I believe Calcite also uses JTS for its spatial component and it may be worth checking there to see to what extent those types are used in various APIs and/or if it's ever caused any issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty confident that JTS is fully compatible with ASF license: https://www.apache.org/legal/resolved.html

This is also why Sedona added it years ago


public class GeometryStatistics {

private final BoundingBox boundingBox;
Copy link
Member Author

@wgtmac wgtmac Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeometryStatistics is something that I can hardly find a way to fit into an extension type. Please let me know if you have any better idea. @pitrou

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look through for geometry things...this is very cool!

if (typeId == UNKNOWN_TYPE_ID) {
return UNKNOWN_TYPE_ID;
}
Coordinate coordinate = geometry.getCoordinate();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that a Geometry doesn't have a getDimensions or similar

Comment on lines +87 to +89
for (Coordinate coordinate : coordinates) {
update(coordinate.getX(), coordinate.getY(), coordinate.getZ(), coordinate.getM());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that JTS doesn't have an optimized method for this (it almost certainly has at least an internal one for building indexes internally)

Comment on lines 837 to 838
formatStats.setGeometry_types(
new ArrayList<Integer>(stats.getGeometryTypes().getTypes()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth sorting this here (if the hash set used in the geometry types has non-deterministic ordering)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Fixed.

try {
if (geometry != EMPTY) {
Geometry envelope = reader.read(geometry.array());
geometry = ByteBuffer.wrap(writer.write(envelope.union(geom).getEnvelope()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little unclear as to the utility of the EnvelopeCovering...it seems like the optimal strategy would be to always merge BoundingBoxes and build a geometry at the end. A union in a loop is not usually efficient and should perhaps be more difficult to accidentally do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Perhaps I can call Envelope.expandToInclude(Envelope) to maintain a running envelope.

https://locationtech.github.io/jts/javadoc/org/locationtech/jts/geom/Envelope.html#expandToInclude-org.locationtech.jts.geom.Envelope-

@wgtmac
Copy link
Member Author

wgtmac commented Jun 22, 2024

Thanks @paleolimbot for the quick review! I've addressed your feedback and also added page index support for geometry stats. BTW, it is a convention that we need PoC on another Parquet implementation to move forward the proposal. I plan to add similar things to parquet-cpp. Do you happen to know any JTS-parity on the C++ side? Or do you have a plan to implement the PoC? I guess there should be a lot of existing code on the GeoParquet side.

@jiayuasu
Copy link
Member

@wgtmac The JTS-parity on C++ side is GEOS, which is a C++ port of JTS: https://github.com/libgeos/geos

Note that: GEOS is LGPL 2.1, which is NOT compatible with ASF

return mMax;
}

void update(Geometry geom) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct way to update Geometry in JTS. You need to use CoordinateSequenceFilter. Sedona has many examples here: https://github.com/apache/sedona/blob/0a1db3d35fd7b5721301aa3ce5d39aa8cd828661/common/src/main/java/org/apache/sedona/common/Functions.java#L294

public class BoundingBox {

private double xMin = Double.MAX_VALUE;
private double xMax = Double.MIN_VALUE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the init value of xMax, yMax, ..., please use -Double.MAX_VALUE because Double.MIN_VALUE is not guaranteed to be negative. See https://stackoverflow.com/questions/3884793/why-is-double-min-value-in-not-negative

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY are also valid choices for initial min/max values.

For extracting the bounds for each ordinate, org.locationtech.jts.io.twkb.BoundsExtractor in the JTS code base is also a good example.

new Coordinate(envelope.getMaxX(), envelope.getMinY()),
new Coordinate(envelope.getMinX(), envelope.getMinY())
});
geometry = ByteBuffer.wrap(writer.write(polygon));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pay attention to a trap here: JTS WKB writer has a bug that does not write M value. There is a PR on JTS to fix this but it hasn't been merged: locationtech/jts#734.

I think we also need to be a bit more explicit here about which WKB standard we are supporting.

Standard WKB format does not allow Z and M dimension. ISO WKB allow Z and M dimension. JTS follows ISO WKB but its writer has the aforementioned bug. See this article from GEOS: https://libgeos.org/specifications/wkb/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advise! BTW, it seems that JTS envelope has only X and Y axises. I'm not sure if this is another issue here.

@jiayuasu
Copy link
Member

Core sedona contributors @Kontinuation and @zhangfengcdt , can you also review this?

@wgtmac Thanks for putting together this PR. I guess you might need some help from the Sedona community regarding the Java geometry implementation together with JTS. We can work together to implement this POC.

@wgtmac
Copy link
Member Author

wgtmac commented Jun 23, 2024

Thanks @jiayuasu for your valuable feedback! Yes, it would be great if we can get direct help from geospatial experts!

@paleolimbot
Copy link
Member

Do you happen to know any JTS-parity on the C++ side?

As Jia mentioned, it's GEOS, but this is not something the C++ implementation can/should take on as a dependency.

I guess there should be a lot of existing code on the GeoParquet side.

GDAL has quite a lot to draw from ( https://github.com/OSGeo/gdal/tree/master/ogr/ogrsf_frmts/parquet ), although I think pretty much all we need to do is walk WKB for bounding boxes and geometry type ( https://github.com/OSGeo/gdal/blob/12dab86ca2d8b1a65c4c085e137c62294682ac1d/ogr/ogr_wkb.cpp#L387-L584 ).

Or do you have a plan to implement the PoC?

I am happy to give it a shot (I'll make a start this week).

public class BoundingBox {

private double xMin = Double.MAX_VALUE;
private double xMax = Double.MIN_VALUE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY are also valid choices for initial min/max values.

For extracting the bounds for each ordinate, org.locationtech.jts.io.twkb.BoundsExtractor in the JTS code base is also a good example.

Comment on lines +51 to +57
Geometry polygon = factory.createPolygon(new Coordinate[] {
new Coordinate(envelope.getMinX(), envelope.getMinY()),
new Coordinate(envelope.getMinX(), envelope.getMaxY()),
new Coordinate(envelope.getMaxX(), envelope.getMaxY()),
new Coordinate(envelope.getMaxX(), envelope.getMinY()),
new Coordinate(envelope.getMinX(), envelope.getMinY())
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, EnvelopeCovering repliates BoundingBox, so can we reuse BoundingBox here to handle all ordinates correctly?

If we still use geom.getEnvelopeInternal, we can replace this with a shorter version factory.toGeometry(envelope).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EnvelopeCovering is just a possible implementation of the Covering. The idea behind putting Covering and BoundingBox at the same time originates from the suggestion of apache/parquet-format#240 (comment) and apache/parquet-format#240 (comment)

Comment on lines +1162 to +1169
if (crs != null && !crs.isEmpty()) {
sb.append(",");
sb.append(crs);
}
if (metadata != null) {
sb.append(",");
sb.append(metadata);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a high probability that crs itself contains comma, so this may introduce ambiguity to the generated type parameters.

metadata will be appended as something like java.nio.HeapByteBuffer[pos=_ lim=_ cap=_] to the type parameter. Does this have any bad implications?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely something that I need to improve :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants