Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use arrow instead of arrow2 in the example #518

Merged
merged 1 commit into from
Jan 29, 2024
Merged

Conversation

lnicola
Copy link
Member

@lnicola lnicola commented Jan 27, 2024

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

Closes #516

@lnicola
Copy link
Member Author

lnicola commented Jan 27, 2024

r? @kylebarron

@lnicola lnicola changed the title Use arrow instead of arrow2 in the example Use arrow instead of arrow2 in the example Jan 27, 2024
Copy link
Contributor

@kylebarron kylebarron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I'll make a PR to connect to this from geoarrow-rs too, and then I can have an example of reading a full layer into a GeoArrow table.

//! implementation of Arrow.
//! As of this writing (Jan 2024), there are two competing low-level Arrow libraries in Rust.
//! [`arrow`](https://github.com/apache/arrow-rs) is the "official" one, while
//! [`arrow2`](https://github.com/jorgecarleitao/arrow2) is a less active alternative.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding a link here

Suggested change
//! [`arrow2`](https://github.com/jorgecarleitao/arrow2) is a less active alternative.
//! [`arrow2`](https://github.com/jorgecarleitao/arrow2) is a [less active](https://github.com/jorgecarleitao/arrow2/issues/1429) alternative.


#[cfg(any(major_ge_4, all(major_is_3, minor_ge_6)))]
fn run() -> gdal::errors::Result<()> {
use arrow2::array::{BinaryArray, StructArray};
use arrow2::datatypes::DataType;
use arrow::array::{Array as _, BinaryArray};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what's the benefit of as _? Is that to signify that it isn't used as an object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it lets you call the trait methods, but doesn't import the Array name. Makes it a little clearer that it's a trait IMHO.

Comment on lines +52 to +57
let geom_column_index = arrow_stream_reader
.schema()
.column_with_name("wkb_geometry")
.unwrap()
.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the most reliable way to access the geometry column is. Another option is to look for the ogc.wkb extension name on the geometry field.

As of GDAL 3.8 the user can also pass in GEOMETRY_METADATA_ENCODING=GEOARROW (see doc). That assigns geoarrow.wkb to the extension name instead of ogc.wkb, but also includes the CRS as PROJJSON onto the Arrow extension metadata, so the user doesn't have to separately fetch the layer's CRS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also curious: can GDAL ever read two geometry columns?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL OGR_L_GetGeometryColumn.

I'm also curious: can GDAL ever read two geometry columns?

Normally yes, not sure about Arrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL OGR_L_GetGeometryColumn.

Yeah... but in general with GeoArrow we need to have a manner outside the GDAL-specific API to know the geometry column.

Normally yes, not sure about Arrow.

Interesting. I should test GDAL on a file with multiple geometries and see what happens

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... but in general with GeoArrow we need to have a manner outside the GDAL-specific API to know the geometry column.

Well, I don't know much about Arrow, but that sounds like something that GeoArrow should clarify, not us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In GeoArrow the preferred way to do this is via the extension metadata, not a column's name. I just didn't recall exactly how GDAL implements this, and whether simply checking for the wkb_geometry name is sufficient

Comment on lines +66 to +68
// Downcast it to a `BinaryArray`
let binary_array = geom_column.as_any().downcast_ref::<BinaryArray>().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the exact circumstances but I thought that sometimes GDAL would create an Arrow array with i64 offsets instead of i32 offsets. Maybe this is only in the rare case where the array is very large? I thought I hit a bug related to this, but it was probably in my own code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah, why is that even a thing 😕. I guess the offset size is part of the data type?

Anyway, we're keeping the old behavior, so I'm not sure how to improve this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as-is is fine to merge.

With Arrow, a list array has a values buffer plus a buffer of offsets. If the values buffer has more than 2^31 items (signed ints to support Java) then you can't represent those values. Usually the preferred way to fix this is to use batching, so that you never have 2^31 values in a single contiguous array, but Arrow also defines "large" nested array types, which use int64 offsets instead of int32.

I don't recall if GDAL will ever create an array with int64 offsets

@lnicola lnicola merged commit 1b8ba56 into georust:master Jan 29, 2024
9 checks passed
@lnicola lnicola deleted the arrow branch January 29, 2024 17:03
@kylebarron
Copy link
Contributor

Thanks!

kylebarron added a commit to geoarrow/geoarrow-rs that referenced this pull request Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update OGR arrow example to use arrow-rs
2 participants