Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R]: geoarrow_vctr class #70

Open
anthonynorth opened this issue Oct 26, 2023 · 17 comments
Open

[R]: geoarrow_vctr class #70

anthonynorth opened this issue Oct 26, 2023 · 17 comments

Comments

@anthonynorth
Copy link
Contributor

Add minimal geoarrow_vctr class for geoarrow vectors. Enables integration with other packages wk, sf, geos etc.

See #69 (comment)

@anthonynorth
Copy link
Contributor Author

I can make a start on this piece if you're taking PRs.

Are we implementing a wk_vctr or are we taking a dependency on {vctrs} and implementing a vctrs_vctr?

Should sf::st_as_sf.geoarrow_vctr etc be provided by this package, similar to what was done in {wk}?

@paleolimbot
Copy link
Contributor

I would love a PR! The basic minimum is something that can be put into a data.frame() or tibble() and implements wk_handle() + st_as_sfc() + as_nanoarrow_array_stream() (i.e., don't worry about subsetting if that causes grief).

I would probably use wk_vctr for now unless you run into problems. There are some low-level spatial developers that are in the 'no dependencies ever' camp and it doesn't take much overhead to implement the basics.

I think sf::st_as_sf.geoarrow_vctr() should live in this package for now.

@paleolimbot
Copy link
Contributor

Are you still interested in having a stab at this? I have some time to take a look at this this week and am happy to give it a go (but also happy to defer to any work you've done already if you're interested 🙂 )

@anthonynorth
Copy link
Contributor Author

anthonynorth commented Nov 8, 2023

I had intended to action this (almost) 2 weeks ago.

I am still interested in having a stab at this, but I've been focused on other projects.

I haven't implemented anything yet.

@paleolimbot
Copy link
Contributor

No worries! I haven't gotten to it yet either. Just post a note here if/when you do (I'll do the same).

@anthonynorth
Copy link
Contributor Author

Should geoarrow() replicate the behaviour in paleolimbot/geoarrow, where geoarrow::geoarrow() can accept any handleable? This makes geoarrow() pretty much an alias of as_geoarrow().

Or, should geoarrow() instead require a geoarrow array? This would be similar behaviour to the vectors in {wk}.

# untested example for wk-like
# need a suitable default for x
geoarrow <- function(x = as_geoarrow_array(wk::wkb()), crs = wk::wk_crs_auto(), geodesic = FALSE) {
  validate_geoarrow_array(x) # doesn't exist
  new_geoarrow(x)
}

new_geoarrow <- function(x = as_geoarrow_array(wk::wkb()), crs = wk::wk_crs_auto(), geodesic = FALSE) {
  structure(x, class = c("geoarrow_vctr", "wk_vctr"), crs = crs, geodesic = geodesic)
}

as_geoarrow <- function(x) {
  new_geoarrow(as_geoarrow_array(x), wk::wk_crs(x), wk::wk_is_geodesic(x))
}

@paleolimbot
Copy link
Contributor

Or, should geoarrow() instead require a geoarrow array?

I think the stricter option is better, here, or you could omit the function since I don't know exactly how it would get used. It might be idiomatic to do something like vec_cast(something, geoarrow()), but even then it's not quite enough because a cast would require a stronger type?

@anthonynorth
Copy link
Contributor Author

anthonynorth commented Nov 15, 2023

I think the stricter option is better

I agree.

A couple of usecases for a geoarrow() constructor:

  • wk::wk_translate(something, geoarrow())
  • vctrs::vec-_cast(something geoarrow()) # your example above
  • Converting a geoarrow array in an arrow table, to a geoarrow_vctr. Doable with as_geoarrow(), but this should just work with geoarrow() and new_geoarrow()

A little off topic: a minimal implementation should be printable. I think we need to subset the geoarrow_vctr to achieve this without wasting a lot of memory?

@paleolimbot
Copy link
Contributor

Those uses, do make sense...maybe it should be restricted to creating just a zero-size version for the "make me a ptype" use case?

I think we need to subset the geoarrow_vctr to achieve this without wasting a lot of memory?

It would require a slice, anyway. You can use nanoarrow_array_modify() to make a shallow copy that reduces the length of whatever the last chunk is. Feel free to punt that to another PR...I have a few functions I started on while on a plane a few weeks ago that may help. I'll try to put them up as a draft PR shortly.

@paleolimbot
Copy link
Contributor

Ok, #75 . It's almost all about how to resolve a chunk using a binary search and various ways to access that. Maybe it would help if I polished up that and removed the vctr bits?

@anthonynorth
Copy link
Contributor Author

Those uses, do make sense...maybe it should be restricted to creating just a zero-size version for the "make me a ptype" use case?

Maybe. I don't have a strong opinion on it. This is the same pattern used in s2 and geos.

Ok, #75 . It's almost all about how to resolve a chunk using a binary search and various ways to access that. Maybe it would help if I polished up that and removed the vctr bits?

I skimmed and saw only 1 trivial bug that can't be reached with the public api currently. I don't know that we need to remove the vctr bits. We can build from there

@anthonynorth
Copy link
Contributor Author

How should we deal with non-contiguous slices? We'll need this for data.frame manipulation.

Should we shallow copy, allowing indices to be any integer vector within range? I think this could potentially create many nanoarrow::nanoarrow_array_modify() chunks.

Or should we clone the slice?

vec <- geoarrow::as_geoarrow_vctr(wk::xy(1:5, 1:5))
vec[c(1, 3)]
#> Error in `[.geoarrow_vctr`(vec, c(1, 3)): Can't subset geoarrow_vctr with non-slice (e.g., only i:j indexing is supported)

Created on 2023-12-05 with reprex v2.0.2

@paleolimbot
Copy link
Contributor

How should we deal with non-contiguous slices?

The easiest way would be to convert to an arrow::ChunkedArray, subset that, and then recreate the vctr. I think that's what the previous version did. Someday I might get to implementing a filter or take specific to geoarrow-encoded arrays to remove the optional arrow dependency (or maybe we can use something wrapping arrow-rs to do the subset).

We'll need this for data.frame manipulation.

Maybe...it could also just be that it's a placeholder for now (although maybe the error message should reflect that: "must cast geoarrow_vctr column to native vector type before subsetting" or something). Notably, it should "just work" when somebody calls st_as_sf() on a data.frame containing one of those columns (but this has to exist on CRAN before I can PR that into sf).

@paleolimbot
Copy link
Contributor

Or, as you noted, which would totally work, you could walk the indices searching for the minimum number of slices that can represent the new indices, which is potentially a very large number of slices. That will probably be slow and I would maybe rather error until it can be fast!

@anthonynorth
Copy link
Contributor Author

anthonynorth commented Dec 6, 2023

I agree, walking the indices and making slices is very complicated and potentially more costly than a copy.

An optional dependency on arrow is probably safe. You could create a geoarrow_vctr without having arrow installed, but you'd have to start from something in R first to do that, I think? Seems unlikely to me.

Somewhat related and a bit of a left-field idea, is there value in a wk::wk_handle_indices() which could support this usecase in geoarrow, plus other formats?

wk_handle_indices(handleable, geoarrow_writer(), indices) |>
  as_geoarrow_vctr()

@paleolimbot
Copy link
Contributor

Is there value in a wk::wk_handle_indices() which could support this usecase in geoarrow, plus other formats?

Possibly! The previous incarnation of geoarrow did something like that: https://github.com/paleolimbot/geoarrow/blob/master/src/geoarrow-handle-wk.cpp#L290-L315 . I am not sure in retrospect that it was a good idea for the specific use case of the vctr...it did allow for randomly sorting a vctr, but it also kept possibly large quantities of data from being freed until the "take" operation was resolved somehow. I think R users are more likely to be able to reason about an eager "take".

@anthonynorth
Copy link
Contributor Author

I had imagined wk_handle_indices() would be performing a copy. The difficulty here, I think, is that wk readers are reading in order and we don't want that here. Re-writing readers to allow for arbitrary indices is a big change, I think?

My initial thinking was to implement this via a buffered filter, which holds a copy of each geometry in indices until we've read enough to pass onto the next handler. We only need to buffer while there's geometries we haven't yet seen which must be passed before what we have read (e.g. c(5, 1)). Seems to be a similar idea to this, just with wk flavour.

Worst case (reverse sort) this is 2 copies of the vector, so quite inefficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants