Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(r): Refactor ArrowArray(Stream) -> R Vector conversion #392

Draft
wants to merge 36 commits into
base: main
Choose a base branch
from

Conversation

paleolimbot
Copy link
Member

The initial version of converting ArrowArray (or stream of them) to R is implemented in C and is difficult to understand. Not only is this difficult because of the verbose C, the dispatch portion is implemented almost completely twice (once for a single array, once for an array stream). It is at a point currently where it is difficult for me, let alone an external contributor, to add features or fix bugs. Time to refactor!

This approach uses C++ classes/virtual method dispatch to handle the different types of vector conversions. This is similar to how the arrow R package does this except the Arrow R package uses the Arrow C++ converter infrastructure/heavy templating to do dispatch. Here we use a switch() and eat the per-batch and per-column virtual method call.

Work in progress!

@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 35.44474% with 479 lines in your changes are missing coverage. Please review.

Project coverage is 86.09%. Comparing base (c7a1236) to head (2afc7b8).

Files Patch % Lines
r/src/vctr_builder_int.h 2.98% 65 Missing ⚠️
r/src/vctr_builder_int64.h 0.00% 59 Missing ⚠️
r/src/vctr_builder_dbl.h 10.52% 51 Missing ⚠️
r/src/vctr_builder.cc 73.21% 45 Missing ⚠️
r/src/vctr_builder_difftime.h 37.50% 40 Missing ⚠️
r/src/vctr_builder_other.h 11.11% 40 Missing ⚠️
r/src/vctr_builder_base.h 30.35% 39 Missing ⚠️
r/src/vctr_builder_lgl.h 5.26% 36 Missing ⚠️
r/src/vctr_builder_chr.h 5.88% 32 Missing ⚠️
r/src/vctr_builder_blob.h 4.00% 24 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #392      +/-   ##
==========================================
- Coverage   88.74%   86.09%   -2.66%     
==========================================
  Files          81       97      +16     
  Lines       14398    15091     +693     
==========================================
+ Hits        12778    12993     +215     
- Misses       1620     2098     +478     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

paleolimbot added a commit that referenced this pull request May 15, 2024
…461)

This PR adds the `nanoarrow_vctr`, which is an R translation of the
Python `Array` class in nanoarrow's Python bindings. This is implemented
like an R `factor()` in the sense that under the hood it is a sequence
of integers (`0:(array$length - 1)` at the beginning) with attributes
that give those integers context.

This is implemented in such a way that it is "tacked on" to the existing
conversions. The existing conversions do need a refactoring (
#392 ), but that is a
heavy change for this point in the release cycle.

The only change needed to the existing conversion was a slight refactor
of the "consume array stream" code that correctly gave each array in the
stream its own R object to manage its lifecycle (before each array was
"materialized" and then immediately released because no previous
conversion code required an ArrowArray to live beyond the conversion.

The motivation for this change is converting GeoArrow extension types.
In the geoarrow package, we implement an efficient conversion from a
stream of arrays to various types of R-spatial objects (e.g., sf);
however, we really don't want to invoke the default conversion for those
types because they have awful performance (e.g., the multipolygon would
be a `list(list(list(data.frame))))`) and there's no need to invoke that
number of R object conversions between the initial state (an arrow
array) and the final state (an sfc column). The nanoarrow_vctr allows
something like:

```r
df <- convert_array(some_array_containing_a_geoarrow_col)
st_as_sfc(df$geometry)  # or s2::as_s2_geography(df$geometry), or something else
```

A side-effect of this change is that we have an escape hatch for
conversions that are lossy or contain types with no R equivalent.

A quick demo:

``` r
library(nanoarrow)

arrays <- lapply(
  list(1:5, 6:10, 11:13),
  as_nanoarrow_array
)

# A vctr can be created from any stream
(vctr <- as_nanoarrow_vctr(basic_array_stream(arrays)))
#> <nanoarrow_vctr int32[13]>
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13

# Under the hood this is something like a factor() where levels are
# a list of arrays with cached offsets. This is like an Arrow ChunkedArray
str(vctr)
#> <nanoarrow_vctr int32[13]>
#> List of 3
#>  $ :<nanoarrow_array int32[5]>
#>   ..$ length    : int 5
#>   ..$ null_count: int 0
#>   ..$ offset    : int 0
#>   ..$ buffers   :List of 2
#>   .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#>   ..$ dictionary: NULL
#>   ..$ children  : list()
#>  $ :<nanoarrow_array int32[5]>
#>   ..$ length    : int 5
#>   ..$ null_count: int 0
#>   ..$ offset    : int 0
#>   ..$ buffers   :List of 2
#>   .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `6 7 8 9 10`
#>   ..$ dictionary: NULL
#>   ..$ children  : list()
#>  $ :<nanoarrow_array int32[3]>
#>   ..$ length    : int 3
#>   ..$ null_count: int 0
#>   ..$ offset    : int 0
#>   ..$ buffers   :List of 2
#>   .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. ..$ :<nanoarrow_buffer data<int32>[3][12 b]> `11 12 13`
#>   ..$ dictionary: NULL
#>   ..$ children  : list()

# vctrs can be sliced:
head(vctr)
#> <nanoarrow_vctr int32[6]>
#> [1] 1 2 3 4 5 6

# ...and can live in a data.frame
head(tibble::tibble(x = vctr))
#> # A tibble: 6 × 1
#>   x         
#>   <nnrrw_vc>
#> 1 1         
#> 2 2         
#> 3 3         
#> 4 4         
#> 5 5         
#> 6 6

# They can be used as zero-copy conversion targets
array <- as_nanoarrow_array(1:5)
convert_array(array, nanoarrow_vctr())
#> <nanoarrow_vctr int32[5]>
#> [1] 1 2 3 4 5

# ...also works in a nested ptype
array <- as_nanoarrow_array(data.frame(x = 1:5))
convert_array(array, tibble::tibble(x = nanoarrow_vctr()))
#> # A tibble: 5 × 1
#>   x         
#>   <nnrrw_vc>
#> 1 1         
#> 2 2         
#> 3 3         
#> 4 4         
#> 5 5

# For nested list output, it will give a slice of the original array for
# each list item
array <- as_nanoarrow_array(
  list(1:5, 6:10, NULL, 11:13),
  schema = na_list(na_int32())
)

(lst_of <- convert_array(array, vctrs::list_of(nanoarrow_vctr())))
#> <list_of<nanoarrow_vctr>[4]>
#> [[1]]
#> <nanoarrow_vctr int32[5]>
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> <nanoarrow_vctr int32[5]>
#> [1]  6  7  8  9 10
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> <nanoarrow_vctr int32[3]>
#> [1] 11 12 13
for (i in seq_along(lst_of)) {
  array <- attr(lst_of[[i]], "chunks")[[1]]
  cat(sprintf("offset: %d, length: %d\n", array$offset, array$length))
}
#> offset: 0, length: 5
#> offset: 5, length: 5
#> offset: 10, length: 3
```

<sup>Created on 2024-05-10 with [reprex
v2.1.0](https://reprex.tidyverse.org)</sup>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants