Support translations of simple data.table functions #45098

Open
MichaelChirico opened this issue Dec 23, 2024 · 1 comment

@MichaelChirico
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

Related: #39822.
Context: MichaelChirico/funchir#18

Goal: have the acero engine translate some {data.table} functions into {arrow} executions.

#39822 is much more ambitious since it requires implementing translation for the most complex part of the {data.table} API, namely [.

This request should be much simpler to satisfy -- at a minimum, translate fcase(), but we should also be able to do fifelse() and fcoalesce(). These have {dplyr} equivalents case_when(), if_else(), and coalesce(), respectively.
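To make the correspondence concrete, here is a minimal sketch of the three pairs side by side on plain R vectors. It assumes {data.table} and {dplyr} (>= 1.1.0, for `.default`) are installed; the vectors `x` and `y` and the result names are made up for illustration.

```r
# Illustrative inputs (hypothetical, not from the issue)
x <- c(1L, 5L, 10L)
y <- c(1L, NA, 3L)

# fcase() takes alternating condition/value arguments;
# case_when() takes the same pairs as `condition ~ value` formulas.
fcase_out     <- data.table::fcase(x < 3L, "low", x < 8L, "mid", default = "high")
case_when_out <- dplyr::case_when(x < 3L ~ "low", x < 8L ~ "mid", .default = "high")

# fifelse() and if_else() share the (test, yes, no) argument order.
fifelse_out <- data.table::fifelse(x > 3L, "big", "small")
if_else_out <- dplyr::if_else(x > 3L, "big", "small")

# fcoalesce() and coalesce() both fill NA from the later arguments.
fcoalesce_out <- data.table::fcoalesce(y, 0L)
coalesce_out  <- dplyr::coalesce(y, 0L)
```

The main structural difference a translation would have to handle is the argument layout of fcase() versus the formula interface of case_when(); the other two pairs line up almost argument for argument.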

My assumption here is that acero works by static analysis -- read the AST, apply known translations, i.e. analogous to {dbplyr}. If there's something deeper going on then this is just a duplicate of #39822.

I can help prepare a PR but don't really have a machine suitable for that available until next year. Filing this issue first as a sanity check & to gauge interest.

Component(s)

R

@jonkeane
Member

My assumption here is that acero works by static analysis -- read the AST, apply known translations, i.e. analogous to {dbplyr}

Yup, at a high level this is what's going on. You might have already found these, but here are some pointers around case_when that might be helpful: we register the binding, which does some validation and then calls arrow_eval on each of the case formulas. This arrow_eval creates arrow expressions (you might not need it if you're not operating on expressions that reference columns in a data.frame(-like) object). Then the arrow expression itself is returned:

Expression$create(
  "case_when",
  args = c(
    Expression$create(
      "make_struct",
      args = query,
      options = list(field_names = as.character(seq_along(query)))
    ),
    value
  )
)

If fcase() doesn't need the tidy evaluation semantics for selecting columns, the binding might be as simple as just using a similar Expression$create() on the expressions themselves (there's probably a bit more to that, but much of the complication with the dplyr bindings is getting the tidy evaluation working).

The tests for case_when use a helper called compare_dplyr_binding, which in this case is a little ill-named, but should work so long as the input data.frame is .input in the test code.

All of that said, because we are looking through the AST to find bindings, if the AST of fcase consisted of expressions that already have bindings (either base R or even a remap to case_when), it should work with arrow without needing any changes in the arrow package itself.

Something like the following should work:

fcase_arrow <- function(..., default = NA) {
  dots <- rlang::list2(...)
  # for each pair of arguments, make a formula expression
  formulae <- lapply(seq(1, length(dots), by = 2), function(i) rlang::expr(dots[[!!i]] ~ dots[[!!(i + 1)]]))
  dplyr::case_when(!!!formulae, .default = default)
}

You could also build up the acero expressions directly with Expression$create("case_when", ...) rather than using rlang's expression and dots management there.
