-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
Is there any scope/appetite for supporting multiple .SD and .SDcols?
Motivation: I frequently encounter situations where I need to perform different aggregation tasks on distinct column groups. One group of columns will be aggregated as means, another group will be aggregated as medians, yet another group will be aggregated as sums, etc. In these cases, only one group can be passed through the convenience features of .SD(cols), while the other group(s) must all be aggregated manually.
Here's a simple (and somewhat ill-advised) example that illustrates the mechanics:
library(data.table)
d = as.data.table(iris)
d[,
c(lapply(.SD, sum),
list(Petal.Length = mean(Petal.Length), Petal.Width = mean(Petal.Width))),
.SDcols = patterns("^Sepal"),
by = Species]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: setosa 250.3 171.4 1.462 0.246
#> 2: versicolor 296.8 138.5 4.260 1.326
#> 3: virginica 329.4 148.7 5.552 2.026Here the summed Sepal columns get the .SD convenience treatment, while I have to manually take the mean of the Petal columns separately (and name them; see also #1227 (comment)).
My proposal is to allow something like this instead:
d[,
c(lapply(.SD, sum), lapply(.SD2, mean)),
.SDcols = patterns("^Sepal"), .SDcols2 = patterns("^Petal"),
by = Species]I'm assuming here that you can match the relevant subsets based on the index (.SD2 => .SDcols2). If this is easy to do for one additional subset, then in principle it seems possible for any additional subsets (.SD3 => .SDcols3, etc). Of course, this may impose some small overhead that doesn't pass the cost-benefit test.
Feel free to close if this seems undesirable / too much work. Thanks for considering!
Update. Related: #1063 (comment) and possibly #4970
Proposed solution in the former is:
s_cols = grep("^Sepal", names(d), value = TRUE)
p_cols = grep("^Petal", names(d), value = TRUE)
d[,
{
SD = unclass(.SD)
c(lapply(SD[s_cols], sum), lapply(SD[p_cols], mean))
},
.SDcols = c(s_cols, p_cols),
by = Species]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: setosa 250.3 171.4 1.462 0.246
#> 2: versicolor 296.8 138.5 4.260 1.326
#> 3: virginica 329.4 148.7 5.552 2.026SessionInfo
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.13.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base