Skip to content

Passing named lists to .SDcols / .SD #5020

@grantmcdermott

Description

@grantmcdermott

Is there any scope/appetite for supporting multiple .SD and .SDcols?

Motivation: I frequently encounter situations where I need to perform different aggregation tasks on distinct column groups. One group of columns will be aggregated as means, another group will be aggregated as medians, yet another group will be aggregated as sums, etc. In these cases, only one group can be passed through the convenience features of .SD(cols), while the other group(s) must all be aggregated manually.

Here's a simple (and somewhat ill-advised) example that illustrates the mechanics:

library(data.table)
d = as.data.table(iris)

d[, 
  c(lapply(.SD, sum),
    list(Petal.Length = mean(Petal.Length), Petal.Width = mean(Petal.Width))), 
 .SDcols = patterns("^Sepal"), 
 by = Species]
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:     setosa        250.3       171.4        1.462       0.246
#> 2: versicolor        296.8       138.5        4.260       1.326
#> 3:  virginica        329.4       148.7        5.552       2.026

Here the summed Sepal columns get the .SD convenience treatment, while I have to manually take the mean of the Petal columns separately (and name them; see also #1227 (comment)).

My proposal is to allow something like this instead:

d[, 
  c(lapply(.SD, sum), lapply(.SD2, mean)), 
  .SDcols = patterns("^Sepal"), .SDcols2 = patterns("^Petal"),
  by = Species]

I'm assuming here that you can match the relevant subsets based on the index (.SD2 => .SDcols2). If this is easy to do for one additional subset, then in principle it seems possible for any additional subsets (.SD3 => .SDcols3, etc). Of course, this may impose some small overhead that doesn't pass the cost-benefit test.

Feel free to close if this seems undesirable / too much work. Thanks for considering!

Update. Related: #1063 (comment) and possibly #4970

Proposed solution in the former is:

s_cols = grep("^Sepal", names(d), value = TRUE)
p_cols = grep("^Petal", names(d), value = TRUE)
d[, 
  {
   SD = unclass(.SD)
   c(lapply(SD[s_cols], sum), lapply(SD[p_cols], mean))
  },
  .SDcols = c(s_cols, p_cols),
  by = Species]
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:     setosa        250.3       171.4        1.462       0.246
#> 2: versicolor        296.8       138.5        4.260       1.326
#> 3:  virginica        329.4       148.7        5.552       2.026
SessionInfo

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

Metadata

Metadata

Assignees

No one assigned

    Labels

    programmingparameterizing queries: get, mget, eval, envtop requestOne of our most-requested issues

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions