diff --git a/man/data.table.Rd b/man/data.table.Rd
index 658c57234..da932047b 100644
--- a/man/data.table.Rd
+++ b/man/data.table.Rd
@@ -111,7 +111,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
  \item or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}
  }

- \emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
+ \emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. Note that for rows in \code{i} with no match, the group of matching rows in \code{x} is empty. Special symbols that operate on rows (e.g., \code{.I} or \code{.N}) will therefore evaluate to \code{0} for such groups. This differs from selecting a column from \code{x} (e.g., \code{x$col}), which results in \code{NA} as governed by the \code{nomatch} argument. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.

  \emph{Advanced:} In the \code{X[Y, j]} form of grouping, the \code{j} expression sees variables in \code{X} first, then \code{Y}. We call this \emph{join inherited scope}.
  If the variable is not in \code{X} or \code{Y} then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.}
@@ -320,6 +320,13 @@ DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsets-as-joins
 DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same
 DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.()

+# Why .I is 0 for non-matching rows with by=.EACHI:
+d1 = data.table(v = c("A", "B", "C", "A", "C"), val = 1:5)
+d2 = data.table(v = c("D", "A", "G", "C"))
+# Selecting a column 'val' returns NA for non-matches, per `nomatch=NA`
+d1[d2, on = .(v), .(val), by = .EACHI]
+d1[d2, on = .(v), .I, by = .EACHI]
+
 # joins as subsets
 X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
 X
diff --git a/vignettes/datatable-joins.Rmd b/vignettes/datatable-joins.Rmd
index 3d7cf8c5c..606b7f8b9 100644
--- a/vignettes/datatable-joins.Rmd
+++ b/vignettes/datatable-joins.Rmd
@@ -259,6 +259,42 @@ dt2 = ProductReceived[
 identical(dt1, dt2)
 ```

+##### Understanding `j` Evaluation with `by=.EACHI` for Non-Matches
+
+A common point of confusion arises when using special symbols like `.I` in `j` with `by=.EACHI`. The behavior for non-matching rows differs from what you might expect when selecting a regular column.
+
+Let's illustrate with a simple example:
+```{r by-eachi-special-symbols}
+d1 = data.table(v = c("A", "B", "C", "A", "C"), i_col = 1:5)
+d2 = data.table(v = c("D", "A", "G", "C"))
+```
+
+*Case 1: Selecting a regular column*
+
+When we select a column from `x` (`d1`), non-matching rows from `i` (`d2`) result in `NA`. This is the standard behavior governed by `nomatch = NA`.
+```{r}
+d1[d2, on = .(v), .(i_col), by = .EACHI]
+```
+
+For the rows `D` and `G` in `d2`, there is no matching row in `d1`, so the value for `i_col` is missing (`NA`).
+
+*Case 2: Evaluating the special symbol `.I`*
+
+However, when we use the special symbol `.I`, non-matching rows evaluate to `0`.
+```{r}
+d1[d2, on = .(v), .I, by = .EACHI]
+```
+
+The difference comes down to what `j` is doing in each case:
+- In Case 1, we are performing a value lookup. A failed lookup results in a missing value (`NA`).
+- In Case 2, we are performing an evaluation. The symbol `.I` is defined as "the row indices in `x` for the current group". For non-matching rows like `D`, the group of matching rows in `d1` is empty. The set of indices for an empty group is `integer(0)`. `data.table` represents this zero-length result as a single `0` in the output.
+
+This logic is consistent with other special symbols like `.N` (the number of rows in a group), which also evaluates to `0` for non-matching groups.
+
+```{r}
+d1[d2, on = .(v), .N, by = .EACHI]
+```
+
 #### 3.1.4. Joining based on several columns

 So far we have just joined `data.table`s based on 1 column, but it's important to know that the package can join tables matching several columns.