From d0a4f3732f4e742a2506f797dcfb7bfa6e5ce2ac Mon Sep 17 00:00:00 2001
From: Toby Robertson
Date: Tue, 22 Jul 2025 20:40:51 +0100
Subject: [PATCH 1/6] "Revise ?mergelist Details re. join-from/join-to (fixes #7190)"

---
 man/mergelist.Rd | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/man/mergelist.Rd b/man/mergelist.Rd
index 56e0d8703a..fa1d786acd 100644
--- a/man/mergelist.Rd
+++ b/man/mergelist.Rd
@@ -26,17 +26,22 @@
 
   Merging is performed sequentially from "left to right", so that for \code{l} of 3 tables, it will do something like \code{merge(merge(l[[1L]], l[[2L]]), l[[3L]])}. \emph{Non-equi joins} are not supported. Column names to merge on must be common in both tables on each merge.
 
-  Arguments \code{on}, \code{how}, \code{mult}, \code{join.many} could be lists as well, each of length \code{length(l)-1L}, to provide argument to be used for each single tables pair to merge, see examples.
+  Arguments \code{on}, \code{how}, \code{mult}, \code{join.many} may also be lists, each of length \code{length(l)-1L}, providing the argument to be used at each successive merge; see examples.
 
-  The terms \emph{join-to} and \emph{join-from} indicate which in a pair of tables is the "baseline" or "authoritative" source -- this governs the ordering of rows and columns.
+  Heuristically speaking, the \emph{join-from} table searches the \emph{join-to} table. More precisely:
+  \itemize{
+    \item{ \code{mult} determines the policy when a row of \emph{join-from} finds multiple matches in \emph{join-to}. }
+    \item{ When \code{on} is missing, the key of \emph{join-to} is used as the join column(s). }
+  }
   Whether each refers to the "left" or "right" table of a pair depends on the \code{how} argument:
   \enumerate{
-    \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-to} is \emph{RHS}, \emph{join-from} is \emph{LHS}. }
-    \item{ \code{how \%in\% c("inner", "full", "cross")}: \emph{LHS} and \emph{RHS} tables are treated equally, so that the terms are interchangeable. }
-    \item{ \code{how == "right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}. }
+    \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
+    \item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
+    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. }
+    \item{ \code{how == "cross"}: \code{mult} must be \code{"all"} and \code{on} is not used, so the terms are not relevant. }
   }
 
-  Using \code{mult="error"} will throw an error when multiple rows in \emph{join-to} table match to the row in \emph{join-from} table. It should not be used just to detect duplicates, which might not have matching row, and thus would silently be missed.
+  Using \code{mult="error"} will throw an error when a row in the \emph{join-from} table finds multiple matching rows in the \emph{join-to} table. It should not be used just to detect duplicates in \emph{join-to}, as these might not have a matching row in \emph{join-from}, and thus silently be missed.
 
   When not specified, \code{mult} takes its default depending on the \code{how} argument:
   \enumerate{
@@ -45,10 +50,10 @@
     \item{ When \code{how == "cross"}, \code{mult="all"}. }
   }
 
-  When the \code{on} argument is missing, it will be determined based \code{how} argument:
-  \enumerate{
-    \item{ When \code{how \%in\% c("left", right", "semi", "anti")}, \code{on} becomes the key column(s) of the \emph{join-to} table. }
-    \item{ When \code{how \%in\% c("inner", full")}, if only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to shorter key. }
+  Symmetrical \emph{join-from}/\emph{join-to} treatment of \emph{LHS} and \emph{RHS} when \code{how \%in\% c("inner", "full")} is as follows:
+  \itemize{
+    \item{ When \code{mult \%in\% c("first", "last", "error")}, then at distinct each value of the join column(s) the rows joined are respectively first-to-first, last-to-last, and only-to-only (or else an error). }
+    \item{ If only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
   }
 
   When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a \emph{right} outer join can be used to optimize the sequence of merges, see Examples.

From 9ddf391a07227cbe8a881de44983a0d0aed07e8e Mon Sep 17 00:00:00 2001
From: Toby Robertson <165572229+trobx@users.noreply.github.com>
Date: Wed, 23 Jul 2025 00:42:55 +0100
Subject: [PATCH 2/6] Update man/mergelist.Rd

Co-authored-by: Michael Chirico
---
 man/mergelist.Rd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/mergelist.Rd b/man/mergelist.Rd
index fa1d786acd..cf4f2e0b3e 100644
--- a/man/mergelist.Rd
+++ b/man/mergelist.Rd
@@ -31,7 +31,7 @@
   Heuristically speaking, the \emph{join-from} table searches the \emph{join-to} table. More precisely:
   \itemize{
     \item{ \code{mult} determines the policy when a row of \emph{join-from} finds multiple matches in \emph{join-to}. }
-    \item{ When \code{on} is missing, the key of \emph{join-to} is used as the join column(s). }
+    \item{ When \code{on} is missing, the join column(s) are determined by the key of \emph{join-to}. }
   }
   Whether each refers to the "left" or "right" table of a pair depends on the \code{how} argument:
   \enumerate{

From 5d2b60b98b850f179751e1cc84d9ec30b695f88f Mon Sep 17 00:00:00 2001
From: Toby Robertson
Date: Wed, 23 Jul 2025 03:40:57 +0100
Subject: [PATCH 3/6] Clearer(?) explanation of "mutual `mult`" in the symmetric case

---
 man/mergelist.Rd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/mergelist.Rd b/man/mergelist.Rd
index cf4f2e0b3e..688d632e42 100644
--- a/man/mergelist.Rd
+++ b/man/mergelist.Rd
@@ -52,7 +52,7 @@
 
   Symmetrical \emph{join-from}/\emph{join-to} treatment of \emph{LHS} and \emph{RHS} when \code{how \%in\% c("inner", "full")} is as follows:
   \itemize{
-    \item{ When \code{mult \%in\% c("first", "last", "error")}, then at distinct each value of the join column(s) the rows joined are respectively first-to-first, last-to-last, and only-to-only (or else an error). }
+    \item{ When \code{mult \%in\% c("first", "last", "error")}, then (respectively) the first, last, or only matching row on each side binds with the same on the other (and hence the merge is one-to-one). }
     \item{ If only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
   }
 

From 631a96b5fbce494dfcba89fca59f1492c9086b46 Mon Sep 17 00:00:00 2001
From: Toby Robertson <165572229+trobx@users.noreply.github.com>
Date: Wed, 23 Jul 2025 03:49:17 +0100
Subject: [PATCH 4/6] Update man/mergelist.Rd

Co-authored-by: Michael Chirico
---
 man/mergelist.Rd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/mergelist.Rd b/man/mergelist.Rd
index 688d632e42..a578ab40bc 100644
--- a/man/mergelist.Rd
+++ b/man/mergelist.Rd
@@ -37,7 +37,7 @@
   \enumerate{
     \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
     \item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
-    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. }
+    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that the \emph{join-from}/\emph{join-to} designation is not so instructive. }
     \item{ \code{how == "cross"}: \code{mult} must be \code{"all"} and \code{on} is not used, so the terms are not relevant. }
   }
 

From 73b60c065c95caa45510114a125c9137541a1061 Mon Sep 17 00:00:00 2001
From: Toby Robertson
Date: Wed, 23 Jul 2025 04:01:48 +0100
Subject: [PATCH 5/6] Reverting to the original PR pending discussion, after accidentally committing a suggestion.

---
 man/mergelist.Rd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/mergelist.Rd b/man/mergelist.Rd
index a578ab40bc..688d632e42 100644
--- a/man/mergelist.Rd
+++ b/man/mergelist.Rd
@@ -37,7 +37,7 @@
   \enumerate{
     \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
     \item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
-    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that the \emph{join-from}/\emph{join-to} designation is not so instructive. }
+    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. }
     \item{ \code{how == "cross"}: \code{mult} must be \code{"all"} and \code{on} is not used, so the terms are not relevant. }
   }
 

From 5e249b918bedbd83573366de470f722808efe50f Mon Sep 17 00:00:00 2001
From: Toby Robertson
Date: Thu, 24 Jul 2025 07:37:44 +0100
Subject: [PATCH 6/6] Move paras, further consolidate core join-from/join-to definition

---
 man/mergelist.Rd | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/man/mergelist.Rd b/man/mergelist.Rd
index 688d632e42..3e6822fd20 100644
--- a/man/mergelist.Rd
+++ b/man/mergelist.Rd
@@ -37,10 +37,16 @@
   \enumerate{
     \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
     \item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
-    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. }
+    \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated symmetrically, so that each is both \emph{join-from} and \emph{join-to}; see below. }
     \item{ \code{how == "cross"}: \code{mult} must be \code{"all"} and \code{on} is not used, so the terms are not relevant. }
   }
 
+  In Case 3, the symmetry is as follows:
+  \itemize{
+    \item{ When \code{mult \%in\% c("first", "last", "error")}, then (respectively) the first, last, or only matching row on each side binds with the same on the other. \code{mult} is satisfied mutually and the merge is one-to-one. }
+    \item{ If only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
+  }
+
   Using \code{mult="error"} will throw an error when a row in the \emph{join-from} table finds multiple matching rows in the \emph{join-to} table. It should not be used just to detect duplicates in \emph{join-to}, as these might not have a matching row in \emph{join-from}, and thus silently be missed.
 
   When not specified, \code{mult} takes its default depending on the \code{how} argument:
@@ -50,12 +56,6 @@
     \item{ When \code{how == "cross"}, \code{mult="all"}. }
   }
 
-  Symmetrical \emph{join-from}/\emph{join-to} treatment of \emph{LHS} and \emph{RHS} when \code{how \%in\% c("inner", "full")} is as follows:
-  \itemize{
-    \item{ When \code{mult \%in\% c("first", "last", "error")}, then (respectively) the first, last, or only matching row on each side binds with the same on the other (and hence the merge is one-to-one). }
-    \item{ If only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
-  }
-
   When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a \emph{right} outer join can be used to optimize the sequence of merges, see Examples.
 }
 \value{