Fix recursive flattening failure with combined primary and foreign key#2312
Fix recursive flattening failure with combined primary and foreign key#2312
Conversation
…ign key Co-authored-by: krlmlr <1741643+krlmlr@users.noreply.github.com>
Co-authored-by: krlmlr <1741643+krlmlr@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes a failure in dm_flatten_to_tbl(.recursive = TRUE) when an intermediate table column is simultaneously a primary key and a foreign key, by changing how recursive joins are executed and adding a regression test for the reported model shape.
Changes:
- Reworked recursive flatten join execution from a
reduce2()pipeline to an explicit loop with pre-join column checks/mapping. - Added a test reproducing the combined PK/FK recursive flatten failure (issue #2234).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
R/flatten.R |
Alters recursive flatten join strategy and adds logic intended to map renamed/missing join columns. |
tests/testthat/test-flatten.R |
Adds regression coverage for recursive flattening when a column is both PK and FK. |
| if (squash && length(order_df$name) > 0) { | ||
| # For recursive flattening, use an iterative approach to handle | ||
| # cases where columns serve as both PK and FK | ||
| result <- tbl_impl(prep_dm, start) | ||
|
|
||
| for (i in seq_along(order_df$name)) { | ||
| table_to_join <- ordered_table_list[[i]] | ||
| join_by <- by[[i]] | ||
|
|
||
| # Additional safety check: verify join columns exist | ||
| result_cols <- colnames(result) | ||
| left_cols <- names(join_by) | ||
| right_cols <- colnames(table_to_join) | ||
| join_right_cols <- unname(join_by) | ||
|
|
||
| # Check if all required columns exist | ||
| missing_left <- setdiff(left_cols, result_cols) | ||
| missing_right <- setdiff(join_right_cols, right_cols) | ||
|
|
||
| if (length(missing_left) > 0) { | ||
| # This is where the error likely occurs | ||
| # For recursive joins, we might need to handle column name mapping | ||
| # Look for disambiguated versions of missing columns | ||
| mapped_join_by <- join_by | ||
| for (missing_col in missing_left) { | ||
| # Look for pattern like "original_name.table_suffix" | ||
| pattern_matches <- result_cols[grepl(paste0("^", missing_col, "\\."), result_cols)] | ||
| if (length(pattern_matches) > 0) { | ||
| # Replace the missing column name with the disambiguated version | ||
| # Use the first match if multiple exist | ||
| names(mapped_join_by)[names(mapped_join_by) == missing_col] <- pattern_matches[1] | ||
| } | ||
| # If no pattern matches found, keep the original name and let the join fail with a clear error | ||
| } | ||
| join_by <- mapped_join_by | ||
| } | ||
|
|
||
| result <- join(result, table_to_join, by = join_by) | ||
| } |
There was a problem hiding this comment.
The recursive-join fix only tries to remap missing LHS join columns to disambiguated variants like c.<suffix>. In the reported scenario (and in the new test), the column is missing because the previous join drops the RHS join key when by maps different names (e.g. b = c), so there is no c (or c.*) to remap. As a result, the subsequent join can still fail with the same "Join columns in x must be present" error. Consider instead (a) preserving RHS join keys (keep = TRUE) when they are needed for later joins and then dropping them once no longer needed, or (b) tracking equivalence from earlier by specs (e.g. c is equivalent to b after the b = c join) and rewriting later by clauses accordingly.
There was a problem hiding this comment.
Replaced the regex-based approach with proper column equivalence tracking. After each join where lhs_col != rhs_col, we record rhs_col → lhs_col. Before each subsequent join, any missing LHS keys are remapped through this equivalence map. This correctly handles the reported case without needing keep = TRUE. (commit 8efe212)
| # Look for pattern like "original_name.table_suffix" | ||
| pattern_matches <- result_cols[grepl(paste0("^", missing_col, "\\."), result_cols)] |
There was a problem hiding this comment.
grepl(paste0("^", missing_col, "\\."), result_cols) treats missing_col as a regex. Column names can contain regex metacharacters (e.g. +, (, [), which can make the match incorrect and cause wrong join key mapping. Prefer a literal prefix check such as startsWith(result_cols, paste0(missing_col, ".")) (or otherwise escape missing_col for regex).
| # Look for pattern like "original_name.table_suffix" | |
| pattern_matches <- result_cols[grepl(paste0("^", missing_col, "\\."), result_cols)] | |
| # Look for pattern like "original_name.table_suffix" using a literal prefix match | |
| pattern_matches <- result_cols[startsWith(result_cols, paste0(missing_col, "."))] |
There was a problem hiding this comment.
The regex-based approach has been removed entirely. The fix now uses column equivalence tracking, which avoids any string pattern matching altogether. (commit 8efe212)
| # Check if all required columns exist | ||
| missing_left <- setdiff(left_cols, result_cols) | ||
| missing_right <- setdiff(join_right_cols, right_cols) | ||
|
|
There was a problem hiding this comment.
missing_right is computed but never used, which makes the new join loop harder to follow. Either remove it or use it to produce a clearer error when the RHS join columns are missing.
There was a problem hiding this comment.
Removed missing_right along with the rest of the old approach. (commit 8efe212)
|
|
||
| test_that("`dm_flatten_to_tbl(.recursive = TRUE)` works with combined primary and foreign key", { | ||
| # Test for issue where a column serves as both PK and FK | ||
| skip_if_src_not("df") # Only test on data frame source for now |
There was a problem hiding this comment.
This test is currently restricted to the df backend, but the flattening code path is shared across sources and this file already runs flatten tests for duckdb too. Consider expanding this to skip_if_src_not(c("df", "duckdb")) (or whatever set is supported) so the regression is caught for other backends.
| skip_if_src_not("df") # Only test on data frame source for now | |
| skip_if_src_not(c("df", "duckdb")) |
There was a problem hiding this comment.
Updated to skip_if_src_not(c("df", "duckdb")). (commit 8efe212)
| # Expected result should have all columns from x, y, and z | ||
| expected <- tibble( | ||
| a = 1L, | ||
| b = 2L, | ||
| d = 3L, | ||
| f = 4L | ||
| ) |
There was a problem hiding this comment.
The comment says the expected result should have all columns from x, y, and z, but expected omits the key columns (c and e). This is a bit misleading—either update the comment to reflect that join keys are dropped in the flattened output, or include the missing columns in expected if they are supposed to be present.
There was a problem hiding this comment.
Updated the comment to: "Join keys from the RHS (c and e) are dropped by dplyr; only non-key columns remain." (commit 8efe212)
|
@copilot apply changes based on the comments in this thread . Also fix now failing tests. |
Co-authored-by: krlmlr <1741643+krlmlr@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
Too big and too unclear. |
missing_rightvariable and flawed regex-based column remappingskip_if_src_not(c("df", "duckdb"))💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.