Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

list sub-class with format() method prints full contents #5948

Open
jesse-smith opened this issue Feb 21, 2024 · 10 comments · May be fixed by #6637
Open

list sub-class with format() method prints full contents #5948

jesse-smith opened this issue Feb 21, 2024 · 10 comments · May be fixed by #6637

Comments

@jesse-smith
Copy link

jesse-smith commented Feb 21, 2024

I ran into this when converting a nested tibble (i.e. with a list column of tibbles) to a data.table. Normally, list columns print with something like <tibble[3x1]> in data.table; however, these columns printed as a string displaying the entirety of the nested tibbles contents (in my case, ~20k rows of data per tibble). This appears to be due to a check for the existence of a format() method for the column type in data.table:::format_col(). In this case, the list columns were vctrs_list_of class, which implements its own format.vctrs_list_of() method. See below reprex for an example.

Fixing this is easy enough (see below reprex for proposed solution), but it would change the default printing behavior of list-cols. Any list subclass would then have to implement a format_list_item() method to get special treatment. On the other hand, any list subclass could then implement that method and get special treatment.

To me, defaulting to the standard list print behavior is an improvement, given that format() methods for list-like classes are generally not going to product output suitable for a column in a data.table (which is why format_list_item() is needed in the first place).

Reprex is below, along with proposed solution. I'm happy to file a PR but wanted to make sure the change in defaults is acceptable in principle first. Thanks!

# Setup -----------------------------------------------------------------------

# Use development version of {data.table}
data.table::update_dev_pkg()
#> R data.table package is up-to-date at 8f8ef9343dcabefa9e4cb0af4251cbb74dae9f55 (1.15.99)
# Attach internals
ns <- asNamespace("data.table")
name <- format(ns)
attach(ns, name = name, warn.conflicts = FALSE)

# Create example data
dt <- data.table(
  list_col = list(data.table(a = 1:3), data.table(a = 4:6)),
  list_of_col = vctrs::list_of(data.table(a = 1:3), data.table(a = 4:6))
)


# Problem ---------------------------------------------------------------------
# `format_col()` does not format `list_of` columns from {vctrs} as lists

# Print `data.table`
print(dt)
#>             list_col     list_of_col
#>               <list> <vctrs_list_of>
#> 1: <data.table[3x1]>         1, 2, 3
#> 2: <data.table[3x1]>         4, 5, 6

# Format individually
format_col(dt$list_col)
#> [1] "<data.table[3x1]>" "<data.table[3x1]>"
format_col(dt$list_of_col)
#> [1] "1, 2, 3" "4, 5, 6"


# Solution --------------------------------------------------------------------

# Current function definition:
# format_col.default = function(x, ...) {
#   if (!is.null(dim(x)))
#     "<multi-column>"
#   else if (has_format_method(x) && length(formatted<-format(x, ...))==length(x))
#     formatted
#   else if (is.list(x))
#     vapply_1c(x, format_list_item, ...)
#   else
#     format(char.trunc(x), ...)
# }

# Swapping the order of the `else if` statements in `format_col()` will fix the issue
# `format_col_updated()` formats `list_of` columns from {vctrs} as lists
format_col_updated = function(x, ...) {
  if (!is.null(dim(x)))
    "<multi-column>"
  else if (is.list(x)) # Now comes before `has_format_method()` check
    vapply_1c(x, format_list_item, ...)
  else if (has_format_method(x) && length(formatted<-format(x, ...))==length(x))
    formatted
  else
    format(char.trunc(x), ...)
}

# Replace as default method
registerS3method("format_col", "default", format_col_updated)

# Print `data.table`
print(dt)
#>             list_col       list_of_col
#>               <list>   <vctrs_list_of>
#> 1: <data.table[3x1]> <data.table[3x1]>
#> 2: <data.table[3x1]> <data.table[3x1]>

# Print individually
# Still formatted as list
format_col_updated(dt$list_col)
#> [1] "<data.table[3x1]>" "<data.table[3x1]>"
# Now also formatted as list
format_col_updated(dt$list_of_col)
#> [1] "<data.table[3x1]>" "<data.table[3x1]>"

# Clean up --------------------------------------------------------------------
detach(name, character.only = TRUE)
rm(list = ls())

Created on 2024-02-21 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31 ucrt)
#>  os       Windows 11 x64 (build 22631)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2024-02-21
#>  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.2)
#>  data.table    1.15.99 2024-02-21 [1] local
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.2)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.2)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.2)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.2)
#>  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
#>  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.2)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.2)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.2)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.1)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.1)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.3.2)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.2)
#>  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.2)
#>  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.2)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
#>  styler        1.10.2  2023-08-29 [1] CRAN (R 4.3.2)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.2)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.3.2)
#>  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.2)
#> 
#>  [1] D:/ProgramFiles/R/R-4.3.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@tdhock
Copy link
Member

tdhock commented Feb 21, 2024

Hi Jesse, thanks for writing. Is there a way to fix your issue in a way that maintains backward compatibility with the existing print output? One of the guiding principles of data.table is stability/back-compatibility https://github.com/Rdatatable/data.table/blob/master/GOVERNANCE.md#the-r-package so it would be much easier to accept a code contribution that does not change what is printed.

@jesse-smith
Copy link
Author

Hey Toby, thanks for getting back. The current behavior is actually not backwards compatible - list sub-classes print identically to bare lists through v1.14.10, and it's only in v1.15.0 and later that this "print everything" behavior occurs. The proposal is essentially to revert to the previous behavior, because there are cases where the current behavior is pathological, and I'm not sure the current behavior was actually intended (I can't find anything in NEWS that mentions it).

@r2evans
Copy link
Contributor

r2evans commented Oct 28, 2024

This formats differently because getS3method("format", class="vctrs_list_of") is defined whereas getS3method("format", class="list") is not. My #6593 does not resolve this.

This may be a frustrating issue. I like the notion that format.data.table honors format.<col-class> if found, but the fact that vctrs::format.vctrs_list_of is simply a wrapper around format.default seems less useful (to me).

If you go down the road of over-riding for this class, then pandora's box of "override this class and that class and that other class" may not be desired, plus what happens if/when vctrs changes its format S3 methods.

While frustrating, I wonder if the better path to resolve this aesthetic request is to request that vctrs::format.vctrs_list_of does something different, in which case it will self-resolve within the data.table ecosystem.

@MichaelChirico
Copy link
Member

cc @DavisVaughan

@DavisVaughan
Copy link
Contributor

I wonder if the better path to resolve this aesthetic request is to request that vctrs::format.vctrs_list_of does something different

I don't think so, no. At the end of the day, we are simply following format.list(), and it seem reasonable that format(<list>) and format(<list-of>) do exactly the same thing.

On the tibble side, both pillar:::pillar_shaft.list() and pillar:::pillar_shaft.vctrs_list_of() exist for controlling how they are printed.

I do not think it is out of the question for you to provide format_col.vctrs_list_of() that gets conditionally registered using an .onLoad hook for the vctrs package. Kind of like this, where in vctrs we specify that "when dplyr is loaded, go ahead and register a vec_restore() method for grouped_df"
https://github.com/r-lib/vctrs/blob/78d9f2b0b24131b5ce2230eb3c2c9f93620b10d9/R/zzz.R#L33-L36

@r2evans
Copy link
Contributor

r2evans commented Oct 28, 2024

I'm good with including that in my current PR (if that doesn't get too complicated). If yes, I'm assuming dt "should" look like this:

dt
#             list_col       list_of_col
#               <list>      <vctrs_list>
# 1: <data.table[3x1]> <data.table[3x1]>
# 2: <data.table[3x1]> <data.table[3x1]>

I added the header class, should it be vctrs_list_of, vctrs_vctr, or just list?

@DavisVaughan
Copy link
Contributor

For data.table users I imagine it might make the most sense if you have the full class, since this probably isn't a super common class to see in the data.table world, and in that case the full explicit name is easier to understand than hiding some class name details.

We have some tooling for showing list<type> where type is an abbreviated name of the class the container is for
Screenshot 2024-10-28 at 3 57 34 PM

@r2evans
Copy link
Contributor

r2evans commented Oct 28, 2024

Okay. This is starting to become a little more than "add some rider code to a PR", so let me know (@MichaelChirico ) if this is out of scope to be appended (to an already-approved PR).

I think we can do this:

@@ -105,6 +105,14 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
       expression = "<expr>", ordered = "<ord>")
     classes = classes1(x)
     abbs = unname(class_abb[classes])
+    abb_na = is.na(abbs)
+    if (any(abb_na) && any(classes[abb_na] %in% c("vctrs_list_of", "vctrs_vctr"))) {
+      if (!requireNamespace("vctrs", quietly=TRUE)) {
+        warningf("Some columns are class 'vctrs_list_of' but package vctrs is not available. Those columns will print as regular objects. Simplify install.packages('vctrs') to obtain the vctrs_list_of format method and print the data again.")
+      } else {
+        abbs[abb_na] <- paste0("<", vapply_1c(x, vctrs::vec_ptype_abbr)[abb_na], ">")
+      }
+    }
     if ( length(idx <- which(is.na(abbs))) ) abbs[idx] = paste0("<", classes[idx], ">")
     toprint = rbind(abbs, toprint)
     rownames(toprint)[1L] = ""
@@ -207,6 +215,10 @@ format_col.default = function(x, ...) {
     format(char.trunc(x), ...) # relevant to #37
 }
 
+format_col.vctrs_list_of = function(x, ...) {
+  vapply_1c(x, format_list_item, ...)
+}
+
 # #2842 -- different columns can have different tzone, so force usage in output
 format_col.POSIXct = function(x, ..., timezone=FALSE) {
   if (timezone) {

and

@@ -199,6 +199,7 @@ export(format_col)
 S3method(format_col, default)
 S3method(format_col, POSIXct)
 S3method(format_col, expression)
+S3method(format_col, vctrs_list_of)
 export(format_list_item)
 S3method(format_list_item, default)
 S3method(format_list_item, data.frame)

Which would result in

dt
#             list_col       list_of_col
#               <list>    <list<dt[,1]>>
# 1: <data.table[3x1]> <data.table[3x1]>
# 2: <data.table[3x1]> <data.table[3x1]>

@r2evans
Copy link
Contributor

r2evans commented Nov 9, 2024

@MichaelChirico thoughts on that?

@MichaelChirico
Copy link
Member

That will induce a Suggests dependency on {vctrs} so it'll be a "no".

Turns out there is a trivial fix for the bug at hand: #6637.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants