German transliterations in `make_clean_names()` #534

dpprdan · 2023-03-14T12:22:35Z

make_clean_names() (and therefore clean_names() as well) currently does not support "german" transliterations with the default ascii = TRUE, contrary to their documentation. I.e. an ä should become ae, not a.

janitor::make_clean_names("qualität", transliterations = "german")
#> [1] "qualitat"
snakecase::to_any_case("qualität", transliterations = "german")
#> [1] "qualitaet"

In contrast:

janitor::make_clean_names("qualität", ascii = FALSE, transliterations = "german")
#> [1] "qualitaet"

This is because names get transliterated here

janitor/R/make_clean_names.R

Lines 108 to 116 in bb78f34

    transliterated_names <-
      if (ascii) {
        stringi::stri_trans_general(
          replaced_names,
          id=available_transliterators(c("Any-Latin", "Greek-Latin", "Any-NFKD", "Any-NFC", "Latin-ASCII"))
        )
      } else {
        replaced_names
      }

before they reach snakecase::to_any_case() here

janitor/R/make_clean_names.R

Lines 147 to 156 in bb78f34

    cased_names <-
      snakecase::to_any_case(
        made_names,
        case = case,
        sep_in = sep_in,
        transliterations = transliterations,
        parsing_option = parsing_option,
        numerals = numerals,
        ...
      )
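To make the order of operations concrete, here is a minimal sketch that replays the two steps by hand. It is a simplification and not janitor's actual code: it assumes a single "Latin-ASCII" pass stands in for the full transliterator chain quoted above, and the variable names are made up for illustration.

```r
# Sketch of the two-step order of operations (simplified; not janitor's actual code).
x <- "qualit\u00e4t" # "qualität", written with an escape to avoid encoding issues

# Step 1 (what ascii = TRUE does first): a Latin-ASCII pass strips the umlaut.
# Assumption: only "Latin-ASCII" is applied here, not the full chain of transliterators.
ascii_step <- stringi::stri_trans_general(x, id = "Latin-ASCII")
ascii_step
#> [1] "qualitat"

# Step 2: the "german" rule finds no umlaut left to rewrite as "ae".
after_ascii <- snakecase::to_any_case(ascii_step, transliterations = "german")
after_ascii
#> [1] "qualitat"

# Skipping step 1 lets the "german" rule see the umlaut.
without_ascii <- snakecase::to_any_case(x, transliterations = "german")
without_ascii
#> [1] "qualitaet"
```

Once the ASCII pass has run, the information needed for the German rule is gone, which matches the behaviour reported at the top of this issue.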

I suppose one could also argue that if ascii = FALSE umlauts should stay umlauts? 🤷‍♂️

janitor::make_clean_names("qualität", ascii = FALSE)
#> [1] "qualitat"

I guess this is due to the default transliterations = "Latin-ASCII" getting passed down to snakecase::to_any_case().

Wouldn’t ascii = FALSE imply a transliterations = NULL override?

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-03-14
#>  pandoc   3.1.1 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  cli           3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>  digest        0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  evaluate      0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.2.2)
#>  fs            1.6.1      2023-02-06 [1] CRAN (R 4.2.2)
#>  generics      0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  glue          1.6.2.9000 2023-01-16 [1] Github (tidyverse/glue@5a16502)
#>  htmltools     0.5.4      2022-12-07 [1] CRAN (R 4.2.2)
#>  janitor       2.2.0      2023-02-02 [1] CRAN (R 4.2.2)
#>  knitr         1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3      2022-10-07 [1] RSPM
#>  lubridate     1.9.2      2023-02-10 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  purrr         1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.20       2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  snakecase     0.11.0     2019-05-25 [1] CRAN (R 4.2.0)
#>  stringi       1.7.12     2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr       1.5.0      2022-12-02 [1] CRAN (R 4.2.2)
#>  styler        1.9.1      2023-03-04 [1] CRAN (R 4.2.2)
#>  tidyselect    1.2.0      2022-10-10 [1] RSPM
#>  timechange    0.2.0      2023-01-11 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.37       2023-01-31 [1] CRAN (R 4.2.2)
#>  yaml          2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


billdenney · 2023-03-14T14:05:36Z

@dpprdan, I understand the challenge here.

The make_clean_names() function works hard to give consistent results across platforms, locales, and R versions, as much as possible. Over recent revisions, I have been surprised by how challenging cross-platform and cross-locale standardization has been (see #492).

I agree that this is not working the way that I would have expected, given what is in the documentation. I think that the best fix is to revise the documentation to clarify the order in which changes are applied and the resulting limitations of the function.

In the documentation, we should also clarify that while the goal is interpretability, the higher-level goal is to provide consistent and usable names in R with commonly-used tools. That implies that it gives the same answer across locales, and that the answer provided is usable on the majority of keyboards (e.g. my American keyboard doesn't have an easy way to type an umlaut, nor do Indian, Japanese, and many other keyboards, whereas most keyboards allow for writing basic ASCII).

billdenney · 2023-03-18T20:07:30Z

@dpprdan, I've looked at this in more detail today, and your tracing of the issue is correct:

ascii = TRUE will remove umlauts before transliteration occurs
the default argument for transliterations also removes umlauts

The intents of the ascii and transliterations arguments are different (even if they aren't fully independent), so I would not want to change these two options for users. Also, changing this would cause a degree of backward incompatibility for existing users. The gain would be an admitted improvement in final fidelity, but not a categorical improvement.

For the documentation, I went in to add some text to clarify what happens, but then I saw that it is already there. If you look in the documentation page, it indicates "the order of operations..." (search higher in the page that you linked to).

I think that your work-around of janitor::make_clean_names("qualität", ascii = FALSE, transliterations = "german") is the best that is reasonable within the current code.
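For a whole data frame, the same work-around could look like the sketch below. The data frame and its column names are hypothetical, and it assumes that clean_names() forwards the extra arguments to make_clean_names(), as the documented `...` argument suggests. The output shown is what the make_clean_names() call earlier in this thread reported, in a UTF-8 locale.

```r
# Hypothetical example data; names are assigned afterwards so the umlaut survives.
df <- data.frame(a = 1:3, b = 4:6)
names(df) <- c("qualit\u00e4t", "id") # "qualität", "id"

# ascii = FALSE skips the early Latin-ASCII pass, so the "german" rule
# still sees the umlaut and can rewrite it as "ae".
cleaned <- janitor::clean_names(df, ascii = FALSE, transliterations = "german")
names(cleaned)
#> [1] "qualitaet" "id"
```

Note that with ascii = FALSE, any non-ASCII characters not covered by the chosen transliterations may remain in the names.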

dpprdan · 2023-03-20T09:39:20Z

@billdenney Thanks. I also played around with the code and, yeah, it's complicated.

FWIW I'd change the transliterations = "Latin-ASCII" to transliterations = NULL, because if ascii = TRUE then "Latin-ASCII" is applied implicitly already. If ascii = FALSE applying a transliteration to ASCII is probably not intended anyway, cf.

I suppose one could also argue that if ascii = FALSE umlauts should stay umlauts?

sfirke · 2023-03-20T15:09:27Z

Thanks for raising this @dpprdan and for investigating it @billdenney. Sounds like things are settled and I expect this conversation may be of use to future users investigating this behavior.

billdenney closed this as completed Mar 18, 2023

billdenney reopened this Mar 18, 2023

billdenney closed this as not planned Mar 18, 2023
