German transliterations in make_clean_names() #534

Closed
dpprdan opened this issue Mar 14, 2023 · 4 comments

dpprdan commented Mar 14, 2023

make_clean_names() (and therefore clean_names() as well) currently does not support "german" transliterations with the default ascii = TRUE, contrary to their documentation. I.e. an ä should become ae, not a.

janitor::make_clean_names("qualität", transliterations = "german")
#> [1] "qualitat"
snakecase::to_any_case("qualität", transliterations = "german")
#> [1] "qualitaet"

In contrast, with ascii = FALSE:

janitor::make_clean_names("qualität", ascii = FALSE, transliterations = "german")
#> [1] "qualitaet"

This is because names get transliterated here

transliterated_names <-
  if (ascii) {
    stringi::stri_trans_general(
      replaced_names,
      id = available_transliterators(c("Any-Latin", "Greek-Latin", "Any-NFKD", "Any-NFC", "Latin-ASCII"))
    )
  } else {
    replaced_names
  }

before they reach snakecase::to_any_case() here

cased_names <-
  snakecase::to_any_case(
    made_names,
    case = case,
    sep_in = sep_in,
    transliterations = transliterations,
    parsing_option = parsing_option,
    numerals = numerals,
    ...
  )
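
To make the ordering concrete, here is a minimal sketch that mimics the two steps by calling stringi and snakecase directly. It is a simplification and not janitor's actual implementation: it assumes that the single "Latin-ASCII" transliterator is enough to reproduce the effect, rather than the full list shown above.

```r
# Sketch of the order of operations described above (not janitor's actual code).
x <- "qualität"

# Step 1: with ascii = TRUE, names are reduced to ASCII first,
# which turns "ä" into a plain "a".
ascii_first <- stringi::stri_trans_general(x, id = "Latin-ASCII")
ascii_first
#> [1] "qualitat"

# Step 2: the "german" transliteration now finds no umlaut left to map to "ae".
snakecase::to_any_case(ascii_first, transliterations = "german")
#> [1] "qualitat"

# Applied to the untouched name, the "german" transliteration works as documented.
snakecase::to_any_case(x, transliterations = "german")
#> [1] "qualitaet"
```

Whichever step runs first decides the outcome, because the ASCII conversion destroys the information the "german" mapping needs.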

I suppose one could also argue that if ascii = FALSE umlauts should stay umlauts? 🤷‍♂️

janitor::make_clean_names("qualität", ascii = FALSE)
#> [1] "qualitat"

I guess this result is due to the default transliterations = "Latin-ASCII" being passed down to snakecase::to_any_case().

Wouldn’t ascii = FALSE imply a transliterations = NULL override?

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-03-14
#>  pandoc   3.1.1 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  cli           3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>  digest        0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  evaluate      0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.2.2)
#>  fs            1.6.1      2023-02-06 [1] CRAN (R 4.2.2)
#>  generics      0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  glue          1.6.2.9000 2023-01-16 [1] Github (tidyverse/glue@5a16502)
#>  htmltools     0.5.4      2022-12-07 [1] CRAN (R 4.2.2)
#>  janitor       2.2.0      2023-02-02 [1] CRAN (R 4.2.2)
#>  knitr         1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3      2022-10-07 [1] RSPM
#>  lubridate     1.9.2      2023-02-10 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  purrr         1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.20       2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  snakecase     0.11.0     2019-05-25 [1] CRAN (R 4.2.0)
#>  stringi       1.7.12     2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr       1.5.0      2022-12-02 [1] CRAN (R 4.2.2)
#>  styler        1.9.1      2023-03-04 [1] CRAN (R 4.2.2)
#>  tidyselect    1.2.0      2022-10-10 [1] RSPM
#>  timechange    0.2.0      2023-01-11 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.37       2023-01-31 [1] CRAN (R 4.2.2)
#>  yaml          2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
billdenney (Collaborator) commented

@dpprdan, I understand the challenge here.

The make_clean_names() function works hard to give consistent results across platforms, locales, and R versions, as much as possible. Over recent revisions, I have been surprised by how challenging cross-platform and cross-locale standardization has been (see #492).

I agree that this is not working the way that I would have expected, given what is in the documentation. I think that the best fix is to revise the documentation to clarify the order in which changes are applied and the resulting limitations of the function.

In the documentation, we should also clarify that while the goal is interpretability, the higher-level goal is to provide consistent and usable names in R with commonly-used tools. That implies that it gives the same answer across locales, and that the answer provided is usable on the majority of keyboards (e.g. my American keyboard has no easy way to type an umlaut, nor do Indian, Japanese, and many other keyboards, whereas most keyboards can type basic ASCII).

billdenney (Collaborator) commented

@dpprdan, I've looked at this in more detail today, and your tracing of the issue is correct:

  • ascii = TRUE will remove umlauts before transliteration occurs
  • the default argument for transliterations also removes umlauts

The ascii and transliterations arguments have different intents (even if they aren't fully independent), so I would not want to change these two options for users. Also, changing this would break backward compatibility for existing users to some degree, and while it would admittedly improve the fidelity of the final names, the improvement would not be categorical.

For the documentation, I went in to add some text to clarify what happens, but then I saw that it is already there. If you look in the documentation page, it indicates "the order of operations..." (search higher in the page that you linked to).

I think that your work-around of janitor::make_clean_names("qualität", ascii = FALSE, transliterations = "german") is the best that is reasonable within the current code.
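
For readers who need to keep the default ascii = TRUE, one alternative is to substitute the German characters before the names reach make_clean_names(). The sketch below assumes a hypothetical helper, german_to_ascii(), which is not part of janitor, and it only handles precomposed characters (an "a" followed by a combining diaeresis would not match).

```r
# Hypothetical helper, not part of janitor: replace German umlauts and ß
# before any ASCII conversion can strip them.
german_to_ascii <- function(x) {
  stringi::stri_replace_all_fixed(
    x,
    pattern = c("ä", "ö", "ü", "Ä", "Ö", "Ü", "ß"),
    replacement = c("ae", "oe", "ue", "Ae", "Oe", "Ue", "ss"),
    vectorize_all = FALSE # apply every pattern to every element of x
  )
}

german_to_ascii("qualität")
#> [1] "qualitaet"

# The input is now pure ASCII, so the later ASCII conversion has nothing to strip.
janitor::make_clean_names(german_to_ascii("qualität"))
```

This keeps the rest of the make_clean_names() pipeline unchanged, at the cost of maintaining the character mapping yourself.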

billdenney reopened this Mar 18, 2023
billdenney closed this as not planned Mar 18, 2023

dpprdan (Author) commented Mar 20, 2023

@billdenney Thanks. I also played around with the code and, yeah, it's complicated.

FWIW I'd change the default from transliterations = "Latin-ASCII" to transliterations = NULL, because if ascii = TRUE then "Latin-ASCII" is already applied implicitly. If ascii = FALSE, applying a transliteration to ASCII is probably not intended anyway, cf.

I suppose one could also argue that if ascii = FALSE umlauts should stay umlauts?

sfirke (Owner) commented Mar 20, 2023

Thanks for raising this @dpprdan and for investigating it @billdenney. Sounds like things are settled and I expect this conversation may be of use to future users investigating this behavior.
