R function set_standard_names() does not remove special characters #1

Open
mlaunois opened this issue Mar 1, 2019 · 1 comment

mlaunois commented Mar 1, 2019

English version

I'm working on a dataset about the ISF (Impôt de Solidarité sur la Fortune) in France in 2017 (the file attached below). When I try to read that file with the ISO-8859-15 encoding, using the following R code:

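library(readr)   # read_csv(), locale()
library(dplyr)   # %>%
library(tricky)  # set_standard_names()

isfdata <- read_csv(
  file = "isfdata-2017.csv",
  locale = locale(
    date_format = "%d/%m/%Y",
    # %M is minutes and %S is seconds; %m would mean month
    time_format = "%H:%M:%S",
    encoding = "ISO-8859-15"
  )) %>% set_standard_names()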

I end up with a tibble containing a weird column named impazt_moyen_en_a\u0082¬.
As defined on this page, the character \u0082, named BREAK PERMITTED HERE in the Unicode standard, is a control character. The last character (¬) is also a problem, but that one was inserted by read_csv.
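
For reference, here is a small sketch of one way such a name could arise. It assumes, without confirmation from this thread, that the original header contained a euro sign stored as UTF-8 and that its bytes were decoded one by one as ISO-8859-15. The variable names are made up for the illustration.

```r
# UTF-8 encoding of the euro sign: three bytes, e2 82 ac
euro_bytes <- charToRaw(enc2utf8("\u20ac"))

# Decode each byte as if it were a single ISO-8859-15 character.
# For the byte values e2, 82 and ac, ISO-8859-15 assigns the code point
# with the same number, so a direct byte-to-code-point mapping is enough.
code_points <- as.integer(euro_bytes)
misread <- intToUtf8(code_points)

sprintf("U+%04X", code_points)
# "U+00E2" "U+0082" "U+00AC"
```

U+00E2 is â, which a name standardizer could plausibly reduce to a, leaving the a\u0082¬ ending seen in the column name.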

What should I do? I cannot use the column name at all with dplyr functions and similar, even with quoting. I had to modify the file by hand to remove those special characters.
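
As a possible workaround that avoids editing the file by hand, the column names can be cleaned after import with base R. This is only a sketch: strip_special() is a hypothetical helper, not a function from readr, dplyr or tricky, and the one-column data frame stands in for the real tibble.

```r
# Hypothetical helper: keep only ASCII letters, digits and underscores.
strip_special <- function(x) {
  # perl = TRUE makes the character ranges independent of the locale
  x <- gsub("[^A-Za-z0-9_]", "", x, perl = TRUE)
  # Drop trailing underscores that stripping may leave behind
  sub("_+$", "", x)
}

# Stand-in for the imported data, with the problematic column name
df <- data.frame(x = 1:3)
names(df) <- "impazt_moyen_en_a\u0082\u00ac"

names(df) <- strip_special(names(df))
names(df)
# "impazt_moyen_en_a"
```

After this step the column can be referenced as df$impazt_moyen_en_a or by bare name in dplyr verbs.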


French version (translated)

I'm working on data about the Impôt de Solidarité sur la Fortune in 2017 (the file I attached). When I try to read the file with the ISO-8859-15 encoding like this:

library(readr)   # read_csv(), locale()
library(dplyr)   # %>%
library(tricky)  # set_standard_names()

isfdata <- read_csv(
  file = "isfdata-2017.csv",
  locale = locale(
    date_format = "%d/%m/%Y",
    # %M is minutes and %S is seconds; %m would mean month
    time_format = "%H:%M:%S",
    encoding = "ISO-8859-15"
  )) %>% set_standard_names()

I end up with a tibble containing a weird column named impazt_moyen_en_a\u0082¬.
As defined here, \u0082, or BREAK PERMITTED HERE in the Unicode standard, is a control character. The last character in the string is also a problem...

What should I do? ... I cannot use this name with dplyr functions and similar, even with quoting...
Thanks again in advance!

isfdata-2017.txt


mlaunois commented Mar 2, 2019

Don't worry, this is due to a bug in readr: tidyverse/readr#974

Once that bug is triggered, the str_standardize function fails to remove the Unicode escape characters from column names. Should this issue be closed, since the problem is not caused by tricky at all?
