Error reading CSV - missing lines #1096

Open
Thiago-Simoes opened this issue Jun 13, 2023 · 2 comments

Comments

@Thiago-Simoes

I am trying to read a CSV and I always thought everything was fine, but recently I noticed that some lines are missing. I don't understand why.
I wrote a function to parse the CSV manually and it worked, but when using the package some lines are missing.

The file has 34034 lines. Reading it with my function returns a 34034-row dataframe, but with the package the dataframe has only 33704 rows.
Almost 1% of the lines are affected.

The file is attached; I hope someone can help.
File: fi.csv

@jeremiedb

Loading the file with both the latest CSV (v0.10.11) and an earlier release (v0.10.8) results in the same 33 704 rows:

using CSV

# fi.csv is the file attached to the issue, placed next to this script
path = joinpath(@__DIR__, "fi.csv")
file = CSV.File(path; ntasks=1, rows_to_check=30000);

I did note, however, that the header row and several other rows have 33 ";" delimiters, while several others, such as the second row, have only 22. This points to some inconsistency in the CSV data itself.
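
One way to check the delimiter inconsistency described above is to tally how many lines contain each number of ";" characters. The following is a minimal sketch using only Base Julia; `delimiter_histogram` is a hypothetical helper name (not part of CSV.jl), and the count is naive in that it ignores quoting, so a ";" inside a quoted field is counted like any other.

```julia
# Tally how many lines contain each number of delimiter characters.
# `delimiter_histogram` is a hypothetical helper, not a CSV.jl function.
function delimiter_histogram(lines; delim::Char=';')
    hist = Dict{Int,Int}()
    for line in lines
        n = count(==(delim), line)      # delimiters on this line
        hist[n] = get(hist, n, 0) + 1   # one more line with n delimiters
    end
    return hist
end

# Small made-up sample: three lines with 2 delimiters, one line with 1.
sample = ["a;b;c", "1;2;3", "4;5", "6;7;8"]
hist = delimiter_histogram(sample)
```

For the attached file, something like `delimiter_histogram(eachline(path))` should work the same way; more than one key in the result would suggest that rows do not all have the same number of fields.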

@Liozou
Contributor

Liozou commented Jul 27, 2023

The issue is the single quote mark at line 5741 of the CSV (column 288). The next quote mark is on line 6071, so everything in between is treated as one string, i.e. a single field value. For some reason it is then silently converted to a missing value (which may be fixable here, if it actually is an issue?). So 6071 - 5741 == 330 lines are lost, which accounts for your missing lines (34034 - 33704 == 330).

To get the correct file, you can use the quoted=false option (e.g. file = CSV.File(path; quoted=false)) to simply ignore quote marks, or you can remove the offending quote mark at line 5741. I would also suggest removing the other single quote marks at lines 3701, 24956 and 32356.
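
To locate stray quote marks like the ones listed above, one option is to flag every line that contains an odd number of quote characters. This is only a rough sketch in Base Julia: `unbalanced_quote_lines` is a hypothetical helper name, and the quote character is assumed to be `"` (the CSV.jl default), which may not match the character actually present in the file. A legitimately quoted field that spans several lines would also be flagged, so the result is a list of candidates to inspect rather than a definitive diagnosis.

```julia
# Return the 1-based numbers of lines with an odd count of quote characters.
# `unbalanced_quote_lines` is a hypothetical helper, not a CSV.jl function.
function unbalanced_quote_lines(lines; quotechar::Char='"')
    flagged = Int[]
    for (i, line) in enumerate(lines)
        # An odd count means a quote on this line is opened or closed elsewhere.
        if isodd(count(==(quotechar), line))
            push!(flagged, i)
        end
    end
    return flagged
end

# Made-up sample: lines 2 and 4 each contain a lone quote mark,
# line 3 contains a properly paired one.
sample = ["a;b;c", "1;\"x;3", "4;\"y\";6", "7;8\";9"]
flagged = unbalanced_quote_lines(sample)
```

On the real file this could be called as `unbalanced_quote_lines(eachline(path))`, passing a different `quotechar` if needed.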

It's obviously bad that some lines can actually be "lost" by the parser, but of course it's very difficult to correctly handle incorrect data files... I don't know what should be done here.
