Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON to CSV Error #1534

Open
osevill opened this issue Mar 25, 2024 · 8 comments
Open

JSON to CSV Error #1534

osevill opened this issue Mar 25, 2024 · 8 comments
Assignees
Labels

Comments

@osevill
Copy link

osevill commented Mar 25, 2024

In previous versions mlr --j2c group-like worked fine.
In version 6.12 the same script with the same json file gives an error:
mlr: CSV schema change: first keys "..."; current keys "..."

The fields/properties do change from one object element of the json array to the next, but that's why I use group-like.

Thanks

@aborruso
Copy link
Contributor

Could you share a sample file and a sample command?

@osevill
Copy link
Author

osevill commented Mar 25, 2024

given this sample file:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_1":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_1":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

if I run mlr --j2c group-like sample.csv using ver 6.12 (or 6.11)
I get this:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
Shangri-La Hotel,25.2084,55.2719,
Grand Hyatt,25.2285,55.3273,1,2,3

even though each object element has slightly different fields, where I would expect a new csv header row each time.

if I change the source file so that the nested array names differ from element to element of the outer array:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
I get this error:

mlr: CSV schema change: first keys "name,location_1.1.lat,location_1.2.long,field_1"; current keys "name,location_2.1.lat,location_2.2.long"
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
mlr: exiting due to data error.

in version 6.10, this works as expected:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_1.1.lat,location_1.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_1.1.lat,location_1.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

and

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_2.1.lat,location_2.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_3.1.lat,location_3.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

Also, if I use --j2p (instead of --j2c) in 6.12, it seems to work fine however.

Thanks.

@aborruso
Copy link
Contributor

Hi @osevill it's probably related to flatten/unflatten, but I do not have a solution for you.

@johnkerl will be able to help you.

@johnkerl
Copy link
Owner

johnkerl commented Mar 26, 2024

@osevill the 6.11.0 release (https://github.com/johnkerl/miller/releases/tag/v6.11.0) contains PR #1479 which addresses issue #1418.

Before this, Miller was in some cases producing non-compliant CSV output:

$ cat i.j
[
  { "a": 1 },
  { "b": 2, "c": 3 }
]
$ mlr-6.10.0 --j2c cat i.j
a
1

b,c
2,3

After this, Miller now produces compliant CSV output, or says that it can't:

$ mlr-6.11.0 --j2c cat i.j
a
1
mlr: CSV schema change: first keys "a"; current keys "b,c"
mlr: exiting due to data error.

If one row's list of column names is a strict subset of the others it can auto-unsparsify:

$ cat k.j
[
  { "a": 1, "b": 2 },
  { "a": 3 }
]

$ mlr-6.10.0 --j2c cat k.j
a,b
1,2

a
3

$ mlr-6.11.0 --j2c cat k.j
a,b
1,2
3,

The concern raised by issue #1418, and addressed by PR #1479, is that Miller should stop producing "CSV" with non-compliant blank lines in it. @aborruso was right to request to #1418.

For the data files in this issue, the records are truly non-homogeneous and are truly not representable as compliant CSV.

Two options I can suggest:

  • Use csvlite output format (https://miller.readthedocs.io/en/latest/file-formats/#csvtsvasvusvetc)
    • This doesn't claim to comply with RFC4180
    • It allows non-homogeneous records, separated with line breaks, which is a good match for the kind of output data you want to obtain
    • mlr --ijson --ocsvlite group-like i.j
  • Use unsparsify to obtain compliant CSV
    • mlr --ijson --ocsv group-like then flatten then unsparsify i.j

@aborruso
Copy link
Contributor

mlr --ijson --ocsv group-like then flatten then unsparsify i.j

I was looking for this, but each time I am unable to reconstruct it. Thank you @johnkerl

@osevill
Copy link
Author

osevill commented Mar 29, 2024

@johnkerl The feedback above is great help, particularly ...group-like then flatten then unsparsify to have one unique header row with fields from all json array elements.
But I find it useful sometimes to see csv records in groups, by distinct header, which is what your first suggestion does. You're correct that the sample I provided has no commas in the data fields and so csvlite works, but I then realized that my actual data does sometimes have commas inside double-quoted values, which is why the --j2c option worked well for me prior to 6.11.

Since my json data does have commas in the values, here's what I came up with...

Given this sample file (this time with commas in the values):
[{"name":"Rixos,The,Palm,Dubai","location":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri,La,Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand,Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

...if I convert from json to tsvlite when doing the group-like, I avoid field breaks after each comma in the value. (I'm assuming my data will not have tabs in the values, which has been the case so far...or embedded newlines); then I do a separate mlr cat just to convert from tab-delimiters to a symbol delimiter of my choice (so that the delimiter is a printable character). At this point, I tell whatever software I'm using that the delimiter is my symbol, and I'm good:

mlr --ijson --otsvlite --from ./sample_json_array.json group-like | mlr --itsvlite --ocsvlite --ofs '•' cat > ./output_file.csv

Is there a simpler way to change the delimiter than calling mlr again and changing the output field separator? tsvlite doesn't seem to support changing the output field separator.

Don't know your thoughts on this but would it be worthwhile to have a new file format that is "in between" csvlite and csv, in the sense that it would be csvlite + support for commas or newlines embedded in double quotes, but because it wouldn't adhere to the RFC4180 spec, it would allow for blank rows in the output?

In this particular instance, it would assist me in accomplishing my json to row-based transform in just one mlr group-like.

Thanks again for the feedback.

@johnkerl
Copy link
Owner

johnkerl commented Mar 30, 2024

@osevill the best option would be to restore the behavior of #1479, default off, only behind a new opt-in flag -- I can do this, no problem.

@osevill
Copy link
Author

osevill commented Mar 30, 2024

Thanks!
So the proposed behavior would be to continue to auto-unsparsify all header fields of csv/tsv output, but with a new optional flag to not auto-unsparsify (which would make mlr --j2c group-like with the new flag work like it did in 6.10)?

@johnkerl johnkerl self-assigned this Jun 9, 2024
@johnkerl johnkerl added the active label Jun 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants