Auto-unsparsify CSV and TSV on output (#1479)
* Auto-unsparsify CSV

* Update unit-test cases

* More unit-test cases

* Key-change handling for CSV output

* Same for TSV, with unit-test and doc updates
johnkerl authored Jan 20, 2024
1 parent af021f2 commit ac65675
Showing 61 changed files with 479 additions and 219 deletions.
5 changes: 5 additions & 0 deletions docs/src/data/key-change.json
@@ -0,0 +1,5 @@
[
{ "a": 1, "b": 2, "c": 3 },
{ "a": 4, "b": 5, "c": 6 },
{ "a": 7, "X": 8, "c": 9 }
]
6 changes: 6 additions & 0 deletions docs/src/data/under-over.json
@@ -0,0 +1,6 @@
[
{ "a": 1, "b": 2, "c": 3 },
{ "a": 4, "b": 5, "c": 6, "d": 7 },
{ "a": 7, "b": 8 },
{ "a": 9, "b": 10, "c": 11 }
]
68 changes: 68 additions & 0 deletions docs/src/file-formats.md
@@ -130,6 +130,74 @@ In particular, no encode/decode of `\r`, `\n`, `\t`, or `\\` is done.

* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.

* CSV-lite and TSV-lite handle schema changes ("schema" meaning "ordered list of field names in a given record") by adding a newline and re-emitting the header. CSV and TSV, by contrast, do the following:
* If there are too few keys, but these match the header, empty fields are emitted.
* If there are too many keys, but these match the header up to the number of header fields, the extra fields are emitted.
* If keys don't match the header, this is an error.

<pre class="pre-highlight-in-pair">
<b>cat data/under-over.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{ "a": 1, "b": 2, "c": 3 },
{ "a": 4, "b": 5, "c": 6, "d": 7 },
{ "a": 7, "b": 8 },
{ "a": 9, "b": 10, "c": 11 }
]
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --ijson --ocsvlite cat data/under-over.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3

a,b,c,d
4,5,6,7

a,b
7,8

a,b,c
9,10,11
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --ijson --ocsvlite cat data/key-change.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
4,5,6

a,X,c
7,8,9
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --ijson --ocsv cat data/under-over.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
4,5,6,7
7,8,
9,10,11
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --ijson --ocsv cat data/key-change.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
4,5,6
mlr: CSV schema change: first keys "a,b,c"; current keys "a,X,c"
mlr: exiting due to data error.
</pre>

* In short, CSV-lite and TSV-lite are most useful when dealing with CSV/TSV files that are formatted in some non-standard way -- they give you a little more flexibility. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)

CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
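
The CSV/TSV key-matching rules documented in this section amount to a short decision procedure. The following standalone Go sketch is an illustration only, not Miller's implementation -- `formatRow` is a hypothetical helper, and real CSV output also involves quoting and separator handling that the sketch omits.

```go
package main

import (
	"fmt"
	"strings"
)

// formatRow applies the CSV/TSV output rules described above: pad with empty
// fields when a record's keys are a leading subset of the header, pass extra
// fields through when the header is a leading subset of the keys, and report
// an error on any other mismatch.
func formatRow(header, keys, values []string) (string, error) {
	n := len(header)
	if len(keys) < n {
		n = len(keys)
	}
	for i := 0; i < n; i++ {
		if keys[i] != header[i] {
			return "", fmt.Errorf("CSV schema change: first keys \"%s\"; current keys \"%s\"",
				strings.Join(header, ","), strings.Join(keys, ","))
		}
	}
	out := append([]string{}, values...)
	for len(out) < len(header) {
		out = append(out, "") // too few keys: emit empty fields
	}
	return strings.Join(out, ","), nil // extra fields, if any, ride along
}

func main() {
	header := []string{"a", "b", "c"}
	for _, rec := range []struct{ keys, values []string }{
		{[]string{"a", "b", "c", "d"}, []string{"4", "5", "6", "7"}},
		{[]string{"a", "b"}, []string{"7", "8"}},
		{[]string{"a", "X", "c"}, []string{"7", "8", "9"}},
	} {
		line, err := formatRow(header, rec.keys, rec.values)
		if err != nil {
			fmt.Println("error:", err) // the a,X,c record mismatches the header
			continue
		}
		fmt.Println(line) // "4,5,6,7", then "7,8,"
	}
}
```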
25 changes: 25 additions & 0 deletions docs/src/file-formats.md.in
@@ -42,6 +42,31 @@ In particular, no encode/decode of `\r`, `\n`, `\t`, or `\\` is done.

* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.

* CSV-lite and TSV-lite handle schema changes ("schema" meaning "ordered list of field names in a given record") by adding a newline and re-emitting the header. CSV and TSV, by contrast, do the following:
* If there are too few keys, but these match the header, empty fields are emitted.
* If there are too many keys, but these match the header up to the number of header fields, the extra fields are emitted.
* If keys don't match the header, this is an error.

GENMD-RUN-COMMAND
cat data/under-over.json
GENMD-EOF

GENMD-RUN-COMMAND
mlr --ijson --ocsvlite cat data/under-over.json
GENMD-EOF

GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --ijson --ocsvlite cat data/key-change.json
GENMD-EOF

GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/under-over.json
GENMD-EOF

GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --ijson --ocsv cat data/key-change.json
GENMD-EOF

* In short, CSV-lite and TSV-lite are most useful when dealing with CSV/TSV files that are formatted in some non-standard way -- they give you a little more flexibility. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)

CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
4 changes: 1 addition & 3 deletions docs/src/questions-about-joins.md
@@ -118,9 +118,7 @@ However, if we ask for left-unpaireds, since there's no `color` column, we get a
id,code,color
4,ff0000,red
2,00ff00,green

id,code
3,0000ff
3,0000ff,
</pre>

To fix this, we can use **unsparsify**:
51 changes: 37 additions & 14 deletions docs/src/record-heterogeneity.md
@@ -375,13 +375,12 @@ record_count=150,resource=/path/to/second/file
CSV and pretty-print formats expect rectangular structure. But Miller lets you
process non-rectangular data using CSV and pretty-print.

Miller simply prints a newline and a new header when there is a schema change
-- where by _schema_ we mean simply the list of record keys in the order they
are encountered. When there is no schema change, you get CSV per se as a
special case. Likewise, Miller reads heterogeneous CSV or pretty-print input
the same way. The difference between CSV and CSV-lite is that the former is
[RFC-4180-compliant](file-formats.md#csvtsvasvusvetc), while the latter readily
handles heterogeneous data (which is non-compliant). For example:
For CSV-lite and TSV-lite, Miller simply prints a newline and a new header when there is a schema
change -- where by _schema_ we mean simply the list of record keys in the order they are
encountered. When there is no schema change, you get CSV per se as a special case. Likewise, Miller
reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite
is that the former is [RFC-4180-compliant](file-formats.md#csvtsvasvusvetc), while the latter
readily handles heterogeneous data (which is non-compliant). For example:

<pre class="pre-highlight-in-pair">
<b>cat data/het.json</b>
@@ -446,19 +445,43 @@ record_count resource
150 /path/to/second/file
</pre>

Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- if there are implicit header changes (no intervening blank line and new header line) as seen above -- you can use `--allow-ragged-csv-input` (or keystroke-saver `--ragged`).
<pre class="pre-highlight-in-pair">
<b>mlr --ijson --ocsvlite group-like data/het.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
resource,loadsec,ok
/path/to/file,0.45,true
/path/to/second/file,0.32,true
/some/other/path,0.97,false

record_count,resource
100,/path/to/file
150,/path/to/second/file
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --csv --ragged cat data/het/ragged.csv</b>
<b>mlr --ijson --ocsv group-like data/het.json</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
resource,loadsec,ok
/path/to/file,0.45,true
/path/to/second/file,0.32,true
/some/other/path,0.97,false
mlr: CSV schema change: first keys "resource,loadsec,ok"; current keys "record_count,resource"
mlr: exiting due to data error.
</pre>

a,b
4,5
Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- if
there are implicit header changes (no intervening blank line and new header line) as seen above --
you can use `--allow-ragged-csv-input` (or keystroke-saver `--ragged`).

a,b,c,4
<pre class="pre-highlight-in-pair">
<b>mlr --csv --allow-ragged-csv-input cat data/het/ragged.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
4,5,
7,8,9,10
</pre>
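
For the ragged-input example just shown, the record construction can be pictured roughly as follows. This is a hedged sketch of the idea, not Miller's reader code: `raggedRowToRecord` is a hypothetical helper, and the positional-key convention for extra fields (key `4` for a fourth field beyond an `a,b,c` header) is inferred from the CSV-lite output shown earlier in this diff.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// raggedRowToRecord pairs one data row with the header, preserving order.
// Rows shorter than the header simply yield fewer keys; fields beyond the
// header get positional keys ("4", "5", ...).
func raggedRowToRecord(header, fields []string) []string {
	pairs := make([]string, 0, len(fields))
	for i, value := range fields {
		key := strconv.Itoa(i + 1) // positional key for extra fields
		if i < len(header) {
			key = header[i]
		}
		pairs = append(pairs, key+"="+value)
	}
	return pairs
}

func main() {
	header := []string{"a", "b", "c"}
	for _, line := range []string{"1,2,3", "4,5", "7,8,9,10"} {
		fmt.Println(raggedRowToRecord(header, strings.Split(line, ",")))
	}
	// [a=1 b=2 c=3]
	// [a=4 b=5]
	// [a=7 b=8 c=9 4=10]
}
```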

27 changes: 18 additions & 9 deletions docs/src/record-heterogeneity.md.in
@@ -180,13 +180,12 @@ GENMD-EOF
CSV and pretty-print formats expect rectangular structure. But Miller lets you
process non-rectangular data using CSV and pretty-print.

Miller simply prints a newline and a new header when there is a schema change
-- where by _schema_ we mean simply the list of record keys in the order they
are encountered. When there is no schema change, you get CSV per se as a
special case. Likewise, Miller reads heterogeneous CSV or pretty-print input
the same way. The difference between CSV and CSV-lite is that the former is
[RFC-4180-compliant](file-formats.md#csvtsvasvusvetc), while the latter readily
handles heterogeneous data (which is non-compliant). For example:
For CSV-lite and TSV-lite, Miller simply prints a newline and a new header when there is a schema
change -- where by _schema_ we mean simply the list of record keys in the order they are
encountered. When there is no schema change, you get CSV per se as a special case. Likewise, Miller
reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite
is that the former is [RFC-4180-compliant](file-formats.md#csvtsvasvusvetc), while the latter
readily handles heterogeneous data (which is non-compliant). For example:

GENMD-RUN-COMMAND
cat data/het.json
@@ -200,10 +199,20 @@ GENMD-RUN-COMMAND
mlr --ijson --opprint group-like data/het.json
GENMD-EOF

Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- if there are implicit header changes (no intervening blank line and new header line) as seen above -- you can use `--allow-ragged-csv-input` (or keystroke-saver `--ragged`).
GENMD-RUN-COMMAND
mlr --ijson --ocsvlite group-like data/het.json
GENMD-EOF

GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --csv --ragged cat data/het/ragged.csv
mlr --ijson --ocsv group-like data/het.json
GENMD-EOF

Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- if
there are implicit header changes (no intervening blank line and new header line) as seen above --
you can use `--allow-ragged-csv-input` (or keystroke-saver `--ragged`).

GENMD-RUN-COMMAND
mlr --csv --allow-ragged-csv-input cat data/het/ragged.csv
GENMD-EOF

## Processing heterogeneous data
15 changes: 12 additions & 3 deletions pkg/output/channel_writer.go
@@ -94,7 +94,11 @@ func channelWriterHandleBatch(
		}

		if record != nil {
			recordWriter.Write(record, bufferedOutputStream, outputIsStdout)
			err := recordWriter.Write(record, bufferedOutputStream, outputIsStdout)
			if err != nil {
				fmt.Fprintf(os.Stderr, "mlr: %v\n", err)
				return true, true
			}
		}

		outputString := recordAndContext.OutputString
@@ -111,8 +115,13 @@
			// queued up. For example, PPRINT needs to see all same-schema
			// records before printing any, since it needs to compute max width
			// down columns.
			recordWriter.Write(nil, bufferedOutputStream, outputIsStdout)
			return true, false
			err := recordWriter.Write(nil, bufferedOutputStream, outputIsStdout)
			if err != nil {
				fmt.Fprintf(os.Stderr, "mlr: %v\n", err)
				return true, true
			} else {
				return true, false
			}
		}
	}
	return false, false
2 changes: 1 addition & 1 deletion pkg/output/record_writer.go
@@ -20,5 +20,5 @@ type IRecordWriter interface {
		outrec *mlrval.Mlrmap,
		bufferedOutputStream *bufio.Writer,
		outputIsStdout bool,
	)
	) error
}
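
The signature change above is what lets a record writer report a schema mismatch instead of silently emitting misaligned output; the `channel_writer.go` change earlier in this commit then surfaces that error and stops. Below is a rough, self-contained illustration of the pattern -- a toy mirror of the interface shape, not Miller's CSV writer, with all names hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// toyRecord stands in for Miller's ordered record type in this sketch.
type toyRecord struct {
	keys, values []string
}

// toyWriter mirrors the shape of the updated IRecordWriter: Write returns an
// error so a schema mismatch can stop output cleanly instead of being ignored.
type toyWriter struct {
	firstHeader []string
}

func (w *toyWriter) Write(outrec *toyRecord, out *bufio.Writer, outputIsStdout bool) error {
	if outrec == nil {
		return nil // end of stream; nothing buffered in this sketch
	}
	if w.firstHeader == nil {
		w.firstHeader = outrec.keys
		out.WriteString(strings.Join(outrec.keys, ",") + "\n")
	} else if !prefixCompatible(w.firstHeader, outrec.keys) {
		return fmt.Errorf("CSV schema change: first keys \"%s\"; current keys \"%s\"",
			strings.Join(w.firstHeader, ","), strings.Join(outrec.keys, ","))
	}
	// Padding of short records, extra-field handling, and quoting are omitted here.
	out.WriteString(strings.Join(outrec.values, ",") + "\n")
	return nil
}

// prefixCompatible reports whether one key list is a leading subset of the other.
func prefixCompatible(a, b []string) bool {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	w := &toyWriter{}
	for _, r := range []*toyRecord{
		{keys: []string{"a", "b", "c"}, values: []string{"1", "2", "3"}},
		{keys: []string{"a", "X", "c"}, values: []string{"7", "8", "9"}},
	} {
		if err := w.Write(r, out, true); err != nil {
			out.Flush()
			fmt.Fprintf(os.Stderr, "mlr: %v\n", err)
			break
		}
	}
}
```

The real writers route values through Miller's CSV/TSV formatting; the point of the sketch is only the error-propagation shape that the new interface makes possible.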