Add JSON Lines support #152

dloss · 2022-09-22T20:53:33Z

The JSON Lines text format (aka JSONL or newline-delimited JSON) has one JSON object per line. It's often used for structured log files or as a well-specified alternative to CSV.

Here are some ideas for how the JSON Lines format could be supported in GoAWK. To be honest, I'm not completely sure this is a good idea, but I've found it interesting to think about. This write-up captures some of my thoughts.

I can imagine different levels of sophistication. We could start simple and then in later versions support more complex input data and ways to interact with it.

One JSON array of scalars per line

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, null]
["Deloise", "2012A", 19, true]

Suggestions:

Add a jsonl input mode.
Columns could be parsed to $1, $2, $3, ...
Error handling like with CSV

Questions:

How to handle JSON booleans (true/false) and null?
Does Unicode cause some problems?

One JSON object per line, with pairs of keys and scalar values

This is used by the Graylog Extended Log Format (GELF).

{"version":"1.1", "host":"example.org", "short_message": "A log message", "facility":"test", "_foo":"bar"}
{"version":"1.1", "host":"test.example.org", "short_message": "Another msg", "facility":"test", "_foo":"baz"}

Users wanting to parse Logfmt messages (like myself, see #149) should be able to convert their data into this format quite easily.

Suggestions:

Re-use existing named-field syntax to get the fields (e.g. @"short_message")
Update FIELDS array for each line. Don't expect all lines to have the same number or order of fields.

Nested data

{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"], "beta": {"hey": "How's tricks?"}}}
{"one": 1, "four": [4], "five": {"alpha": ["fa", "fim"], "beta": {"hi": "How's tracks?"}}}

Suggestions:

I guess we want to keep the syntax simple and not support something sophisticated like jsonpath or jmespath syntax to extract fields.
Maybe just return nested data as JSON strings: @"four" -> "[1,2,3,4]"
Enhance named-field syntax with dots and square brackets to get subfields and array elements: @"five.alpha[1]" returns "fum", @"five.beta" returns "{"hey": "How's tricks?"}" (quoting issue, see below).
Add a function to map JSON array elements to AWK fields, e.g. getjsonarr("five.alpha"). Now $1 is fo, $2 is fum.
Maybe add a function to set a new root for field extraction, e.g. setjsonroot("five.beta"); print @"hey" returns How's tricks?
Use gron's collection of JSON testdata.

Questions:

How to escape double quotes in returned JSON strings?
How to map the first element of a JSON array to an AWK field? JSON arrays are 0-based, AWK fields are 1-based.


benhoyt · 2022-10-06T09:08:42Z

Thanks! I do intend to do a deep-dive into this, but just a few initial thoughts.

I hadn't considered your first example of "JSON array per line", just because "JSON object per line" is much more common. But that's perfectly valid and reasonable as a strongly typed CSV (well, really more like "slightly typed CSV"). I think JSON true should map to AWK 1 and false to AWK 0. As for JSON null, probably AWK null (what variables are initialized to, but basically acts as "" and 0 depending on context).

What would it do with non-scalar values? In other words, if an array or object was nested inside? Error? Just yield the JSON string? Ignore it? Replace with some placeholder like ""? I suppose for v1 we could say non-scalar values are undefined, and yield "" for now, with the possibility of extending it later.

I don't think Unicode causes problems. Everything's just UTF-8 in GoAWK.

And then "JSON object per line" maps very well to the GoAWK-specific @"field" syntax, as you say. Again, there's the problem of nested, non-scalar items. The @"foo.bar" or @"foo.bar[5]" type of syntax is tempting, but it would change the "row storage model" quite a bit -- not sure if that's an issue. Yes, jsonpath and jmespath seem significantly more complicated than we'd want here; just plain JavaScript .key and [index] notation would be enough. Though again, for v1 we could say non-scalar values are undefined, with the possibility of extending it later.

What would $1 and $2 mean in "JSON object per line" mode? (In fact, would it be a different mode than "JSON array per line", or would that be automatic?) Decoding into a map[string]any with Go's JSON decoder doesn't record key order. Go's encoding/json doesn't export a scanner, so we might have to build our own if we wanted key order. Then again, maybe for v1 we just disallow $n.

Not sure we need to escape double quotes in returned JSON strings if we end up doing that. Just yield the JSON-encoded string. Escaping is only an issue for string literals.
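
To illustrate "just yield the JSON-encoded string": a small sketch that decodes a line into map[string]json.RawMessage, so a nested value comes back as its original JSON text. The helper name rawField is hypothetical. Note that a top-level string value would also come back still quoted with this approach.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawField returns the JSON text of one top-level field, exactly as it
// appeared in the input line. Hypothetical helper for illustration.
func rawField(line, key string) (string, error) {
	var record map[string]json.RawMessage
	if err := json.Unmarshal([]byte(line), &record); err != nil {
		return "", err
	}
	raw, ok := record[key]
	if !ok {
		return "", fmt.Errorf("no field %q", key)
	}
	return string(raw), nil
}

func main() {
	line := `{"one": 1, "four": [1,2,3,4], "five": {"beta": {"hey": "How's tricks?"}}}`
	four, err := rawField(line, "four")
	if err != nil {
		panic(err)
	}
	fmt.Println(four) // prints [1,2,3,4]
}
```

The result is an ordinary string value at runtime, so its embedded double quotes need no escaping; escaping would only matter when writing such a string as a literal in AWK source.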

Yeah, the AWK $1 vs JavaScript [0] thing is interesting. I think for the @"foo[0]" notation it should be 0-based, given that it will be a subset of JavaScript notation and that's 0-based. A bit confusing either way.

Thanks for your thoughts on this. More another time!

gedw99 · 2022-11-01T09:13:36Z

Hey

https://github.com/tomnomnom/gron is related, in that it is a Go package that makes JSON work with grep.

It looks like a potential base for GoAWK to support JSON?

In the example fgrep is used. There is a basic golang implementation of grep here: https://github.com/u-root/u-root/blob/v0.10.0/cmds/core/grep/grep.go

fgrep is, as I understand it, deprecated anyway.

janxkoci · 2024-07-26T12:03:42Z

Hey, just an FYI that Miller (written in Go) also supports JSONL (and JSON), so maybe you can check the code there. The author notes that JSON parsing is generally slower than the other supported formats.

benhoyt mentioned this issue Jul 13, 2023

Missing support for json input files #196

Closed

benhoyt mentioned this issue Mar 8, 2024

Support for network? #223

Closed
