Add JSON Lines support #152

Open
dloss opened this issue Sep 22, 2022 · 3 comments

dloss commented Sep 22, 2022

The JSON Lines text format (aka JSONL or newline-delimited JSON) has one JSON value per line, most commonly an object. It's often used for structured log files or as a well-specified alternative to CSV.

Here are some ideas for how the JSON Lines format could be supported in GoAWK. To be honest, I'm not completely sure this is a good idea, but I've found it interesting to think about. This write-up captures some of my thoughts.

I can imagine different levels of sophistication. We could start simple and then in later versions support more complex input data and ways to interact with it.

One JSON array of scalars per line

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, null]
["Deloise", "2012A", 19, true] 

Suggestions:

  • Add a jsonl input mode.
  • Columns could be parsed to $1, $2, $3, ...
  • Error handling like with CSV

Questions:

  • How to handle JSON booleans (true/false) and null?
  • Does Unicode cause some problems?

One JSON object per line, with pairs of keys and scalar values

This is used by the Graylog Extended Log Format (GELF).

{"version":"1.1", "host":"example.org", "short_message": "A log message", "facility":"test", "_foo":"bar"}
{"version":"1.1", "host":"test.example.org", "short_message": "Another msg", "facility":"test", "_foo":"baz"}

Users wanting to parse Logfmt messages (like myself, see #149) should be able to convert their data into this format quite easily.

Suggestions:

  • Re-use existing named-field syntax to get the fields (e.g. @"short_message")
  • Update FIELDS array for each line. Don't expect all lines to have the same number or order of fields.

Nested data

{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"], "beta": {"hey": "How's tricks?"}}}
{"one": 1, "four": [4], "five": {"alpha": ["fa", "fim"], "beta": {"hi": "How's tracks?"}}}

Suggestions:

  • I guess we want to keep the syntax simple and not support something sophisticated like jsonpath or jmespath syntax to extract fields.
  • Maybe just return nested data as JSON strings: @"four" -> "[1,2,3,4]"
  • Enhance named-field syntax with dots and square brackets to get subfields and array elements: @"five.alpha[1]" returns "fum", @"five.beta" returns "{"hey": "How's tricks?"}" (quoting issue, see below).
  • Add a function to map JSON array elements to AWK fields, e.g. getjsonarr("five.alpha"). Now $1 is fo, $2 is fum.
  • Maybe add a function to set a new root for field extraction, e.g. setjsonroot("five.beta"); print @"hey" returns How's tricks?
  • Use gron's collection of JSON testdata.

Questions:

  • How to escape double quotes in returned JSON strings?
  • How to map the first element of a JSON array to an AWK field? JSON arrays are 0-based, AWK fields are 1-based.
Owner

benhoyt commented Oct 6, 2022

Thanks! I do intend to do a deep-dive into this, but just a few initial thoughts.

I hadn't considered your first example of "JSON array per line", just because "JSON object per line" is much more common. But that's perfectly valid and reasonable as a strongly typed CSV (well, really more like "slightly typed CSV"). I think JSON true should map to AWK 1 and false to AWK 0. As for JSON null, probably AWK null (what variables are initialized to, but basically acts as "" and 0 depending on context).

What would it do with non-scalar values? In other words, if an array or object was nested inside? Error? Just yield the JSON string? Ignore it? Replace with some placeholder like ""? I suppose for v1 we could say non-scalar values are undefined, and yield "" for now, with the possibility of extending it later.

I don't think Unicode causes problems. Everything's just UTF-8 in GoAWK.

And then "JSON object per line" maps very well to the GoAWK-specific @"field" syntax, as you say. Again, there's the problem of nested, non-scalar items. The @"foo.bar" or @"foo.bar[5]" type of syntax is tempting, but it would change the "row storage model" quite a bit -- not sure if that's an issue. Yes, jsonpath and jmespath seem significantly more complicated than we'd want here; just plain JavaScript .key and [index] notation would be enough. Though again, for v1 we could say non-scalar values are undefined, with the possibility of extending it later.

What would $1 and $2 mean in "JSON object per line" mode? (In fact, would it be a different mode than "JSON array per line", or would that be automatic?) With Go's JSON decoder to a map[string]any, it doesn't record key order. Go's encoding/json doesn't export a scanner, so we might have to build our own if we wanted key order. Then again, maybe for v1 we just disallow $n.

Not sure we need to escape double quotes in returned JSON strings if we end up doing that. Just yield the JSON-encoded string. Escaping is only an issue for string literals.

Yeah, the AWK $1 vs JavaScript [0] thing is interesting. I think for the @"foo[0]" notation it should be 0-based, given that it will be a subset of JavaScript notation and that's 0-based. A bit confusing either way.

Thanks for your thoughts on this. More another time!

gedw99 commented Nov 1, 2022

Hey

https://github.com/tomnomnom/gron is related: it is a tool written in Go that makes JSON work with grep.

It looks like a potential base for GoAWK to support JSON?

In the example fgrep is used. There is a basic golang implementation of grep here: https://github.com/u-root/u-root/blob/v0.10.0/cmds/core/grep/grep.go

fgrep is, as I understand it, deprecated anyway.

@janxkoci

Hey, just an FYI that Miller (written in Go) also supports JSONL (and JSON), so maybe you can check the code there. The author notes that JSON parsing is generally slower than the other supported formats.
