Skip to content

Latest commit

 

History

History
500 lines (408 loc) · 23.2 KB

design.md

File metadata and controls

500 lines (408 loc) · 23.2 KB

A Journey building a fast JSON parser and full JSONPath, Oj for Go

I had a dream. I'd write a fast JSON parser, generic data, and a JSONPath implementation and it would be beautiful, well organized, and something to be admired. Well, reality kicked in and laughed at those dreams. A Go JSON parser and tools could be high performance but to get that performance compromises in beauty would have to be made. This is a tale of journey that ended with a Parser that leaves the Go JSON parser in the dust and resulted in some useful tools including a complete and efficient JSONPath implementation.

In all fairness I did embark on with some previous experience. Having written two JSON parser before. Both the Ruby Oj and the C parser OjC. Why not an OjG for go.

Planning

Like any journey it starts with the planning. Yeah, I know, it's called requirement gathering but casting it as planning a journey is more fun and this was all about enjoying the discoveries on the journey. The journey takes place in the land of OjG which stands for Oj for Go. Oj or Optimized JSON being a popular gem I wrote for Ruby.

First, JSON parsing and any frequently used operations such as JSONPath evaluation had to be fast over everything else. With the luxury of not having to follow the existing Go json package API the API could be designed for the best performance.

The journey would visit several areas each with its own landscape and different problems to solve.

Generic Data

The first visit was to generic data. Not to be confused with the proposed Go generics. Thats a completely different animal and has nothing to do with whats being referred to as generic data here. In building tools or packages for reuse the data acted on by those tools needs to be navigable.

Reflection can be used but that gets a bit tricky when dealing with private fields or field that can't be converted to something that can say be written as a JSON element. Other options are often better.

Another approach is to use simple Go types such as bool, int64, []any, and other types that map directly on to JSON or some other subset of all possible Go types. If too open, such as with []any it is still possible for the user to put unsupported types into the data. Not to pick out any package specifically but it is frustrating to see an argument type of any in an API and then no documentation describing that the supported types are.

There is another approach though: Define a set of types that can be in a collection and use those types. With this approach, the generic data implementation has to support the basic JSON types of null, boolean, int64, float64, string, array, and object. In addition time should be supported. From experience in both JSON use in Ruby and Go time has always been needed. Time is just too much a part of any set of data to leave it out.

The generic data had to be type safe. It would not do to have an element that could not be encoded as JSON in the data.

A frequent operation for generic data is to store that data into a JSON database or similar. That meant converting to simple Go types of nil, bool, int64, float64, string, []any, and map[string]any had to be fast.

Also planned for this part of the journey was methods on the types to support getting, setting, and deleting elements using JSONPath. The hope was to have an object based approach to the generic nodes so something like the following could be used but keeping generic data, JSONPath, and parsing in separate packages.

    var n gen.Node
    n = gen.Int(123)
    i, ok := n.AsInt()

Unfortunately that part of the journey had to be cancelled as the Go travel guide refuses to let packages talk back and forth. Imports are one way only. After trying to put all the code in one package it eventually got unwieldy. Function names started being prefixed with what should really have been package names so the object and method approach was dropped. A change in API but the journey would continue.

JSON Parser and Validator

The next stop was the parser and validator. After some consideration it seemed like starting with the validator would be best way to become familiar with the territory. The JSON parser and validator need not be the same and each should be as performant as possible. The parsers needed to support parsing into simple Go types as well as the generic data types.

When parsing files that include millions or more JSON elements in files that might be over 100GB a streaming parser is necessary. It would be nice to share some code with both the streaming and string parsers of course. It's easier to pack light when the areas are similar.

The parser must also allow parsing into native Go types. Furthermore interfaces must be supported even though Go unmarshalling does not support interface fields. Many data types make use of interfaces that limitation was not acceptable for the OjG parser. A different approach to support interfaces was possible.

JSON documents of any non-trivial size, especially if hand-edited, are likely to have errors at some point. Parse errors must identify where in the document the error occurred.

JSONPath

Saving the most interesting part of the trip for last, the JSONPath implementation promised to have all sorts of interesting problems to solve with descents, wildcards, and especially filters.

A JSONPath is used to extract elements from data. That part of the implementation had to be fast. Parsing really didn't have to be fast but it would be nice to have a way of building a JSONPath in a performant manner even if it was not as convenient as parsing a string.

The JSONPath implementation had to implement all the features described by the Goessner article. There are other descriptions of JSONPath but the Goessner description is the most referenced. Since the implementation is in Go the scripting feature described could be left out as long as similar functionality could be provided for array indexes relative to the length of the array. Borrowing from Ruby, using negative indexes would provide that functionality.

The Journey

The journey unfolded as planned to a degree. There were some false starts and revisits but eventually each destination was reached and the journey completed.

Generic Data (gen package)

What better way to make generic type fast than to just define generic types from simple Go types and then add methods on those types? A gen.Int is just an int64 and a gen.Array is just a []gen.Node. With that approach there are no extra allocations.

type Node any
type Int int64
type Array []Node

Since generic arrays and objects restrict the type of the values in each collection to gen.Node types the collections are assured to contain only elements that can be encoded as JSON.

Methods on the Node could not be implemented without import loops so the number of functions in the Node interface were limited. It was clear a parser specific to the generic data type would be needed but that would have to wait until the parser part of the journey was completed. Then the generic data package could be revisited and the parser explored.

Peeking at the future to the generic data parser revisit it was not very interesting after the deep dive into the simple data parser. The parser for generic types is a copy of the oj package parser but instead of simple types being created instances that support the gen.Node interface are created.

Simple Parser (oj package)

Looking back its hard to say what was the most interesting part of the journey, the parser or JSONPath. Each had their own unique set of issues. The parser was the best place to start though as some valuable lessons were learned about what to avoid and what to gravitate toward in trying to achieve high performance Go code.

Validator

From the start I knew that a single pass parser would be more efficient than building tokens and then making a second pass to decide what the tokens means. At least that approach as worked well in the past. I dived in and used a readValue function that branched depending on the next character read. It worked but it was slower than the target of being on par with the Go json.Validate. That was the bar to clear. The first attempt was off by a lot. Of course a benchmark was needed to verify that so the cmd/benchmark command was started. Profiling didn't help much. It turned out since much of the overhead was in the function call setup which isn't obvious when profiling.

Not knowing at the time that function calls were so expensive but anticipating that there was some overhead in function calls I moved some of the code from a few frequently called functions to be inline in the calling function. That made much more of a difference than I expected. At that point I looked at the Go code for the core validation code. I was surprised to see that it used lots of functions but not functions attached to a type. I gave that approach a try but with functions on the parser type. The results were not good either. Simply changing the functions to take the parser as an argument made a big difference though. Another lesson learned.

Next was to remove function calls as much as possible since they did seem to be expensive. The code was no longer elegant and had lots of duplicated blocks but it ran much faster. At this point the code performance was getting closer to clearing the Go validator bar.

When parsing in a single pass a conceptual state machine is generally used. When branching with functions there is still a state machine but the states are limited for each function making it much easier to follow. Moving into a single function meant tracking many more states in single state machine. Implementation was with lengthy switch statements. One problem remained though. Array and Object had to be tracked to know when a closing ] or } was allowed. Since function calls were being avoided that meant maintaining a stack in the single parser function. That approach worked well with very little overhead.

Another tweak was to reuse memory. At this point the parser only allocated a few objects but why would it need to allocate any if the buffers for the stack could be reused. That prompted a change in the API. The initial API was for a single Validate() function. If the validator was made public it could be reused. That made a lot of sense since often similar data is parsed or validated by the same application. That change was enough to reduce the allocations per validation to zero and brought the performance under the Go json.Valid() bar.

Parser

With the many optimum techniques identified while visiting the validator, the next part of the journey was to use those same technique on the parser.

The difference between the validator and the parser is that the parser needs to build up data elements. The first attempt was to add the bytes associated with a value to a reusable buffer and then parse that buffer at the end of the value bytes in the source. It worked and was as fast as the json.Unmarshall function but that was not enough as there were still more allocations than seemed necessary.

By expanding the state machine null, true, and false could be identified as values without adding to the buffer. That gave a bit of improvement.

Numbers, specifically integers, were another value type that really didn't need to be parsed from a buffer so instead of appending bytes to a buffer and calling strconv.ParseInt(), integer values were built as an int64 and grown as bytes were read. If a . character is encountered then the number is a decimal so the type expected is changed and each part of a float is captured as integers and finally a float64 is created when done. This was another improvement in performance.

Not much could be done to improve string parsing since it is really just appending bytes to a buffer and making them a string at the final ". Each byte being appended needed to be checked though. A byte map in the form of a 256 long bytes array is used for that purpose.

Going back to the stack used in the validator, instead of putting a simple marker on the stack like the validator, when an Object start character, a { is encountered a new map[string]any is put on the stack. Values and keys are then used to set members of the map. Nothing special there.

Saving the best for last, arrays were tougher to deal with. A value is not just added to an array but rather appended to an array and a potentially new array is returned. Thats not a terribly efficient way to build a slice as it will go through multiple reallocations. Instead, a second slice index stack is kept. As an array is to be created, a spot is reserved on the stack and the index of that stack location is placed on the slice index stack. After that values are pushed onto the stack until an array close character ] is reached. The slice index is then referenced and a new []any is allocated for all the values from the arry index on the stack to the end of the stack. Values are copied to the new array and the stack is collapsed to the index. A bit complicated but it does save multiple object allocations.

After some experimentation it turned out that the overhead of some operations such as creating a slice or adding a number were not impacted to any large extent by making a function call since it does not happen as frequently as processing each byte. Some use of functions could therefor be used to remove duplicate code without incurring a significant performance impact.

One stop left at the parser package tour. Streaming had to be supported. At this point there were already plans on how to deal with streaming which was to load up a buffer and iterate over that buffer using the exact same code as for parsing bytes and repeat until there was nothing left to read. It seemed like using an index into the buffer would be easier to keep track of but switching from a for range to for i = 0; i < size; i++ { dropped the performance considerably. Clearly staying with the range approach was better. Once that was working a quick trip back to the validator to allow it to support streams was made.

Stream parsing or parsing a string with multiple JSON documents in it is best handled with a callback function. That allows the caller to process the parsed document and move on without incurring any additional memory allocations unless needed.

The stay at the validator and parser was fairly lengthy at a bit over a month of evening coding.

JSONPath (jp package)

The visit to JSONPath would prove to be a long stay as well with a lot more creativity for some tantalizing problems.

The first step was to get a language and cultural refresher on JSONPath terms and behavior. From that it was decided that a JSONPath would be represented by a jp.Expr which is composed of fragments or jp.Frag objects. Keeping with the guideline of minimizing allocations the jp.Expr is just a slice of jp.Frag. In most cases expressions are defined statically so the parser need not be fast. No special care was taken to make the JSONPath parser fast. Instead functions are used in an approach that is easier to understand. I said easier, not easy. There are a fair number of dangerous curves with trying to support bracketed notation as well as dot notation and how that all plays nicely with the script parser so that one can call the other to support nested filters. It was rewarding to see it all come together though.

If the need exists to create expressions at run time then functions are used that allow them to be constructed more easily. That makes for a lot of functions. I also like to be able to keep code compact and figured others might too so each fragment type can also be created with a single letter function. They don't have to be used but they exist to support building expressions as a chain.

    x := jp.R().D().C("abc").W().C("xyz").N(3)
    fmt.Println(x.String())
    // $..abc.*.xyz[3]

contrasted with the use of the JSONPath parser:

    x, err := jp.ParseString("$..abc.*.xyz[3]")
    // check err first
    fmt.Println(x.String())
    // $..abc.*.xyz[3]

Evaluating an expression against data involves walking down the data tree to find one or more elements. Conceptually each fragment of a path sets up zero or more paths to follow through the data. When the last fragment is reached the search is done. A recursive approach would be ideal where the evaluation of one fragment then invokes the next fragment's eval function with as many paths it matches. Great on paper but for something like a descent fragment (..) that is a lot of function calls.

Given that function calls are expensive and slices are cheap a Forth (the language) evaluation stack approach is used. Not exactly Forth but a similar concept mixing data and operators. Each fragment takes its matches and those matches already on the stack. Then the next fragment evaluates each in turn. This continues until the stack shrinks back to one element indicating the evaluation is complete. The last fragment puts any matches on a results list which is returned instead of on the stack.

Stack Frag
{a:3} data
'a' Child

One fragment type is a filter which looks like [?(@.x == 3)]. This requires a script or function evaluation. A similar stack based approach is used for evaluating scripts. Note that scripts can and almost always contain a JSONPath expression starting with a @ character. An interesting aspect of this is that a filter can contain other filters. OjG supports nested filters.

The most memorable part of the JSONPath part of the journey had to be the evaluation stack. That worked out great and was able to support all the various fragment types.

Converting or Altering Data (alt package)

A little extra was added to the journey once it was clear the generic data types would not support JSONPath directly. The original plan was to has functions like AsInt() as part of the Node interface. With that no longer reasonable an alt package became part of the journey. It would be used for converting types as well as altering existing ones. To make the last part of the trip even more interesting the alt package is where marshalling and unmarshalling types came into play but under the names of recompose and decompose since operations were to take Go types and decompose those objects into simple or generic data. The reverse is to recompose the simple data back into their original types. This takes an approach used in Oj for Ruby when the type name is encoded in the decomposed data. Since the data type is included in the data itself it is self describing and can be used to recompose types that include interface members.

There is a trade off in that JSON is not parsed directly to a Go type by must go through an intermediate data structure first. There is an up side to that as well though. Now any simple or generic data can be used to recompose objects and not just JSON strings.

The alt.GenAlter() function was interesting in that it is possible to modify a slice type and then reset the members without reallocating.

Thats the last stop of the journey.

Lessons Learned

Benchmarking was instrumental to tuning and picking the most favorable approach to the implementation. Through those benchmarks a number of lessons were learned. The final benchmarks results can be viewed by running the cmd/benchmark command. See the results at benchmarks.md.

Here is a snippet from the benchmarks. Note higher is better for the numbers in parenthesis which is a ratio of the OjG component to Go json package component.

Parse JSON
json.Unmarshal:           7104 ns/op (1.00x)    4808 B/op (1.00x)      90 allocs/op (1.00x)
  oj.Parse:               4518 ns/op (1.57x)    3984 B/op (1.21x)      86 allocs/op (1.05x)
  oj.GenParse:            4623 ns/op (1.54x)    3984 B/op (1.21x)      86 allocs/op (1.05x)

json.Marshal:             2616 ns/op (1.00x)     992 B/op (1.00x)      22 allocs/op (1.00x)
  oj.JSON:                 436 ns/op (6.00x)     131 B/op (7.57x)       4 allocs/op (5.50x)
  oj.Write:                455 ns/op (5.75x)     131 B/op (7.57x)       4 allocs/op (5.50x)

Functions Add Overhead

Sure we all know a function call add some overhead in any language. In C that overhead is pretty small or nonexistent with inline functions. That is not true for Go. There is considerable overhead in making a function call and if that functional call included any kind of context such as being the function of a type the overhead is even higher. That observation (while disappointing) drove a lot of the parser and JSONPath evaluation code. For nice looking and well organized code using functions are highly recommended but for high perfomance find a way to reduce function calls.

The implementation of the parser included a lot of duplicate code to reduce function calls and it did make a significant difference in performance.

The JSONPath evaluation takes an additional approach. It includes a fair amount of code duplication but it also implements its own stack to avoid nested functional calls even though the nature of the evaluation is a better match for a recursive implementation.

Slices are Nice

Slices are implemented very efficiently in Go. Appending to a slice has very little overhead. Reusing slices by collapsing them to zero length is a great way to avoid allocating additional memory. Care has to be taken when collapsing though as any cells in the slice that point to objects will now leave those objects dangling or rather referenced but not reachable and they will never be garbage collected. Simply setting the slice slot to nil will avoid memory leaks.

Memory Allocation

Like most languages, memory allocation adds overhead. It's best to avoid when possible. A good example of that is in the alt package. The Alter() function replaces slice and map members instead of allocating a new slice or map when possible.

Parsers take advantage by reusing buffers and avoiding allocation of token during possible when possible.

Range Has Been Optimized

Using a for range loop is better than incrementing an index. The difference was not huge but was noticable.

APIs Matter

It's important to define an API that is easy to use as well as one that allows for the best performance. The parser as well as the JSONPath builders attempt to do both. An even better example is the GGql GraphQL package. It provides a very simple API when compared to previous Go GraphQL packages and it is many times faster.

Whats Next?

Theres alway something new ready to be explored. For OjG there are a few things in the planning stage.

  • A short trip to Regex filters for JSONPath.
  • A construction project to add JSON building to the oj command which is an alternative to jq but using JSONPath.
  • Explore new territory by implementing a Simple Encoding Notation which mixes GraphQL syntax with JSON for simpler more forgiving format.
  • A callback parser along the lines of the Go json.Decoder or more likely like the Oj Simple Callback Parser.

Discuss this on Changelog News.