This repository was archived by the owner on Jan 8, 2026. It is now read-only.

question about parsing bytes in dag-json #302

@warpfork

Description


I'm trying to implement a single-pass (no backtracking; no re-parsing or multipass hijinx) no-extraneous-allocations (I can do lookahead, but only if it's a fixed amount that's in no way data-dependent) parser for dag-json. I'm having difficulty reconciling this goal with how dag-json specifies the serialization of bytes.

The dag-json specs currently describe bytes as being serialized as:

{"/": { "bytes": String /* Multibase Base64 encoded binary */ }}

This is somewhat problematic. Critically, there's no definition of when a parser may decide to reject this data as invalid dag-json, versus when a deviation from this pattern must instead be parsed as plain maps.

Assuming that we need to validate each of:

  1. there's a map where the first key is "/"
  2. the value for that key contains a map
  3. the first key in this deeper map is "bytes"
  4. its value is a string
  5. there are no more elements in this map
  6. there are no more elements in the outer map either

... then that's six tokens the parser needs to buffer before it can decide for sure that this segment of serial data should be interpreted as bytes.
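To make the six checks concrete, here's a minimal sketch of a recognizer over a hypothetical `(kind, value)` token stream (the token representation is invented for illustration; it's not any real implementation's API). This version simply errors as soon as the pattern derails, which is only one of the possible policies:

```python
import base64

def recognize_bytes(tokens):
    """Check the six steps (after an already-consumed map-open token) in
    order, erroring as soon as the pattern derails rather than buffering
    tokens for later backtracking."""
    expected = [
        ("string", "/"),      # step 1: first key is "/"
        ("map_open", None),   # step 2: its value is a map
        ("string", "bytes"),  # step 3: first inner key is "bytes"
        ("string", None),     # step 4: its value is any string (the payload)
        ("map_close", None),  # step 5: no more entries in the inner map
        ("map_close", None),  # step 6: no more entries in the outer map
    ]
    payload = None
    for step, (kind, value) in enumerate(expected, start=1):
        tok = next(tokens, None)
        if tok is None or tok[0] != kind or (value is not None and tok[1] != value):
            raise ValueError(f"not a bytes pattern (derailed at step {step})")
        if step == 4:
            payload = tok[1]
    return base64.b64decode(payload)

toks = iter([
    ("string", "/"), ("map_open", None), ("string", "bytes"),
    ("string", base64.b64encode(b"hello").decode()),
    ("map_close", None), ("map_close", None),
])
print(recognize_bytes(toks))  # b'hello'
```

Note that this sketch runs in fixed state only because it never tries to reinterpret a derailed pattern as a map; supporting that reinterpretation is exactly what forces the buffering described below.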

If we have to support pivoting our interpretation of this data into maps right up until the last minute (say, if the sixth token is another map key in the outer map), then we really do need to buffer this.

Buffering six tokens is problematic: it crosses a critical threshold, because we could hit the same buffering need again before we've cleared a buffer that's already started. This lookahead process needs to start each time a map-open token is encountered. By six tokens in, there's more than enough distance travelled that the parser might have encountered another token sequence which could require six more tokens of buffering to resolve. This can happen recursively. The result? We can't do this lookahead in a fixed amount of memory. (And frankly, the parse logic just gets nasty.)

If we're allowed to reject data that "teased" (if you will) the parser into thinking there might be a bytes pattern coming up -- for example, serial data that continues with more map content at either step 5 or 6 -- then we don't have the potentially-unlimited-buffer problem for our parser implementation; if there's no need to be able to backtrack and replay data with an alternate interpretation, there's no need to buffer, and thus no difficulty. However, the dag-json spec would need to clearly state whether this is allowed -- and exactly at which steps it is allowed.

Defining exactly which steps are allowed to halt parsing and raise an error, versus which steps must cause backtracking and reinterpretation as plain maps, is extremely important, because there's a big difference between erroring only for e.g. {"/":{"bytes":12 versus erroring for every map whose first entry is "/" that doesn't turn into a CID or into bytes.
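To illustrate how different deviations derail at different steps, here's a small classifier that reports the first failing step of the bytes pattern. For brevity it inspects already-parsed JSON (so it is not itself single-pass); a streaming parser would make the same checks on tokens. The helper name is invented for illustration:

```python
import json

def bytes_pattern_derail_step(obj):
    """Return None if obj matches the bytes pattern, else the 1-based
    step at which it derails."""
    if not (isinstance(obj, dict) and list(obj) and list(obj)[0] == "/"):
        return 1  # no map with "/" as its first key
    inner = obj["/"]
    if not isinstance(inner, dict):
        return 2  # the value for "/" is not a map
    keys = list(inner)
    if not keys or keys[0] != "bytes":
        return 3  # first inner key is not "bytes"
    if not isinstance(inner["bytes"], str):
        return 4  # its value is not a string
    if len(inner) > 1:
        return 5  # extra entries in the inner map
    if len(obj) > 1:
        return 6  # extra entries in the outer map
    return None

print(bytes_pattern_derail_step(json.loads('{"/":{"bytes":12}}')))            # 4
print(bytes_pattern_derail_step(json.loads('{"/":{"bytes":"aGk=","x":1}}')))  # 5
```

A spec could plausibly say step 4's derailment is always an error while step 5's or 6's must fall back to map interpretation (or vice versa); the point is that each step needs an explicit answer.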


Sidetrack: this isn't actually a problem for the parsing of CIDs... which may be interesting to reflect on.

CIDs serialized in dag-json are documented to look like:

{"/": String /* Base58 encoded CIDv0 or Multibase Base32 encoded CIDv1 */}

... which bears some resemblance to how bytes are described, but has one extremely consequential difference: it takes fewer steps and fewer tokens to see from the beginning through to the end of this pattern.

  1. there's a map where the first key is "/"
  2. the value for that key contains a string
  3. there are no more elements in this map

It's possible to process this pattern without potentially needing to re-engage the pattern recognition again before the previous recognition attempt has halted. You can probably see how: by the time you can first possibly encounter a new map-open token which would require starting the pattern recognition again, you'd also simultaneously have flunked any current pattern recognition, because that position must contain a string for the outer pattern recognition to pass.
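The CID version of the recognizer, over the same hypothetical `(kind, value)` token stream, makes the key property visible: none of the three expected positions admits a map-open token, so a new recognition attempt can never begin before the current one has succeeded or failed, and the whole check runs in fixed state. (The CID text below is a placeholder, not a real CID.)

```python
def recognize_cid(tokens):
    """Three-step CID pattern check after an already-consumed map-open
    token. A map-open at step 2 fails the check immediately, so the
    recognizer never needs to restart itself mid-attempt."""
    expected = [
        ("string", "/"),      # step 1: first key is "/"
        ("string", None),     # step 2: its value is a string (the CID text)
        ("map_close", None),  # step 3: no more entries in this map
    ]
    cid = None
    for step, (kind, value) in enumerate(expected, start=1):
        tok = next(tokens, None)
        if tok is None or tok[0] != kind or (value is not None and tok[1] != value):
            raise ValueError(f"not a CID pattern (derailed at step {step})")
        if step == 2:
            cid = tok[1]
    return cid

toks = iter([("string", "/"), ("string", "bafyexamplecid"), ("map_close", None)])
print(recognize_cid(toks))  # bafyexamplecid
```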

go-ipld-prime's dag-json implementation actually supports this; the code is somewhat exciting.


tl;dr: the dag-json spec document currently only describes the happy path for decoding and discovering encoded bytes. What it needs to do is describe the step-by-step logic which can actually perform this recognition, as well as what to do when the recognition derails, at each step of the process where it can derail.

In deciding how each possible branch should be handled -- backtracking and reinterpretation, or erroring -- it may be a good idea for the spec to consider what those choices imply in terms of data that becomes invalid dag-json as well as what those choices imply about the complexity (and efficiency) of dag-json parsers.
