| 1 | +# Compressed Serialization |
| 2 | + |
| 3 | +With the Chia hard fork at height 5'496'000, CLVM can be serialized in a more |
| 4 | +space-efficient form that refers back to previous sub-trees instead of |
| 5 | +duplicating them. These references are called "back references". |
| 6 | + |
| 7 | +## Format |
| 8 | + |
| 9 | +The original serialization format had 2 tokens. |
| 10 | + |
| 11 | +- `0xff` - a pair, followed by the left and right sub-trees. |
| 12 | +- Atom - values, in the form of an array of bytes. For more details, see [CLVM |
| 13 | + serialization](https://chialisp.com/clvm/#serialization). |
| 14 | + |
| 15 | +A back reference is introduced by `0xfe` followed by an atom. The atom refers |
| 16 | +back to an already decoded sub-tree. Its bits are interpreted just like an |
| 17 | +environment lookup in CLVM: the atom is read as a big-endian integer, and its |
| 18 | +bits are inspected one at a time, from least significant to most significant. |
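A minimal sketch of how a decoder might tell the three cases apart from the first byte of a token. The name `classify_token` and the constants are hypothetical, and decoding the atom itself is left out.

```python
PAIR = 0xFF      # a pair follows: left sub-tree, then right sub-tree
BACK_REF = 0xFE  # a back reference follows: an atom holding the path


def classify_token(first_byte: int) -> str:
    """Name the kind of token that starts with first_byte."""
    if first_byte == PAIR:
        return "pair"
    if first_byte == BACK_REF:
        return "back-reference"
    return "atom"  # any other byte starts an ordinary atom
```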
| 19 | + |
| 20 | +## Paths |
| 21 | + |
| 22 | +``` |
| 23 | + +----------+----------+----------+----------+ |
| 24 | +byte index: | byte 0 | byte 1 | byte 2 | byte 3 | |
| 25 | + +----------+----------+----------+----------+ |
| 26 | + bit index: | 76543210 | 76543210 | 76543210 | 76543210 | |
| 27 | + +----------+----------+----------+----------+ |
| 28 | + |
| 29 | +bit traversal direction: <- x |
| 30 | +``` |
| 31 | + |
| 32 | +A `0` bit means follow the left sub-tree while a `1` bit means follow the right |
| 33 | +sub-tree. The most significant 1-bit is the terminator, and means we should pick |
| 34 | +the node at the current location in the tree. |
| 35 | + |
| 36 | +For example, the reference `0b1011` means: |
| 37 | + |
| 38 | +- right |
| 39 | +- right |
| 40 | +- left |
| 41 | +- (terminator bit) |
| 42 | + |
| 43 | +It follows the path below: |
| 44 | + |
| 45 | +``` |
| 46 | + [*] |
| 47 | + / \ |
| 48 | + / \ |
| 49 | + / \ 1 |
| 50 | + / \ |
| 51 | + / \ |
| 52 | + / \ |
| 53 | + [ ] [*] |
| 54 | + / \ / \ 1 |
| 55 | + / \ / \ |
| 56 | + [ ] [ ] [ ] [*] |
| 57 | + / \ / \ / \ 0 / \ |
| 58 | + [ ] [ ] [ ] [ ] [ ] [ ] [*] [ ] |
| 59 | +``` |
| 60 | + |
| 61 | +How environment lookups work is also described in the |
| 62 | +[chialisp documentation](https://chialisp.com/clvm/#environment). |
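The lookup described in this section can be sketched in a few lines of Python. This is an illustration, not the library's code: pairs are modelled as 2-tuples `(left, right)`, anything else is treated as an atom, and the name `follow_path` is hypothetical. Rejecting a path with no 1-bit is a choice made for this sketch.

```python
def follow_path(tree, path: bytes):
    """Return the node of `tree` selected by the path atom `path`."""
    bits = int.from_bytes(path, "big")  # the atom is a big-endian integer
    if bits == 0:
        raise ValueError("path has no terminator bit")
    node = tree
    while bits > 1:  # the last remaining 1-bit is the terminator
        if not isinstance(node, tuple):
            raise ValueError("path descends into an atom")
        node = node[bits & 1]  # 0 selects the left child, 1 the right
        bits >>= 1
    return node


tree = ("a", ("b", ("c", "d")))
print(follow_path(tree, bytes([0b1011])))  # right, right, left -> c
```

A path of `1` consists of the terminator only, so it selects the root of the tree.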
| 63 | + |
| 64 | +## Parsing |
| 65 | + |
| 66 | +Back references refer into the "parse stack". This is a CLVM tree that's updated |
| 67 | +as we parse, so what a back reference refers to changes as we parse the |
| 68 | +serialized CLVM tree. To understand what the parse stack is, we first need to |
| 69 | +look at how CLVM is parsed. |
| 70 | + |
| 71 | +The parser has a stack of _operations_ and a stack of the parsed results (the |
| 72 | +parse stack). |
| 73 | + |
| 74 | +There are 2 operations that can be pushed onto the operations stack: |
| 75 | + |
| 76 | +- `Cons` - Construct a pair (cons box) |
| 77 | +- `Traverse` - parse a sub-tree |
| 78 | + |
| 79 | +As outlined in the [Format](#format) section, the original format has two tokens |
| 80 | +we can encounter when parsing: an atom or a pair (followed by the left and right |
| 81 | +sub-trees). |
| 82 | + |
| 83 | +We keep popping operations off of the op-stack until it's empty. We take the |
| 84 | +following actions depending on the operation: |
| 85 | + |
| 86 | +- `Traverse`, inspect the next byte of the input stream. If it's a pair (`0xff`) |
| 87 | + we push `Cons`, `Traverse`, `Traverse` onto the operations stack. If it's an |
| 88 | + atom, parse the atom and push it onto the parse stack. |
| 89 | + |
| 90 | +- `Cons`, pop two nodes from the parse stack (the right side first, then the |
| 91 | + left) and create a new pair from them. Push the resulting pair onto the parse stack. |
| 92 | + |
| 93 | +### Example |
| 94 | + |
| 95 | +To parse the tokens: `0xff` `1` `0xff` `2` `foobar`, the two stacks end up like |
| 96 | +this while parsing. The stacks grow to the right in this illustration. |
| 97 | + |
| 98 | +| step | op-stack | parse-stack | |
| 99 | +| ----------------- | ------------------------------ | ------------------------ | |
| 100 | +| 1, initial state | Traverse | | |
| 101 | +| 2, parse `0xff` | Cons, Traverse, Traverse | | |
| 102 | +| 3, parse `1` | Cons, Traverse | `1` | |
| 103 | +| 4, parse `0xff` | Cons, Cons, Traverse, Traverse | `1` | |
| 104 | +| 5, parse `2` | Cons, Cons, Traverse | `1`, `2` | |
| 105 | +| 6, parse `foobar` | Cons, Cons | `1`, `2`, `foobar` | |
| 106 | +| 7, pop2 and cons | Cons | `1`, (`2` . `foobar`) | |
| 107 | +| 8, pop2 and cons | | (`1` . (`2` . `foobar`)) | |
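The walkthrough above can be reproduced with a short sketch of the two-stack parser. It assumes the input has already been split into tokens, so atom decoding is out of scope. The names `parse` and `PAIR` are hypothetical, and pairs are modelled as Python tuples.

```python
PAIR = "0xff"  # stand-in for the pair marker byte


def parse(tokens):
    """Parse a pre-tokenized stream using an op-stack and a parse stack."""
    tokens = iter(tokens)
    ops = ["Traverse"]
    stack = []  # the parse stack; the top is the last element
    while ops:
        op = ops.pop()
        if op == "Traverse":
            tok = next(tokens)
            if tok == PAIR:
                ops += ["Cons", "Traverse", "Traverse"]
            else:
                stack.append(tok)  # an atom
        else:  # "Cons"
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))
    return stack.pop()


print(parse([PAIR, 1, PAIR, 2, "foobar"]))  # (1, (2, 'foobar'))
```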
| 108 | + |
| 109 | +## Parse stack |
| 110 | + |
| 111 | +When a back-reference token (`0xfe`) is encountered, the parse stack in its |
| 112 | +current state is used as the environment for the back-reference path to look up |
| 113 | +which node to place at this position in the resulting tree. |
| 114 | + |
| 115 | +The parse stack is itself a LISP list of items. The top of the stack is the head |
| 116 | +of the list. |
| 117 | + |
| 118 | +e.g. |
| 119 | + |
| 120 | +The stack `1`, `2`, `3` (listed from the top down) has the following LISP structure: |
| 121 | + |
| 122 | +``` |
| 123 | +(`1` . (`2` . (`3` . NIL))) |
| 124 | +``` |
| 125 | + |
| 126 | +A back reference to `3` would be: `0b1011` (right, right, left). |
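Because the stack is a list, the item `index` positions below the top is reached by `index` steps to the right followed by one step to the left. A minimal sketch of that arithmetic, with the hypothetical name `path_to_stack_item` and the path returned as an integer rather than an atom:

```python
def path_to_stack_item(index: int) -> int:
    """Path to the stack item `index` positions below the top (0 = top)."""
    rights = (1 << index) - 1       # `index` low 1-bits: that many right steps
    terminator = 1 << (index + 1)   # bit `index` stays 0: one left step
    return terminator | rights


print(bin(path_to_stack_item(0)))  # 0b10, the top of the stack
print(bin(path_to_stack_item(1)))  # 0b101, right then left
```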
| 127 | + |
| 128 | +### Example back-reference |
| 129 | + |
| 130 | +Consider the following LISP structure: ((`1` . `2`) . (`1` . `2`)). |
| 131 | +It can be serialized as `0xff` `0xff` `1` `2` `0xfe` `0b10`, where the path `0b10` (the integer `2`) selects the top item of the parse stack. |
| 132 | + |
| 133 | +The parsing steps would be as follows: |
| 134 | + |
| 135 | +| step | op-stack | parse-stack | |
| 136 | +| ------------------- | ---------------------------------------- | --------------------------- | |
| 137 | +| 1, initial state | Traverse | | |
| 138 | +| 2, parse `0xff` | Cons, Traverse, Traverse | | |
| 139 | +| 3, parse `0xff` | Cons, Traverse, Cons, Traverse, Traverse | | |
| 140 | +| 4, parse `1` | Cons, Traverse, Cons, Traverse | `1` | |
| 141 | +| 5, parse `2` | Cons, Traverse, Cons | `1`, `2` | |
| 142 | +| 6, pop2 and cons | Cons, Traverse | (`1` . `2`) | |
| 143 | +| 7, parse `0xfe` `2` | Cons | (`1` . `2`), (`1` . `2`) | |
| 144 | +| 8, pop2 and cons | | ((`1` . `2`) . (`1` . `2`)) | |
| 145 | + |
| 146 | +### Referencing the stack itself |
| 147 | + |
| 148 | +Back references aren't limited to referencing items on the stack. They can |
| 149 | +reference any node of the parse stack, including the list that forms the stack |
| 150 | +itself. For example, consider parsing the following structure: |
| 151 | + |
| 152 | +`0xff` `foobar` `0xff` `foobar` NIL |
| 153 | + |
| 154 | +| step | op-stack | parse-stack | |
| 155 | +| ----------------- | ------------------------ | ----------- | |
| 156 | +| 1, initial state | Traverse | | |
| 157 | +| 2, parse `0xff` | Cons, Traverse, Traverse | | |
| 158 | +| 3, parse `foobar` | Cons, Traverse | `foobar` | |
| 159 | + |
| 160 | +At this point, rather than parsing the next `0xff` pair, we could have a back |
| 161 | +reference (`0xfe`) with a path pointing to the root of the parse stack. In LISP |
| 162 | +form, the parse stack is (`foobar` . NIL), a list with one item. The rest of the |
| 163 | +CLVM tree is just the second `foobar` followed by the list terminator, so it can |
| 164 | +be replaced with the parse stack itself. That is, we can use a back reference |
| 165 | +with the path `1`. We then get the NIL and the cons box "for free", since they |
| 166 | +are implied by the parse stack. |
| 167 | + |
| 168 | +In this scenario, the rest of the parsing steps are: |
| 169 | + |
| 170 | +| step | op-stack | parse-stack | |
| 171 | +| ------------------- | -------- | ----------------------------- | |
| 172 | +| 4, parse `0xfe` `1` | Cons | `foobar`, (`foobar` . NIL) | |
| 173 | +| 5, pop2 and cons | | (`foobar` . (`foobar` . NIL)) | |
| 174 | + |
| 175 | +In practice, however, this rarely happens. |
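Both back-reference examples in this section can be checked with a sketch of the two-stack parser extended to handle `0xfe`. As simplifications, the input is pre-tokenized, the path is given as an integer rather than an atom, NIL is modelled as `None`, and pairs are tuples. All names here are hypothetical.

```python
PAIR, BACK_REF = "0xff", "0xfe"  # stand-ins for the two marker bytes
NIL = None                        # stand-in for the empty atom


def stack_as_list(stack):
    """Build the LISP list for the stack; the top (last item) is the head."""
    lst = NIL
    for item in stack:  # bottom first, so the top ends up outermost
        lst = (item, lst)
    return lst


def follow(node, path: int):
    """Walk `path` from least significant bit up to the terminator bit."""
    while path > 1:
        node = node[path & 1]  # 0 = left, 1 = right
        path >>= 1
    return node


def parse(tokens):
    tokens = iter(tokens)
    ops, stack = ["Traverse"], []
    while ops:
        if ops.pop() == "Cons":
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
            continue
        tok = next(tokens)
        if tok == PAIR:
            ops += ["Cons", "Traverse", "Traverse"]
        elif tok == BACK_REF:
            path = next(tokens)
            # the environment is the parse stack as it is right now
            stack.append(follow(stack_as_list(stack), path))
        else:
            stack.append(tok)
    return stack.pop()


print(parse([PAIR, PAIR, 1, 2, BACK_REF, 0b10]))  # ((1, 2), (1, 2))
print(parse([PAIR, "foobar", BACK_REF, 1]))       # ('foobar', ('foobar', None))
```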
| 176 | + |
| 177 | +## Generating back references |
| 178 | + |
| 179 | +When serializing with compression, we need to assign a tree-hash and an |
| 180 | +(uncompressed) serialized length to every node. When deciding whether to output |
| 181 | +the sub-tree itself or a back-reference, we need to know whether we have already |
| 182 | +serialized an identical sub-tree. If we have, we search from that node up |
| 183 | +through all of its parents until we reach the top of the parse stack. |
| 184 | +This requires a data structure that knows about the parents of all nodes. |
| 185 | + |
| 186 | +This search is performed in `find_path()`. There may be multiple paths leading |
| 187 | +to the stack (if the same structure is repeated in multiple places). We pick the |
| 188 | +_shortest_ path. This path may still be quite long, if the stack is deep or if |
| 189 | +the node is found deep down in a CLVM structure. We need to compare the length |
| 190 | +of the path against the serialized length of the sub-tree. If the path is longer, |
| 191 | +it would be a net loss to replace the sub-tree with a back reference. |
| 192 | + |
| 193 | +During serialization, we need to track what the parse-stack will look like when |
| 194 | +deserializing, since this is part of the structure we need to search through |
| 195 | +when finding paths to previous sub-trees. |
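The bookkeeping described in this section might look like the sketch below. It is not the code behind `find_path()`; the function names are hypothetical. It assumes the usual CLVM conventions: a tree hash of `sha256(0x01 + atom)` for atoms and `sha256(0x02 + left + right)` for pairs, and the atom size prefixes of the uncompressed format. It also assumes a back reference costs one marker byte plus the serialized path atom.

```python
import hashlib


def atom_serialized_len(atom: bytes) -> int:
    """Bytes needed to serialize one atom in the uncompressed format."""
    n = len(atom)
    if n == 0 or (n == 1 and atom[0] < 0x80):
        return 1  # nil (0x80) or a small single byte stored as itself
    for prefix_len, limit in enumerate((0x40, 0x2000, 0x100000, 0x8000000), 1):
        if n < limit:
            return prefix_len + n
    return 5 + n


def annotate(node):
    """Return (tree_hash, serialized_length) for an atom (bytes) or pair (tuple)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"\x01" + node).digest(), atom_serialized_len(node)
    left_hash, left_len = annotate(node[0])
    right_hash, right_len = annotate(node[1])
    tree_hash = hashlib.sha256(b"\x02" + left_hash + right_hash).digest()
    return tree_hash, 1 + left_len + right_len  # 1 byte for the 0xff marker


def back_ref_is_smaller(path: bytes, subtree_len: int) -> bool:
    """True if 0xfe plus the path atom is shorter than the sub-tree itself."""
    return 1 + atom_serialized_len(path) < subtree_len
```

Two sub-trees with the same tree hash are identical, so the hash can serve as a lookup key for sub-trees that have already been serialized. The recursion is kept for brevity; a deep tree would need an iterative version.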