Commit 2eee1cf

Merge pull request #527 from Chia-Network/compression-docs
document how CLVM compression works and its format
2 parents 4a2ce96 + f2e5e47 commit 2eee1cf

1 file changed

+195
-0
lines changed

docs/compressed-serialization.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# Compressed Serialization

With the Chia hard fork at height 5'496'000, CLVM can be serialized in a more
space-efficient form, referring back to previous sub-trees instead of
duplicating them. These references are called "back references".

## Format

The original serialization format had 2 tokens.

- `0xff` - a pair, followed by the left and right sub-trees.
- Atom - values, in the form of an array of bytes. For more details, see [CLVM
  serialization](https://chialisp.com/clvm/#serialization).

The compressed format adds a third token: a back reference, introduced by
`0xfe` and followed by an atom. The atom refers back to an already decoded
sub-tree. The bits are interpreted just like an environment lookup in CLVM.
With the atom read as a big-endian integer, the bits are inspected one at a
time, from the least significant to the most significant.

## Paths

```
                         +----------+----------+----------+----------+
             byte index: |  byte 0  |  byte 1  |  byte 2  |  byte 3  |
                         +----------+----------+----------+----------+
              bit index: | 76543210 | 76543210 | 76543210 | 76543210 |
                         +----------+----------+----------+----------+

bit traversal direction:                                        <- x
```

A `0` bit means follow the left sub-tree while a `1` bit means follow the right
sub-tree. The last (most significant) 1-bit is the terminator, and means we
should pick the node at the current location in the tree.

For example, the reference `0b1011` means:

- right
- right
- left
- (terminator bit)

It follows the path below:

```
              [*]
             /   \
            /     \
           /       \ 1
          /         \
         /           \
        /             \
      [ ]             [*]
      / \             / \ 1
     /   \           /   \
  [ ]     [ ]     [ ]     [*]
  / \     / \     / \   0 / \
[ ] [ ] [ ] [ ] [ ] [ ] [*] [ ]
```

How environment lookups work is also described in the
[chialisp documentation](https://chialisp.com/clvm/#environment).

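The path rules above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the function name `follow_path` and the modelling of pairs as 2-tuples and atoms as `bytes` are assumptions made for this example. The sketch also rejects an all-zero path, which a real implementation may handle differently.

```python
def follow_path(tree, path: bytes):
    """Return the node of `tree` selected by a path atom.

    Assumption of this sketch: pairs are 2-tuples, atoms are bytes.
    """
    # Read the atom as a big-endian unsigned integer, so the least
    # significant bit is the first step of the path.
    bits = int.from_bytes(path, "big")
    if bits == 0:
        raise ValueError("path has no terminator bit")
    node = tree
    while bits != 1:  # the last remaining 1-bit is the terminator
        if not isinstance(node, tuple):
            raise ValueError("path descends into an atom")
        node = node[1] if bits & 1 else node[0]  # 1 = right, 0 = left
        bits >>= 1
    return node


# A small tree: ((a . b) . (c . (d . e)))
tree = ((b"a", b"b"), (b"c", (b"d", b"e")))

# 0b1011 reads as right, right, left, terminator.
selected = follow_path(tree, bytes([0b1011]))  # selects the atom b"d"
```

A path of `0b1` consists of the terminator only, so it selects the root of the tree.
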
## Parsing

Back references refer into the "parse stack". This is a CLVM tree that's updated
as we parse, so what a back reference refers to changes as we parse the
serialized CLVM tree. To understand what the parse stack is, we first need to
look at how CLVM is parsed.

The parser has a stack of _operations_ and a stack of the parsed results (the
parse stack).

There are 2 operations that can be pushed onto the operations stack:

- `Cons` - Construct a pair (cons box)
- `Traverse` - parse a sub-tree

As outlined in the [Format](#format) section, there are two tokens we can
encounter when parsing the original format: an atom or a pair (followed by the
left and right sub-trees).

We keep popping operations off the op-stack until it's empty. We take the
following actions depending on the operation:

- `Traverse`: inspect the next byte of the input stream. If it's a pair (`0xff`)
  we push `Cons`, `Traverse`, `Traverse` onto the operations stack. If it's an
  atom, parse the atom and push it onto the parse stack.

- `Cons`: pop two nodes from the parse stack and create a new pair from them.
  The node popped first (the top of the stack) becomes the right side and the
  node popped second becomes the left side. Push the resulting pair onto the
  parse stack.

### Example

To parse the tokens: `0xff` `1` `0xff` `2` `foobar`, the two stacks end up like
this while parsing. The stacks grow to the right in this illustration.

| step              | op-stack                       | parse-stack              |
| ----------------- | ------------------------------ | ------------------------ |
| 1, initial state  | Traverse                       |                          |
| 2, parse `0xff`   | Cons, Traverse, Traverse       |                          |
| 3, parse `1`      | Cons, Traverse                 | `1`                      |
| 4, parse `0xff`   | Cons, Cons, Traverse, Traverse | `1`                      |
| 5, parse `2`      | Cons, Cons, Traverse           | `1`, `2`                 |
| 6, parse `foobar` | Cons, Cons                     | `1`, `2`, `foobar`       |
| 7, pop2 and cons  | Cons                           | `1`, (`2` . `foobar`)    |
| 8, pop2 and cons  |                                | (`1` . (`2` . `foobar`)) |

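The two-stack parser for the original (uncompressed) format can be sketched as below. This is a minimal sketch under stated assumptions, not the actual implementation: the names `deserialize`, `parse_atom` and `read_byte` are made up for this example, pairs are modelled as 2-tuples, and atom parsing is simplified to atoms of at most 63 bytes (single bytes below `0x80`, or a one-byte size prefix in the range `0x80` to `0xbf`). Longer size prefixes are left out.

```python
import io


def read_byte(stream) -> int:
    b = stream.read(1)
    if not b:
        raise ValueError("unexpected end of input")
    return b[0]


def parse_atom(first: int, stream) -> bytes:
    """Parse an atom whose first byte is `first` (simplified, see lead-in)."""
    if first < 0x80:
        return bytes([first])  # the byte is the atom itself
    if first <= 0xBF:
        size = first & 0x3F  # 6-bit length; 0x80 is the empty atom (NIL)
        data = stream.read(size)
        if len(data) != size:
            raise ValueError("unexpected end of input")
        return data
    raise NotImplementedError("longer size prefixes are omitted from this sketch")


def deserialize(blob: bytes):
    stream = io.BytesIO(blob)
    ops = ["traverse"]  # operations stack, top is the end of the list
    values = []         # parse stack, top is the end of the list
    while ops:
        op = ops.pop()
        if op == "traverse":
            b = read_byte(stream)
            if b == 0xFF:
                # Cons is pushed first, so it runs after both sub-trees.
                ops += ["cons", "traverse", "traverse"]
            else:
                values.append(parse_atom(b, stream))
        else:  # "cons"
            right = values.pop()
            left = values.pop()
            values.append((left, right))
    return values.pop()


# The tokens from the table: 0xff 1 0xff 2 foobar
blob = bytes([0xFF, 0x01, 0xFF, 0x02, 0x86]) + b"foobar"
tree = deserialize(blob)
```

Here `0x86` is the size prefix for the six-byte atom `foobar`, and `tree` ends up as the nested pair from step 8 of the table.
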
## Parse stack

When a back-reference token (`0xfe`) is encountered, the parse stack in its
current state is used as the environment for the back-reference path to look up
what node to place at this position in the resulting tree.

The parse stack is itself a LISP list of items. The top of the stack is the head
of the list.

For example, the stack `1`, `2`, `3`, with `1` on top, would have the following
LISP structure:

```
(`1` . (`2` . (`3` . NIL)))
```

A back reference to `3` would be: `0b1011` (right, right, left).

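To make the list structure concrete, here is a small sketch that models the parse stack as nested pairs and computes the path to an item at a given depth. The helper names (`push`, `path_to_stack_item`, `follow_path`) are hypothetical and exist only for this illustration. Pairs are 2-tuples and NIL is the empty atom `b""`.

```python
NIL = b""  # the empty atom terminates a LISP list


def push(stack, item):
    """Push onto the parse stack: the new item becomes the head of the list."""
    return (item, stack)


def path_to_stack_item(depth: int) -> int:
    """Path, as an integer, to the item `depth` positions below the top.

    `depth` right-steps (1-bits) walk down the list, one left-step (0-bit)
    selects the item, and a final 1-bit terminates the path.
    """
    return (1 << (depth + 1)) | ((1 << depth) - 1)


def follow_path(tree, bits: int):
    """Follow a path from its least significant bit up to the terminator."""
    node = tree
    while bits > 1:
        node = node[1] if bits & 1 else node[0]
        bits >>= 1
    return node


stack = NIL
for item in (b"3", b"2", b"1"):  # push `3` first so that `1` ends up on top
    stack = push(stack, item)
# stack is now (1 . (2 . (3 . NIL)))

top = follow_path(stack, path_to_stack_item(0))     # 0b10   -> b"1"
bottom = follow_path(stack, path_to_stack_item(2))  # 0b1011 -> b"3"
```

Every additional item between the top of the stack and the target adds one bit to the path, which is why references to items deep in the stack cost more bytes.
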
### Example back-reference

Consider the following LISP structure: ((`1` . `2`) . (`1` . `2`)).
It can be serialized as `0xff` `0xff` `1` `2` `0xfe` `0b10`.

The parsing steps would be as follows:

| step                | op-stack                                 | parse-stack                 |
| ------------------- | ---------------------------------------- | --------------------------- |
| 1, initial state    | Traverse                                 |                             |
| 2, parse `0xff`     | Cons, Traverse, Traverse                 |                             |
| 3, parse `0xff`     | Cons, Traverse, Cons, Traverse, Traverse |                             |
| 4, parse `1`        | Cons, Traverse, Cons, Traverse           | `1`                         |
| 5, parse `2`        | Cons, Traverse, Cons                     | `1`, `2`                    |
| 6, pop2 and cons    | Cons, Traverse                           | (`1` . `2`)                 |
| 7, parse `0xfe` `2` | Cons                                     | (`1` . `2`), (`1` . `2`)    |
| 8, pop2 and cons    |                                          | ((`1` . `2`) . (`1` . `2`)) |

### Referencing the stack itself

Back references aren't limited to referencing the items on the stack; they can
reference any node in the parse stack's tree, including the list structure of
the stack itself. For example, consider parsing the following structure:

`0xff` `foobar` `0xff` `foobar` NIL

| step              | op-stack                 | parse-stack |
| ----------------- | ------------------------ | ----------- |
| 1, initial state  | Traverse                 |             |
| 2, parse `0xff`   | Cons, Traverse, Traverse |             |
| 3, parse `foobar` | Cons, Traverse           | `foobar`    |

At this point, rather than parsing the next `0xff` pair, we could have a back
reference (`0xfe`) with a path pointing to the root of the parse stack. In LISP
form, the parse stack will be (`foobar` . NIL) - a list with one item. The rest
of the CLVM tree is just the second `foobar` followed by the list terminator, so
it can be replaced with the parse stack itself, i.e. we can use a back-reference
of `1`. We then get the NIL and the cons box "for free"; they are implied by the
parse stack.

In this scenario, the rest of the parsing steps are:

| step                | op-stack | parse-stack                   |
| ------------------- | -------- | ----------------------------- |
| 4, parse `0xfe` `1` | Cons     | `foobar`, (`foobar` . NIL)    |
| 5, pop2 and cons    |          | (`foobar` . (`foobar` . NIL)) |

In practice, however, this rarely happens.

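Both back-reference examples can be reproduced with a small deserializer that keeps the parse stack as a LISP list, so a path can be followed through it directly. As before, this is a sketch under assumptions rather than the real code: the function names are hypothetical, pairs are 2-tuples, NIL is `b""`, and atom parsing only covers atoms of up to 63 bytes.

```python
import io

NIL = b""


def read_byte(stream) -> int:
    b = stream.read(1)
    if not b:
        raise ValueError("unexpected end of input")
    return b[0]


def parse_atom(first: int, stream) -> bytes:
    if first < 0x80:
        return bytes([first])
    if first <= 0xBF:  # one-byte size prefix, 6-bit length
        size = first & 0x3F
        data = stream.read(size)
        if len(data) != size:
            raise ValueError("unexpected end of input")
        return data
    raise NotImplementedError("longer size prefixes are omitted from this sketch")


def follow_path(tree, path: bytes):
    bits = int.from_bytes(path, "big")
    if bits == 0:
        raise ValueError("path has no terminator bit")
    node = tree
    while bits != 1:
        if not isinstance(node, tuple):
            raise ValueError("path descends into an atom")
        node = node[1] if bits & 1 else node[0]
        bits >>= 1
    return node


def deserialize_backrefs(blob: bytes):
    stream = io.BytesIO(blob)
    ops = ["traverse"]
    stack = NIL  # parse stack as a LISP list; the head is the top
    while ops:
        op = ops.pop()
        if op == "traverse":
            b = read_byte(stream)
            if b == 0xFF:
                ops += ["cons", "traverse", "traverse"]
            elif b == 0xFE:
                # Look the path up in the parse stack as it is right now.
                path = parse_atom(read_byte(stream), stream)
                stack = (follow_path(stack, path), stack)
            else:
                stack = (parse_atom(b, stream), stack)
        else:  # "cons": the top of the stack becomes the right side
            right, stack = stack
            left, stack = stack
            stack = ((left, right), stack)
    return stack[0]


# ((1 . 2) . (1 . 2)) serialized as 0xff 0xff 1 2 0xfe 0b10
repeated = deserialize_backrefs(bytes([0xFF, 0xFF, 0x01, 0x02, 0xFE, 0b10]))

# (foobar . (foobar . NIL)) serialized as 0xff foobar 0xfe 1
listed = deserialize_backrefs(bytes([0xFF, 0x86]) + b"foobar" + bytes([0xFE, 0x01]))
```

In the second call the path `1` selects the whole parse stack, which at that moment is the one-item list (`foobar` . NIL).
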
## Generating back references

When serializing with compression, we need to assign a tree-hash and an
(uncompressed) serialized length to every node. When deciding whether to output
the sub-tree itself or a back-reference, we need to know whether we have already
serialized an identical sub-tree. If we have, we then have to perform a search
from that node up through all of its parents until we reach the top of the parse
stack. This requires a data structure that knows about the parents of all nodes.

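As a rough illustration of the per-node bookkeeping, the sketch below computes a tree hash and an uncompressed serialized length for every node of a small tree. It assumes the usual CLVM tree-hash convention (SHA-256 over a `0x01` prefix plus the atom bytes, or over a `0x02` prefix plus the two child hashes). The function names, the tuple and `bytes` tree model, and the restriction to atoms of at most 63 bytes are assumptions of this example.

```python
import hashlib


def atom_serialized_length(atom: bytes) -> int:
    if len(atom) == 0:
        return 1  # NIL is the single byte 0x80
    if len(atom) == 1 and atom[0] < 0x80:
        return 1  # small single-byte atoms are stored as-is
    if len(atom) < 0x40:
        return 1 + len(atom)  # one-byte size prefix
    raise NotImplementedError("longer size prefixes are omitted from this sketch")


def annotate(node):
    """Return (tree_hash, serialized_length) for `node`, computed recursively."""
    if isinstance(node, tuple):
        left_hash, left_len = annotate(node[0])
        right_hash, right_len = annotate(node[1])
        tree_hash = hashlib.sha256(b"\x02" + left_hash + right_hash).digest()
        return tree_hash, 1 + left_len + right_len  # 1 byte for the 0xff token
    tree_hash = hashlib.sha256(b"\x01" + node).digest()
    return tree_hash, atom_serialized_length(node)


# ((1 . 2) . (1 . 2)): both halves are identical sub-trees
tree = ((b"\x01", b"\x02"), (b"\x01", b"\x02"))
left_hash, left_len = annotate(tree[0])
right_hash, right_len = annotate(tree[1])
root_hash, root_len = annotate(tree)
```

Identical sub-trees get identical hashes, which is what allows a serializer to notice that the right half has already been emitted. A real serializer would cache these values rather than recompute them for every lookup.
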
This search is performed in `find_path()`. There may be multiple paths leading
to the stack (if the same structure is repeated in multiple places). We pick the
_shortest_ path. This path may still be quite long, if the stack is deep or if
the node is found deep down in a CLVM structure. We need to compare the length
of the path against the serialized length of the sub-tree. If the path is longer,
it would be a net loss to replace it with a back reference.

During serialization, we need to track what the parse stack will look like when
deserializing, since this is part of the structure we need to search through
when finding paths to previous sub-trees.
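
The size comparison can be sketched as follows. This is not how `find_path()` itself works; it only illustrates encoding a list of left/right steps as a path atom and checking whether the resulting back reference is smaller than the sub-tree it would replace. The helper names and the exact comparison (counting the `0xfe` token plus the serialized path atom) are assumptions of this example.

```python
def encode_path(steps) -> bytes:
    """Encode steps (0 = left, 1 = right, first step first) as a path atom."""
    bits = 1  # start with the terminator bit
    for step in reversed(steps):
        bits = (bits << 1) | step  # earlier steps end up in lower bits
    return bits.to_bytes((bits.bit_length() + 7) // 8, "big")


def atom_serialized_length(atom: bytes) -> int:
    if len(atom) == 0 or (len(atom) == 1 and atom[0] < 0x80):
        return 1
    if len(atom) < 0x40:
        return 1 + len(atom)  # one-byte size prefix
    raise NotImplementedError("longer size prefixes are omitted from this sketch")


def should_use_back_reference(path: bytes, subtree_length: int) -> bool:
    """True if `0xfe` plus the path atom is shorter than the sub-tree itself."""
    backref_length = 1 + atom_serialized_length(path)
    return backref_length < subtree_length


# The path right, right, left from the Paths section
path = encode_path([1, 1, 0])  # the single byte 0b1011

# Replacing the 3-byte sub-tree (1 . 2) with `0xfe` `0b10` saves one byte.
worth_it = should_use_back_reference(encode_path([0]), 3)

# A single-byte atom is never worth replacing.
not_worth_it = should_use_back_reference(encode_path([0]), 1)
```

Because each step adds a bit to the path, a reference deep into a large structure can need several bytes, which is why the check is made per candidate.
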
