This repository has been archived by the owner on Mar 24, 2023. It is now read-only.

Merkle tree misc fixes #47

Merged: 21 commits, Jul 2, 2020
140 changes: 85 additions & 55 deletions specs/data_structures.md
@@ -21,10 +21,11 @@ Data Structures
- [Public-Key Cryptography](#public-key-cryptography)
- [Merkle Trees](#merkle-trees)
- [Binary Merkle Tree](#binary-merkle-tree)
- [Annotated Merkle Tree](#annotated-merkle-tree)
- [Verifying Annotated Merkle Proofs](#verifying-annotated-merkle-proofs)
- [Binary Merkle Tree Proofs](#binary-merkle-tree-proofs)
- [Namespace Merkle Tree](#namespace-merkle-tree)
- [Namespace Merkle Tree Proofs](#namespace-merkle-tree-proofs)
- [Sparse Merkle Tree](#sparse-merkle-tree)
- [Sparse Merkle Tree Proofs](#sparse-merkle-tree-proofs)
- [Erasure Coding](#erasure-coding)
- [Reed-Solomon Erasure Coding](#reed-solomon-erasure-coding)
- [2D Reed-Solomon Encoding Scheme](#2d-reed-solomon-encoding-scheme)
@@ -228,100 +229,129 @@ Merkle trees are used to authenticate various pieces of data across the LazyLedg

## Binary Merkle Tree

Binary Merkle trees are constructed in the usual fashion, with leaves being hashed once to get leaf node values and internal node values being the hash of the concatenation of their children. The specific mechanism for hashing leaves for leaf nodes and children for internal nodes may be different (see: [annotated Merkle trees](#annotated-merkle-tree)), but for plain binary Merkle trees are the same.
Binary Merkle trees are constructed in the same fashion as described in [Certificate Transparency (RFC-6962)](https://tools.ietf.org/html/rfc6962). Leaves are hashed once to get leaf node values and internal node values are the hash of the concatenation of their children (either leaf nodes or other internal nodes).
Member

It would be helpful to specify how it's the same as RFC 6962. Specifically, that the tree is unbalanced when the number of leaves is not a power of 2. The next sentence describes properties of all Merkle trees.


For leaf node of leaf message `m`, its value `v` is:
```C++
v = h(serialize(m))
```

An exception is made in the case of empty leaves: the value of a leaf node with an empty leaf is the 32-byte zero, i.e. `0x0000000000000000000000000000000000000000000000000000000000000000`. This is used rather than duplicating the last node if there are an odd number of nodes (the [Bitcoin design](https://github.com/bitcoin/bitcoin/blob/5961b23898ee7c0af2626c46d5d70e80136578d3/src/consensus/merkle.cpp#L9-L43)) to avoid the complexities in that design, which resulted in e.g. [CVE-2012-2459](https://nvd.nist.gov/vuln/detail/CVE-2012-2459). By construction, trees are implicitly padded with empty leaves up to the smallest enclosing power of 2.
Nodes contain a single field:
| name | type | description |
| ---- | ------------------------- | ----------- |
| `v` | [HashDigest](#hashdigest) | Node value. |

For internal node with children `l` and `r`, its value `v` is:
The base case (an empty tree) is defined as zero:
```C++
v = h(l.v, r.v)
node.v = 0x0000000000000000000000000000000000000000000000000000000000000000
```

## Annotated Merkle Tree

Merkle trees can be augmented as generic annotated Merkle trees, where additional fields can be contained in each node. One of the early annotated Merkle trees is the [Merkle Sum Tree](https://bitcointalk.org/index.php?topic=845978.0), which allows for compact fraud proofs to be made of fees collected in a block.

Annotated Merkle trees have extra fields and methods to compute values for those fields, i.e. `f_1, ..., f_n, v` for `n` fields (note that if `n=0`, the annotated Merkle tree is a plain [binary Merkle tree](#binary-merkle-tree)). The value of field `f_i` is computed with the method `m_i_i(height, left_child_field, right_child_field)` for internal nodes and `m_i_l(message)` for leaf nodes.

For leaf node of leaf message `m`, its value `v` and fields `f_1, ..., f_n` are:
For leaf node `node` of leaf data `d`:
```C++
f_1 = m_1_l(m)
...
f_n = m_n_l(m)
v = h(serialize(m))
node.v = h(0x00, serialize(d))
```

For internal node at height `height` with children `l` and `r`, its value `v` and fields `f_1, ..., f_n` are:
For internal node `node` with children `l` and `r`:
```C++
f_1 = m_1_i(height, l.f_1, r.f_1)
...
f_n = m_n_i(height, l.f_n, r.f_n)
v = h(l.f_1, ..., l.f_n, l.v, r.f_1, ..., r.f_n, r.v)
node.v = h(0x01, serialize(l), serialize(r))
```

If a compact Merkle root is needed, the root level (which consists of root fields and a root value) can be hashed once.
Note that rather than duplicating the last node if there are an odd number of nodes (the [Bitcoin design](https://github.com/bitcoin/bitcoin/blob/5961b23898ee7c0af2626c46d5d70e80136578d3/src/consensus/merkle.cpp#L9-L43)), trees are allowed to be imbalanced. In other words, the height of each leaf may be different. For an example, see Section 2.1.3 of [Certificate Transparency (RFC-6962)](https://tools.ietf.org/html/rfc6962).
Member

Ah nvm, it's specified here.


As an example of annotation, when hashing leaves, `0x00` can be prepended, and when hashing internal nodes, `0x01` can be prepended (i.e. `m_1_l() = 0x00` and `m_1_i() = 0x01`). This avoids a second-preimage attack [where internal nodes are presented as leaves](https://en.wikipedia.org/wiki/Merkle_tree#Second_preimage_attack) for incomplete trees.
Leaves and internal nodes are hashed differently: the one-byte `0x00` is prepended for leaf nodes while `0x01` is prepended for internal nodes. This avoids a second-preimage attack [where internal nodes are presented as leaves](https://en.wikipedia.org/wiki/Merkle_tree#Second_preimage_attack) in trees with leaves at different heights.
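
The construction described above can be sketched as follows. This is a minimal illustration rather than a reference implementation. It assumes that `h` is SHA-256 over the concatenation of its arguments, that leaves are passed in already serialized, and that a node serializes to its single field `v`. The names `merkle_root` and `split_point` are hypothetical helpers introduced here.

```python
import hashlib
from typing import List

# Root of an empty tree: the 32-byte zero defined as the base case.
# (RFC 6962 itself uses the hash of the empty string instead.)
ZERO = bytes(32)


def h(*parts: bytes) -> bytes:
    # Assumption: h is SHA-256 over the concatenation of its arguments.
    return hashlib.sha256(b"".join(parts)).digest()


def split_point(n: int) -> int:
    # Largest power of two strictly less than n (for n >= 2), as in RFC 6962.
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: List[bytes]) -> bytes:
    # leaves are assumed to hold already-serialized leaf data.
    if not leaves:
        return ZERO
    if len(leaves) == 1:
        return h(b"\x00", leaves[0])  # leaf prefix
    k = split_point(len(leaves))
    # The left subtree holds k leaves and the right subtree holds the rest,
    # so leaves may end up at different heights.
    return h(b"\x01", merkle_root(leaves[:k]), merkle_root(leaves[k:]))
```

With three leaves, the third leaf sits one level higher than the first two, which is the imbalance the prefix bytes guard against.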

### Verifying Annotated Merkle Proofs
### Binary Merkle Tree Proofs

In addition to the root, leaf, index, and sibling values of a Merkle proof for a plain [binary Merkle tree](#binary-merkle-tree), Merkle proofs for annotated Merkle trees have the sibling field values. Proofs are verified by using the appropriate methods to compute field values.
| name | type | description |
| ---------- | ----------------------------- | ----------------------------- |
| `root` | [HashDigest](#hashdigest) | Merkle root. |
| `key` | `byte[32]` | Key (i.e. index) of the leaf. |
| `siblings` | [HashDigest](#hashdigest)`[]` | Sibling hash values. |
| `leaf` | `byte[]` | Leaf value. |
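
As a rough sketch of how such a proof might be checked, the code below follows the inclusion-proof verification procedure from the Certificate Transparency RFCs. Several things here are assumptions rather than part of the table above: `h` is taken to be SHA-256, `index` is the integer value of `key` (the byte order is not specified here), `siblings` are ordered from the leaf upwards, and the verifier knows the total leaf count `num_leaves`, which is not one of the listed proof fields. `audit_path` is a hypothetical helper used only to produce proofs for the example.

```python
import hashlib
from typing import List


def h(*parts: bytes) -> bytes:
    # Assumption: h is SHA-256 over the concatenation of its arguments.
    return hashlib.sha256(b"".join(parts)).digest()


def split_point(n: int) -> int:
    # Largest power of two strictly less than n (for n >= 2).
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: List[bytes]) -> bytes:
    if not leaves:
        return bytes(32)
    if len(leaves) == 1:
        return h(b"\x00", leaves[0])
    k = split_point(len(leaves))
    return h(b"\x01", merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def audit_path(index: int, leaves: List[bytes]) -> List[bytes]:
    # Sibling values for leaves[index], ordered from the leaf upwards.
    if len(leaves) <= 1:
        return []
    k = split_point(len(leaves))
    if index < k:
        return audit_path(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return audit_path(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def verify(root: bytes, index: int, num_leaves: int,
           siblings: List[bytes], leaf: bytes) -> bool:
    # num_leaves is an assumed extra input, see the note above.
    if index >= num_leaves:
        return False
    fn, sn = index, num_leaves - 1
    r = h(b"\x00", leaf)
    for p in siblings:
        if sn == 0:
            return False
        if fn & 1 or fn == sn:
            # Current node is a right child: the sibling goes on the left.
            r = h(b"\x01", p, r)
            if not fn & 1:
                # Skip levels where this node has no sibling.
                while fn != 0 and not fn & 1:
                    fn >>= 1
                    sn >>= 1
        else:
            r = h(b"\x01", r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root
```

A proof is accepted only if the sibling list is consumed exactly and the recomputed value matches `root`.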

## Namespace Merkle Tree

[Messages](#message) in LazyLedger are associated with a provided _namespace ID_, which identifies the application (or applications) that will read these messages when parsing blocks. The Namespace Merkle Tree (NMT) is a variation of the [Merkle Interval Tree](https://eprint.iacr.org/2018/642).
[Shares](#share) in LazyLedger are associated with a provided _namespace ID_. The Namespace Merkle Tree (NMT) is a variation of the [Merkle Interval Tree](https://eprint.iacr.org/2018/642), which is itself an extension of the [Merkle Sum Tree](https://bitcointalk.org/index.php?topic=845978.0). It allows for compact proofs around the inclusion or exclusion of shares with particular namespace IDs.

The NMT is an annotated Merkle tree with two additional fields and methods that indicate the range of namespace IDs in each node's subtree.
Nodes contain three fields:
| name | type | description |
| ------- | ---------------------------- | ------------------------------------------------ |
| `n_min` | [NamespaceID](#type-aliases) | Min namespace ID in subtree rooted at this node. |
| `n_max` | [NamespaceID](#type-aliases) | Max namespace ID in subtree rooted at this node. |
| `v` | [HashDigest](#hashdigest) | Node value. |

For leaf node of message `m`:
The base case (an empty tree) is defined as:
```C++
n_min = m_1_l(m) = m.namespaceID
n_max = m_2_l(m) = m.namespaceID
v = h(serialize(m))
node.n_min = 0x0000000000000000000000000000000000000000000000000000000000000000
node.n_max = 0x0000000000000000000000000000000000000000000000000000000000000000
node.v = 0x0000000000000000000000000000000000000000000000000000000000000000
```

The `namespaceID` message field here is the namespace ID of the message, which is a [`NAMESPACE_ID_BYTES`](consensus.md#system-parameters)-long byte array.
For leaf node `node` of data `d`:
```C++
node.n_min = d.namespaceID
node.n_max = d.namespaceID
node.v = h(0x00, serialize(d))
```

Member

@liamsi liamsi Jul 2, 2020

One last nitpick: You are implicitly assuming that data d is a structure as well, namely struct{namespaceID, raw_data}. That last line is still ambiguous as to what gets hashed in the end: h(0x00, d.raw_data) or h(0x00, d.namespaceID||d.raw_data) (or whatever the result of serialize(d) is).

Note: as far as I understand, both h(0x00, d.raw_data) and h(0x00, d.namespaceID||d.raw_data) correctly define an NMT, but I think we should be explicit here.

Member Author

Technically, the serialization of shares is defined as a special case: https://github.com/lazyledger/lazyledger-specs/blob/adlerjohn-merkle_tree_fixes/specs/data_structures.md#share-serialization.

Shares are canonically serialized using only the raw share data, i.e. serialize(share) = serialize(share.rawData).

Member

@liamsi liamsi Jul 2, 2020

Ah, I missed that. But that doesn't address the fact that you implicitly assume that d has to have a structure (namely two fields). I guess the important part is covered in the spec, though (how the shares end up in the tree ...).


Before being hashed, the [messages](#message) are [serialized](#serialization).
The `namespaceID` message field here is the namespace ID of the leaf, which is a [`NAMESPACE_ID_BYTES`](consensus.md#system-parameters)-long byte array.

For internal node with children `l` and `r`:
For internal node `node` with children `l` and `r`:
```C++
n_min = m_1_i(height, l, r) = min(l.n_min, r.n_min)
n_max = m_2_i(height, l, r) = max(l.n_max, r.n_max)
v = h(l, r) = h(l.n_min, l.n_max, l.v, r.n_min, r.n_max, r.v)
node.n_min = min(l.n_min, r.n_min)
node.n_max = max(l.n_max, r.n_max)
node.v = h(l, r) = h(0x01, serialize(l), serialize(r))
```

## Sparse Merkle Tree
A root hash can be computed by taking the [hash](#hashing) of the [serialized](#serialization) root node.
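
The node rules above could be sketched like this. It is an illustration under stated assumptions, not the specified encoding: `h` is taken to be SHA-256, a node is assumed to serialize as the concatenation `n_min || n_max || v`, namespace IDs are compared as equal-length byte strings, and the leaf hash covers only the raw share data (as discussed in the review thread). The 8-byte namespace IDs in the usage example are placeholders, since the real length is the `NAMESPACE_ID_BYTES` system parameter.

```python
import hashlib
from typing import NamedTuple


class Node(NamedTuple):
    n_min: bytes  # min namespace ID in the subtree
    n_max: bytes  # max namespace ID in the subtree
    v: bytes      # node value


def h(*parts: bytes) -> bytes:
    # Assumption: h is SHA-256 over the concatenation of its arguments.
    return hashlib.sha256(b"".join(parts)).digest()


def serialize(node: Node) -> bytes:
    # Assumption: fields are concatenated in declaration order.
    return node.n_min + node.n_max + node.v


def leaf_node(namespace_id: bytes, raw_data: bytes) -> Node:
    # Both bounds of a leaf are its own namespace ID.
    return Node(namespace_id, namespace_id, h(b"\x00", raw_data))


def internal_node(l: Node, r: Node) -> Node:
    # Equal-length byte strings compare in big-endian numeric order.
    return Node(
        min(l.n_min, r.n_min),
        max(l.n_max, r.n_max),
        h(b"\x01", serialize(l), serialize(r)),
    )


def root_hash(root: Node) -> bytes:
    # Compact root: hash of the serialized root node.
    return h(serialize(root))
```

For example, combining a leaf in namespace `...01` with a leaf in namespace `...03` yields a parent whose range is `[...01, ...03]`, so a verifier can tell from the parent alone that namespace `...05` does not occur below it.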

Sparse Merkle Trees (SMTs) are _sparse_, i.e. they contain mostly empty leaves. They can be used as key-value stores for arbitrary data, as each leaf is keyed by its index in the tree. Storage efficiency is achieved through clever use of implicit defaults, avoiding the need to store empty leaves.
### Namespace Merkle Tree Proofs

Default values are given to leaf nodes with empty leaves. While this is sufficient to pre-compute the values of intermediate nodes that are roots of empty subtrees, a further simplification is to extend this default value to all nodes that are roots of empty subtrees. The 32-byte zero, i.e. `0x0000000000000000000000000000000000000000000000000000000000000000`, is used as the default value.
| name | type | description |
| -------------------- | -------------------------------- | ----------------------------- |
| `rootHash` | [HashDigest](#hashdigest) | Root hash. |
| `rootNamespaceIDMin` | [NamespaceID](#type-aliases) | Root minimum namespace ID. |
| `rootNamespaceIDMax` | [NamespaceID](#type-aliases) | Root maximum namespace ID. |
| `key` | `byte[32]` | Key (i.e. index) of the leaf. |
| `siblingValues` | [HashDigest](#hashdigest)`[]` | Sibling hash values. |
| `siblingMins` | [NamespaceID](#type-aliases)`[]` | Sibling min namespace IDs. |
| `siblingMaxes` | [NamespaceID](#type-aliases)`[]` | Sibling max namespace IDs. |
| `leaf` | `byte[]` | Leaf value. |

SMTs can further be extended with _compact_ proofs. [Merkle proofs](#verifying-annotated-merkle-proofs) are composed, among other things, of a list of sibling node values. We note that, since nodes that are roots of empty subtrees have known values (the default value), these values do not need to be provided explicitly; it is sufficient to simply identify which siblings in the Merkle branch are roots of empty subtrees, which can be done with one bit per sibling.
When verifying an NMT proof, the root hash is checked by reconstructing the root node `root_node` with the computed `root_node.v` (computed as with a [plain Merkle proof](#binary-merkle-tree-proofs)) and the provided `rootNamespaceIDMin` and `rootNamespaceIDMax` as the `root_node.n_min` and `root_node.n_max`, respectively.

For a Merkle branch of height `h`, an `h`-bit value is appended to the proof. The lowest bit corresponds to the sibling of the leaf node, and each higher bit corresponds to the next parent. A value of `1` indicates that the next value in the list of values provided explicitly in the proof should be used, and a value of `0` indicates that the default value should be used.
## Sparse Merkle Tree

Sparse Merkle Trees (SMTs) are _sparse_, i.e. they contain mostly empty leaves. They can be used as key-value stores for arbitrary data, as each leaf is keyed by its index in the tree. Storage efficiency is achieved through clever use of implicit defaults, avoiding the need to store empty leaves.

Finally, the number of hashing operations can be reduced to be logarithmic in the number of non-empty leaves on average. An internal node that is the root of a subtree that contains exactly one non-empty leaf is replaced by that leaf's leaf node.
Additional rules are added on top of plain [binary Merkle trees](#binary-merkle-tree):
1. Default values are given to leaf nodes with empty leaves.
1. While the above rule is sufficient to pre-compute the values of intermediate nodes that are roots of empty subtrees, a further simplification is to extend this default value to all nodes that are roots of empty subtrees. The 32-byte zero, i.e. `0x0000000000000000000000000000000000000000000000000000000000000000`, is used as the default value. This rule takes precedence over the above one.
1. The number of hashing operations can be reduced to be logarithmic in the number of non-empty leaves on average, assuming a uniform distribution of non-empty leaf keys. An internal node that is the root of a subtree that contains exactly one non-empty leaf is replaced by that leaf's leaf node.

This creates an imbalanced tree with leaf nodes at different heights, so leaves and nodes must be hashed differently to avoid a second-preimage attack [where internal nodes are presented as leaf nodes](https://en.wikipedia.org/wiki/Merkle_tree#Second_preimage_attack). When hashing leaves, the `uint8` value `0x00` is prepended to the leaf value, and when hashing nodes, `0x01` is prepended to the hash value.
Nodes contain a single field:
| name | type | description |
| ---- | ------------------------- | ----------- |
| `v` | [HashDigest](#hashdigest) | Node value. |

Additionally, the key of leaf nodes must be prepended, since the index of a leaf node that is not at the base of the tree cannot be determined without this information.
The base case (an empty tree) is defined as the default value:
```C++
node.v = 0x0000000000000000000000000000000000000000000000000000000000000000
```

For leaf node of leaf message `m` with key `k`, its value `v` is:
For leaf node `node` of leaf data `d` with key `k`:
```C++
v = h(`0x00`, k, serialize(m))
node.v = h(0x00, k, serialize(d))
```

For internal node with children `l` and `r`, its value `v` is:
The key of leaf nodes must be prepended, since the index of a leaf node that is not at the base of the tree cannot be determined without this information.

For internal node `node` with children `l` and `r`:
```C++
v = h(`0x01`, l.v, r.v)
node.v = h(0x01, serialize(l), serialize(r))
```

### Sparse Merkle Tree Proofs

SMTs can further be extended with _compact_ proofs. [Merkle proofs](#binary-merkle-tree-proofs) are composed, among other things, of a list of sibling node values. We note that, since nodes that are roots of empty subtrees have known values (the default value), these values do not need to be provided explicitly; it is sufficient to simply identify which siblings in the Merkle branch are roots of empty subtrees, which can be done with one bit per sibling.

For a Merkle branch of height `h`, an `h`-bit value is appended to the proof. The lowest bit corresponds to the sibling of the leaf node, and each higher bit corresponds to the next parent. A value of `1` indicates that the next value in the list of values provided explicitly in the proof should be used, and a value of `0` indicates that the default value should be used.
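
The bitfield rule above can be illustrated with a short sketch that expands a compact proof back into a full sibling list. The function name `expand_siblings` and the representation of the `h`-bit value as a Python integer are assumptions made for this example only.

```python
from typing import List

# Default value for roots of empty subtrees: the 32-byte zero.
DEFAULT_VALUE = bytes(32)


def expand_siblings(bitmap: int, height: int, explicit: List[bytes]) -> List[bytes]:
    # Returns one sibling per level, starting with the sibling of the leaf node.
    siblings = []
    used = 0
    for level in range(height):
        if (bitmap >> level) & 1:
            # Bit set: take the next explicitly provided value.
            if used == len(explicit):
                raise ValueError("bitmap references more siblings than provided")
            siblings.append(explicit[used])
            used += 1
        else:
            # Bit clear: the sibling is the root of an empty subtree.
            siblings.append(DEFAULT_VALUE)
    if used != len(explicit):
        raise ValueError("unused explicit sibling values")
    return siblings
```

For a branch of height 3 with bitfield `0b101`, the first and third siblings come from the explicit list and the second is the default value.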

A proof into an SMT is structured as:

| name | type | description |