Verifiable Off-chain Data Aggregation #512
-
For a 32 GiB sector, a 2 MiB allocated index would allow supporting data segments of (32 GiB / 32k ≈) ~1 MiB. That seems already pretty fine-grained at relatively low overhead.
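For reference, the arithmetic behind this, assuming the 64-byte (double-node) index entries described in the spec below: $2\,\mathrm{MiB} / 64\,\mathrm{B} = 32768$ entries, $32\,\mathrm{GiB} / 32768 = 1\,\mathrm{MiB}$ per segment, and $2\,\mathrm{MiB} / 32\,\mathrm{GiB} \approx 0.006\%$ space overhead.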
-
How would this interact, or not, with deals for a full 32 GiB sector? Would you expect that deals for data sized to the sector just don't use this index? If such deals did use the index, how would taking a few nodes at the extreme RHS of the merkle tree affect what could go in the rest of it? The next lower power-of-two piece size would waste half the sector, but using all the available space would seem like a worst case for non-alignment.
-
Re type descriptors
Beyond this observed practice, the world of multiformats, IPLD etc. all thematically emphasise the advantages of self-describing data, be they hashes, addresses etc. While there are a number of examples where we fall short (e.g. a HAMT root doesn't describe its bitwidth), I suggest we should lean into this principle and encourage piece data to carry its own metadata inline, rather than provide this side channel that will only be available in some contexts.
-
Does an FRC need to be accepted? I see this FRC is already being referred to as a spec?
-
There has been a prevailing indexing issue with PODSI pieces. I tried to debug it in depth and cover the PODSI state along with that. Here are my findings:
-
Motivation
A large majority of users onboard data onto the Filecoin network via an aggregator. Today the work done by aggregators is unverifiable and unprovable.
The user relies on the aggregator to perform the work correctly, and at the same time it is impossible to prove to a third party that a given piece of data was included in a deal, which is highly requested functionality for the FVM.
Requirements
Specification
The proposal is split into three parts, each addressing a part of the requirements.
Data Segments without out-of-band data
Today a sector carries a set of deals, with information about where each deal is situated stored out-of-band, within Filecoin’s consensus layer.
The Data Segment Format aims to allow this data to be stored in-band, within the deal or sector data itself. The mechanism is independent of whether, in the future, Filecoin enables fully off-chain deals, where the deal maker and storage provider come to an agreement without involving Filecoin’s consensus, or on-chain deal-making continues; it can be applied at either the sector or the deal level.
The primary goal is to inform the Storage Provider what data is stored where within the deal or sector in a verifiable way, without having to utilize the limited chain space.
Format
A data segment is a sequence of data with two guaranteed properties: its inclusion within the container can be verified, and it can be trivially found in the container by the Storage Provider to facilitate data retrieval.
It is important to highlight that on-chain deals, as realised today, fulfil the definition of a data segment but use the chain to provide the second property.
The information required from the data segment index is: a commitment to the data segment and its offset within the container. Though not strictly required, we decided to store the length of the data segment in the index as well.
The data segment index is a constant-size area at the end of a container containing an array of fixed-size entries corresponding to the information of data segments. The data segment index is a data structure that is specifically designed to work well with the existing commitment scheme; it will have to evolve when a new commitment scheme is adopted.
Each entry consists of:
- a commitment to the data segment,
- the offset of the segment within the container,
- the size of the segment,
- a truncated hash serving as a checksum.
The commitment, offset and size facilitate the discovery of the data segment by the Storage Provider and are used by the Client or a third party to verify that the data segment of interest was properly indexed.
The truncated hash of that data is used for the discovery of valid entries within the data segment index. As the creation of the index is controlled by a third party, collision and preimage resistance are not significant here. A lighter checksum function could be used in place of a cryptographic hash.
Each entry fits into 504 bits, which is twice the node size in the commitment used by PoRep v1.1. Each entry is aligned to a double-node boundary to facilitate proving that a given data segment descriptor was included in the data segment index.
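A minimal sketch in Go of what such an entry could look like, assuming the fields listed above and a 64-byte (double-node) serialized layout; the field widths, endianness, and checksum construction here are illustrative assumptions, not the normative format:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// SegmentDescriptor is a hypothetical in-memory form of one index entry,
// carrying the fields described above: commitment, offset, size, checksum.
type SegmentDescriptor struct {
	CommDs   [32]byte // commitment to the data segment
	Offset   uint64   // byte offset of the segment within the container
	Size     uint64   // byte length of the segment
	Checksum [16]byte // truncated hash over the preceding fields
}

// Serialize packs the entry into the two 32-byte nodes it is aligned to.
func (d *SegmentDescriptor) Serialize() [64]byte {
	var out [64]byte
	copy(out[:32], d.CommDs[:])
	binary.LittleEndian.PutUint64(out[32:40], d.Offset)
	binary.LittleEndian.PutUint64(out[40:48], d.Size)
	copy(out[48:], d.Checksum[:])
	return out
}

// ComputeChecksum fills the checksum field with a truncated SHA-256 of the
// rest of the entry; a lighter function could be substituted, as discussed
// in the Design Rationale.
func (d *SegmentDescriptor) ComputeChecksum() {
	d.Checksum = [16]byte{} // hash with the checksum bytes zeroed
	buf := d.Serialize()
	sum := sha256.Sum256(buf[:])
	copy(d.Checksum[:], sum[:16])
}

func main() {
	d := SegmentDescriptor{Offset: 128 << 10, Size: 1 << 20}
	d.ComputeChecksum()
	fmt.Printf("entry: %x\n", d.Serialize())
}
```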
Proof of Data Segment Inclusion
The proof consists of two inclusion proofs:
- an inclusion proof of the client’s data commitment within the container commitment, and
- an inclusion proof of the corresponding data segment descriptor within the data segment index.
The client possesses the following information: the commitment to their data $\mathrm{CommD}_C$ and the size of their data $|\mathrm{D}_C|$.
The aggregator inserts the client’s data into the sector or deal, and also adds the data segment descriptor to the data segment index.
Following that, the aggregator produces the data commitment and two inclusion proofs and provides them to the client:
The function $f(\mathrm{idx}, \mathrm{size})$ is a pre-agreed function mapping a position within the data segment index to a position within the container.
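As a sketch of one plausible shape for such a function, assuming (per the Format section) that the index is a constant-size area at the end of the container with fixed 64-byte entries; the actual pre-agreed mapping is not specified here:

```go
package main

import "fmt"

// entryOffset is a hypothetical shape for f(idx, size): it maps an entry's
// position within the data segment index to a byte position within the
// container, assuming the index occupies the final indexSize bytes.
func entryOffset(idx, containerSize, indexSize uint64) uint64 {
	const entrySize = 64 // two 32-byte nodes, per the entry alignment above
	return containerSize - indexSize + idx*entrySize
}

func main() {
	// For a 32 GiB container with a 2 MiB index, entry 3 would sit here:
	fmt.Println(entryOffset(3, 32<<30, 2<<20))
}
```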
The auxiliary information provided by the aggregator to the client is: the ActorID of the SP which accepted the deal, and the sector number.
To complete the verification of the proof, the client has to verify the two inclusion proofs, as well as check that the given sector was onboarded and that $\mathrm{CommD}_A$ was of size $|\mathrm{D}_A|$.
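A minimal sketch of the client-side check, using a generic binary Merkle inclusion-proof verifier; the hash function and node layout here are stand-in assumptions, not the commitment tree's actual ones. The client would run `Verify` twice: once for $\mathrm{CommD}_C$ against $\mathrm{CommD}_A$, and once for the descriptor's nodes at the position given by $f(\mathrm{idx}, \mathrm{size})$, before confirming on-chain that the sector carrying $\mathrm{CommD}_A$ was onboarded.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// InclusionProof is a generic binary Merkle path: the leaf's position and
// the sibling node at each layer, ordered from leaf to root.
type InclusionProof struct {
	Index    uint64
	Siblings [][32]byte
}

// hashNode combines two child nodes; SHA-256 is an illustrative stand-in
// for the commitment tree's actual node hash.
func hashNode(l, r [32]byte) [32]byte {
	h := sha256.New()
	h.Write(l[:])
	h.Write(r[:])
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// Verify recomputes the path from leaf to root and compares the result
// against the expected root commitment.
func (p *InclusionProof) Verify(root, leaf [32]byte) bool {
	cur, idx := leaf, p.Index
	for _, sib := range p.Siblings {
		if idx&1 == 0 {
			cur = hashNode(cur, sib) // current node is a left child here
		} else {
			cur = hashNode(sib, cur) // current node is a right child here
		}
		idx >>= 1
	}
	return bytes.Equal(cur[:], root[:])
}

func main() {
	// Tiny demo on a two-leaf tree.
	var a, b [32]byte
	a[0], b[0] = 1, 2
	root := hashNode(a, b)
	proof := InclusionProof{Index: 0, Siblings: [][32]byte{b}}
	fmt.Println(proof.Verify(root, a)) // true
}
```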
Batched Merkle Tree inclusion proofs
Merkle tree inclusion proofs are the primary way we prove that given data is where it is claimed to be. This can change in the future, but Merkle tree proofs will stay around for other purposes and cross-chain capabilities.
The size of a Merkle inclusion proof for a 2-ary tree is generally:
sizeOfNode * (depthOfTree + 1)
For the purpose of data inclusion we work with nodes of 32 bytes and trees 30 layers deep, resulting in a proof size of 992 bytes. With no batching scheme, the proof size scales linearly with the number of provided proofs, but proofs into the same tree root share common data. This can provide significant savings when proofs are created for nearby portions of the tree; the saving can reach 90%, which is expected for the data segment index area.
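To illustrate where the savings come from, here is a hypothetical worked example under the parameters above. Proving 32 leaves individually costs $32 \times 992\,\mathrm{B} \approx 31\,\mathrm{KiB}$. If the 32 leaves are consecutive and aligned, they form a complete depth-5 subtree, and a batched proof needs only the 32 leaves themselves plus the $30 - 5 = 25$ sibling nodes above that subtree: $(32 + 25) \times 32\,\mathrm{B} = 1824\,\mathrm{B}$, roughly a 94% saving, consistent with the figure above for the contiguous data segment index area.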
Design Rationale
The design was guided by the three requirements.
An inclusion proof of the data itself is the simplest way to achieve verifiable deal aggregation. While this guarantees data inclusion within a sector, it does not provide any guarantees about the Storage Provider’s ability to find that data.
A possibly simpler way than the proposed approach to achieve the second requirement is a specially designed CAR structure and padding, rendering a user-provided CAR readable as part of a larger, deal- or sector-wide, CAR. This has the drawback that the CAR structure is fragile, and the discoverability of a user’s data within the larger CAR is a function of all the data preceding it. That data could be adversarial in nature, making the data from the user’s CAR unretrievable.
This is what steers the design towards the proposed approach: the data segment index provides discoverability of user data within the container and is specifically designed to lend itself to producing proofs. There are still open questions in this design.
An entry in the data segment index could include a type descriptor of the data contained within the data segment. This would necessitate the creation of a registry of data segment types to allow Storage Providers to interpret the types within the index. An alternative is not to include a type descriptor and to rely on type detection for identification. This approach is widely used in computing; the only operating system relying on external type descriptors is Windows, through file extensions, and while extensions are used to choose which application should open a given file type, applications frequently conduct their own type detection.
Another still-uncertain choice is the checksum function for the data segment index entries. While the proposal currently mentions a cryptographic hash function, none of its cryptographic properties are used; a faster-to-compute checksum function or non-cryptographic hash function could be used instead, as long as it provides universality and uniformity. If such a checksum function were to be chosen, its computation cost within Filecoin’s WASM and FEVM runtimes should be evaluated and compared with the hash functions available as syscalls and precompiles.
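As a sketch of the comparison, a truncated SHA-256 (the style currently suggested) next to a non-cryptographic alternative, FNV-1a, named here purely as an illustration; neither choice is normative, and the relative WASM/FEVM costs are exactly the open question:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash/fnv"
)

// checksumSHA is the currently suggested style: a cryptographic hash
// truncated to the width of the entry's checksum field.
func checksumSHA(entry []byte) [16]byte {
	full := sha256.Sum256(entry)
	var out [16]byte
	copy(out[:], full[:16])
	return out
}

// checksumFNV is an illustrative lighter alternative; whether it is uniform
// enough for index-entry discovery is part of the open question above.
func checksumFNV(entry []byte) uint64 {
	h := fnv.New64a()
	h.Write(entry)
	return h.Sum64()
}

func main() {
	entry := make([]byte, 48) // commitment + offset + size, checksum zeroed
	fmt.Printf("sha-256/128: %x\nfnv-1a/64:   %x\n", checksumSHA(entry), checksumFNV(entry))
}
```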