Verifiable Off-chain Data Aggregation #512
-
For a 32 GiB sector, a 2 MiB allocated index would allow supporting data segments of (32 GiB / 32k ≈) ~1 MiB. That seems already pretty fine-grained at relatively low overhead.
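For reference, the arithmetic behind this, assuming the 64-byte (double-node) index entries described in the spec below: $2\,\mathrm{MiB} / 64\,\mathrm{B} = 32768$ entries, $32\,\mathrm{GiB} / 32768 = 1\,\mathrm{MiB}$ per segment, and $2\,\mathrm{MiB} / 32\,\mathrm{GiB} \approx 0.006\%$ space overhead.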
-
How would this interact, or not, with deals for a full 32 GiB sector? Would you expect that deals for data sized to the sector just don't use this index? If such deals did use the index, how would taking a few nodes at the extreme RHS of the merkle tree affect what could go in the rest of it? The next lower power-of-two piece size would waste half the sector, but using all the available space would seem like a worst case for non-alignment.
-
Re type descriptors
Beyond this observed practice, the world of multiformats, IPLD etc. all thematically emphasise the advantages of self-describing data, be they hashes, addresses etc. While there are a number of examples where we fall short (e.g. a HAMT root doesn't describe its bitwidth), I suggest we should lean into this principle and encourage piece data to carry its own metadata inline, rather than provide this side channel that will only be available in some contexts.
-
Does an FRC need to be accepted? I see this FRC is already being referred to as a spec?
-
There has been a prevailing indexing issue with PODSI pieces. I tried to debug it in depth and cover the PODSI state along with that. Here are my findings:
-
Motivation
A large majority of users onboard data onto the Filecoin network via an aggregator. Today the work done by aggregators is unverifiable and unprovable.
The user relies on the aggregator to perform the work correctly, and at the same time it is impossible to prove to a third party that a given piece of data was included in a deal, which is highly requested functionality for the FVM.
Requirements
Specification
The proposal is split into three parts, each addressing a part of the requirements.
Data Segments without out-of-band data
Today a sector carries a set of deals, with information about where each deal is situated stored out-of-band, within Filecoin’s consensus layer.
The Data Segment Format aims to allow this data to be stored in-band, within the deal or sector data itself. The mechanism is independent of whether, in the future, Filecoin enables fully off-chain deals, where the deal maker and storage provider come to an agreement without involving Filecoin’s consensus, or on-chain deal-making continues; it can be applied at either the sector or the deal level.
The primary goal is to inform the Storage Provider what data is stored where within the deal or sector in a verifiable way, without having to utilize the limited chain space.
Format
A data segment is a sequence of data with two guaranteed properties: its inclusion within the container can be verified, and it can be trivially found in the container by the Storage Provider to facilitate data retrieval.
It is important to highlight that on-chain deals, as realised today, fulfil the definition of a data segment but use the chain to provide the second property.
The information required from the data segment index is: a commitment to the data segment and its offset within the container. Though not strictly required, we decided to store the length of the data segment in the index as well.
The data segment index is a constant-size area at the end of a container containing an array of fixed-size entries corresponding to the information of data segments. The data segment index is a data structure that is specifically designed to work well with the existing commitment scheme; it will have to evolve when a new commitment scheme is adopted.
Each entry consists of:
- a commitment to the data segment,
- the offset of the segment within the container,
- the size of the segment,
- a truncated hash serving as a checksum.
The commitment, offset and size facilitate the discovery of the data segment by the Storage Provider and are used by the Client or a third party to verify that the data segment of interest was properly indexed.
The truncated hash of that data is used for the discovery of valid entries within the data segment index. As the creation of the index is controlled by a third party, collision and preimage resistance are not significant here. A lighter checksum function could be used in place of a cryptographic hash.
Each entry fits into 504 bits, which is twice the node size in the commitment used by PoRep v1.1. Each entry is aligned to a double-node boundary to facilitate proving that a given data segment descriptor was included in the data segment index.
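A minimal sketch in Go of what such an entry could look like, assuming the fields listed above and a 64-byte (double-node) serialized layout; the field widths, endianness, and checksum construction here are illustrative assumptions, not the normative format:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// SegmentDescriptor is a hypothetical in-memory form of one index entry,
// carrying the fields described above: commitment, offset, size, checksum.
type SegmentDescriptor struct {
	CommDs   [32]byte // commitment to the data segment
	Offset   uint64   // byte offset of the segment within the container
	Size     uint64   // byte length of the segment
	Checksum [16]byte // truncated hash over the preceding fields
}

// Serialize packs the entry into the two 32-byte nodes it is aligned to.
func (d *SegmentDescriptor) Serialize() [64]byte {
	var out [64]byte
	copy(out[:32], d.CommDs[:])
	binary.LittleEndian.PutUint64(out[32:40], d.Offset)
	binary.LittleEndian.PutUint64(out[40:48], d.Size)
	copy(out[48:], d.Checksum[:])
	return out
}

// ComputeChecksum fills the checksum field with a truncated SHA-256 of the
// rest of the entry; a lighter function could be substituted, as discussed
// in the Design Rationale.
func (d *SegmentDescriptor) ComputeChecksum() {
	d.Checksum = [16]byte{} // hash with the checksum bytes zeroed
	buf := d.Serialize()
	sum := sha256.Sum256(buf[:])
	copy(d.Checksum[:], sum[:16])
}

func main() {
	d := SegmentDescriptor{Offset: 128 << 10, Size: 1 << 20}
	d.ComputeChecksum()
	fmt.Printf("entry: %x\n", d.Serialize())
}
```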
Proof of Data Segment Inclusion
The proof consists of two inclusion proofs:
- an inclusion proof of the client’s data commitment within the container commitment, and
- an inclusion proof of the corresponding data segment descriptor within the data segment index.
The client possesses the following information: the commitment to their data $\mathrm{CommD}_C$ and the size of their data $|\mathrm{D}_C|$.
The aggregator inserts the client’s data into the sector or deal, and also adds the data segment descriptor to the data segment index.
Following that, the aggregator produces the data commitment and two inclusion proofs and provides them to the client:
The function $f(\mathrm{idx}, \mathrm{size})$ is a pre-agreed function mapping a position within the data segment index to a position within the container.
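As a sketch of one plausible shape for such a function, assuming (per the Format section) that the index is a constant-size area at the end of the container with fixed 64-byte entries; the actual pre-agreed mapping is not specified here:

```go
package main

import "fmt"

// entryOffset is a hypothetical shape for f(idx, size): it maps an entry's
// position within the data segment index to a byte position within the
// container, assuming the index occupies the final indexSize bytes.
func entryOffset(idx, containerSize, indexSize uint64) uint64 {
	const entrySize = 64 // two 32-byte nodes, per the entry alignment above
	return containerSize - indexSize + idx*entrySize
}

func main() {
	// For a 32 GiB container with a 2 MiB index, entry 3 would sit here:
	fmt.Println(entryOffset(3, 32<<30, 2<<20))
}
```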
The auxiliary information provided by the aggregator to the client is: the ActorID of the SP which accepted the deal, and the sector number.
To complete the verification of the proof, the client has to verify the two inclusion proofs, as well as check that the given sector was onboarded and that $\mathrm{CommD}_A$ was of size $|\mathrm{D}_A|$.
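A minimal sketch of the client-side check, using a generic binary Merkle inclusion-proof verifier; the hash function and node layout here are stand-in assumptions, not the commitment tree's actual ones. The client would run `Verify` twice: once for $\mathrm{CommD}_C$ against $\mathrm{CommD}_A$, and once for the descriptor's nodes at the position given by $f(\mathrm{idx}, \mathrm{size})$, before confirming on-chain that the sector carrying $\mathrm{CommD}_A$ was onboarded.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// InclusionProof is a generic binary Merkle path: the leaf's position and
// the sibling node at each layer, ordered from leaf to root.
type InclusionProof struct {
	Index    uint64
	Siblings [][32]byte
}

// hashNode combines two child nodes; SHA-256 is an illustrative stand-in
// for the commitment tree's actual node hash.
func hashNode(l, r [32]byte) [32]byte {
	h := sha256.New()
	h.Write(l[:])
	h.Write(r[:])
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// Verify recomputes the path from leaf to root and compares the result
// against the expected root commitment.
func (p *InclusionProof) Verify(root, leaf [32]byte) bool {
	cur, idx := leaf, p.Index
	for _, sib := range p.Siblings {
		if idx&1 == 0 {
			cur = hashNode(cur, sib) // current node is a left child here
		} else {
			cur = hashNode(sib, cur) // current node is a right child here
		}
		idx >>= 1
	}
	return bytes.Equal(cur[:], root[:])
}

func main() {
	// Tiny demo on a two-leaf tree.
	var a, b [32]byte
	a[0], b[0] = 1, 2
	root := hashNode(a, b)
	proof := InclusionProof{Index: 0, Siblings: [][32]byte{b}}
	fmt.Println(proof.Verify(root, a)) // true
}
```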
Batched Merkle Tree inclusion proofs
Merkle tree inclusion proofs are the primary way we prove that given data is where it is claimed to be. This can change in the future, but Merkle tree proofs will stay around for other purposes and cross-chain capabilities.
The size of a Merkle inclusion proof for a 2-ary tree is generally:
sizeOfNode * (depthOfTree + 1)
For the purpose of data inclusion we work with nodes of 32 bytes and trees 30 layers deep, resulting in a proof size of 992 bytes. With no batching scheme, the proof size scales linearly with the number of provided proofs, but proofs into the same tree root share common data. This can provide significant savings when proofs are created for nearby portions of the tree; the saving can reach 90%, which is expected for the data segment index area.
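To illustrate where the savings come from, here is a hypothetical worked example under the parameters above. Proving 32 leaves individually costs $32 \times 992\,\mathrm{B} \approx 31\,\mathrm{KiB}$. If the 32 leaves are consecutive and aligned, they form a complete depth-5 subtree, and a batched proof needs only the 32 leaves themselves plus the $30 - 5 = 25$ sibling nodes above that subtree: $(32 + 25) \times 32\,\mathrm{B} = 1824\,\mathrm{B}$, roughly a 94% saving, consistent with the figure above for the contiguous data segment index area.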
Design Rationale
The design was guided by the three requirements.
An inclusion proof of the data itself is the simplest way to achieve verifiable deal aggregation. While this guarantees data inclusion within a sector, it does not provide any guarantees about the Storage Provider’s ability to find that data.
A possibly simpler way than the proposed approach to achieve the second requirement is a specially designed CAR structure and padding, rendering a user-provided CAR readable as part of a larger, deal- or sector-wide, CAR. This has the drawback that the CAR structure is fragile, and the discoverability of a user’s data within the larger CAR is a function of all the data preceding it. That data could be adversarial in nature, making the data from the user’s CAR unretrievable.
This is what steers the design towards the proposed approach: the data segment index provides discoverability of user data within the container and is specifically designed to lend itself to producing proofs. There are still open questions in this design.
An entry in the data segment index could include a type descriptor of the data contained within the data segment. This would necessitate the creation of a registry of data segment types to allow Storage Providers to interpret the types within the index. An alternative is not to include a type descriptor and to rely on type detection for identification. This approach is widely used in computing; the only operating system relying on external type descriptors is Windows, through file extensions, and while extensions are used to choose which application should open a given file type, applications frequently conduct their own type detection.
Another still-uncertain choice is the checksum function for the data segment index entries. While the proposal currently mentions a cryptographic hash function, none of its cryptographic properties are used; a faster-to-compute checksum function or non-cryptographic hash function could be used instead, as long as it provides universality and uniformity. If such a checksum function were to be chosen, its computation cost within Filecoin’s WASM and FEVM runtimes should be evaluated and compared with the hash functions available as syscalls and precompiles.
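As a sketch of the comparison, a truncated SHA-256 (the style currently suggested) next to a non-cryptographic alternative, FNV-1a, named here purely as an illustration; neither choice is normative, and the relative WASM/FEVM costs are exactly the open question:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash/fnv"
)

// checksumSHA is the currently suggested style: a cryptographic hash
// truncated to the width of the entry's checksum field.
func checksumSHA(entry []byte) [16]byte {
	full := sha256.Sum256(entry)
	var out [16]byte
	copy(out[:], full[:16])
	return out
}

// checksumFNV is an illustrative lighter alternative; whether it is uniform
// enough for index-entry discovery is part of the open question above.
func checksumFNV(entry []byte) uint64 {
	h := fnv.New64a()
	h.Write(entry)
	return h.Sum64()
}

func main() {
	entry := make([]byte, 48) // commitment + offset + size, checksum zeroed
	fmt.Printf("sha-256/128: %x\nfnv-1a/64:   %x\n", checksumSHA(entry), checksumFNV(entry))
}
```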