feat: store car alongside content on s3 to then move it in a deterministic way to ipfs #5

Merged
merged 8 commits into from
Jan 24, 2024
1 change: 0 additions & 1 deletion .env.example
@@ -4,7 +4,6 @@ JWT_SECRET="foo"
DB_CONNECTION_STRING="postgresql://user:[email protected]:5432/data-uploader"
W3UP_PRINCIPAL_KEY="foo"
W3UP_DELEGATION_PROOF="foo"
S3_ENDPOINT="http://foo.bar"
S3_BUCKET="foo"
S3_ACCESS_KEY_ID="foo"
S3_SECRET_ACCESS_KEY="foo"
9 changes: 9 additions & 0 deletions Dockerfile
@@ -46,5 +46,14 @@ ENV W3UP_PRINCIPAL_KEY=$W3UP_PRINCIPAL_KEY
ARG W3UP_DELEGATION_PROOF
ENV W3UP_DELEGATION_PROOF=$W3UP_DELEGATION_PROOF

ARG S3_BUCKET
ENV S3_BUCKET=$S3_BUCKET

ARG S3_ACCESS_KEY_ID
ENV S3_ACCESS_KEY_ID=$S3_ACCESS_KEY_ID

ARG S3_SECRET_ACCESS_KEY
ENV S3_SECRET_ACCESS_KEY=$S3_SECRET_ACCESS_KEY

EXPOSE $PORT
ENTRYPOINT ["node", "index.mjs"]
128 changes: 126 additions & 2 deletions README.md
@@ -19,8 +19,132 @@

# Carrot data uploader

This project implements a simple server that acts as a proxy to various storage
services. Its API can be accessed with a valid JWT.
This service is responsible for managing data in the Carrot protocol, which
primarily falls into two categories at the time of writing:

1. **Templates:** Represent Carrot templates, including Webpack federated React
components, CSS, and a `base.json` metadata file.
2. **Generic specifications:** Comprise JSON files providing information about
various entities, such as KPI token campaign specifications and DefiLlama
oracle specifications.

### Data States

Data in Carrot exists in two main states:

- **Limbo:** data in limbo doesn't yet need to be persisted but is a potential
candidate for persistence. It includes items like Carrot templates with active
deployment proposals and specifications for entities that have yet to be
created.
- **Persistent:** data in the persistent state is data that is referenced by
on-chain entities within the Carrot protocol. This data needs to be reliably
available at all times and for an extremely long period of time.

### On-Chain Data Reference

The on-chain reference mechanism previously mentioned and used to determine if
data should be persisted and removed from limbo is based on CIDs following the
`multiformats` CIDv1 specification. A given CID is considered referenced
on-chain when it's stored in the blockchain's state by a Carrot protocol entity.
At that point the data referenced by that CID needs to be persisted.
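
To make the CIDv1 layout concrete, here is a minimal sketch that builds a CIDv1
string for a single raw block using only Node's standard library. It assumes the
`raw` codec, a SHA-256 multihash and base32 multibase encoding; the function
names are hypothetical and this is not necessarily how the service derives its
CIDs (it builds a CAR and relies on dedicated libraries for that).

```typescript
import { createHash } from "node:crypto";

// RFC 4648 base32 alphabet in lowercase, as used by the multibase "b" prefix.
const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Encode bytes as unpadded lowercase base32.
function base32Encode(bytes: Uint8Array): string {
  let out = "";
  let buffer = 0; // bit accumulator
  let bits = 0; // number of bits in the accumulator not yet emitted
  for (let i = 0; i < bytes.length; i++) {
    buffer = (buffer << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((buffer >>> (bits - 5)) & 31);
      bits -= 5;
    }
    buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) {
    // Pad the last group with zero bits on the right.
    out += BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31);
  }
  return out;
}

// Build a CIDv1 string for one raw block hashed with SHA-256.
function rawCidV1(data: Uint8Array): string {
  const digest = createHash("sha256").update(data).digest();
  const cidBytes = new Uint8Array(4 + digest.length);
  cidBytes.set(
    [
      0x01, // CID version 1
      0x55, // multicodec: raw
      0x12, // multihash function: sha2-256
      0x20, // multihash digest length: 32 bytes
    ],
    0,
  );
  cidBytes.set(digest, 4);
  return "b" + base32Encode(cidBytes); // "b" is the multibase prefix for base32
}

const exampleCid = rawCidV1(new TextEncoder().encode('{"title":"example"}'));
console.log(exampleCid);
```

The same bytes always produce the same string, which is what makes a CID usable
as an immutable on-chain reference.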

### Storage Locations

Data in Carrot is mainly stored in two locations:

- **AWS S3 Bucket:** this is a solution for hot/warm storage of both limbo and
persistent data, served through a CloudFront distributed CDN for quick access.
The S3 bucket contains all non-expired limbo data (both raw data and IPFS CAR
data) plus all persisted data, and is indexed using CIDs for the data itself.

- **IPFS/Filecoin:** here we exclusively store persistent data that needs to be
extremely long lived and available in a decentralized way. Web3.storage is
utilized for IPFS data uploads and Filecoin persistence operations.

### API endpoints

1. **`/data/s3/json`:** this endpoint can be used to store JSON limbo data. The
API takes the raw input JSON, encodes it into the IPFS CAR format and
determines the raw data CID. Both the raw content and the CAR file are
uploaded to the S3 bucket using the CID as the base key (the raw content uses
the CID itself as the key, while the CAR is uploaded under `$CID/car`).

2. **`/data/ipfs`:** this endpoint persists limbo data and replicates it to
IPFS/Filecoin. The API accepts a single parameter `cid` which must refer to
some limbo data that the caller wants to persist to IPFS/Filecoin. The API
fetches the CAR associated with the passed CID (stored on the S3 bucket under
`$CID/car`) and stores the fetched CAR file on IPFS/Filecoin through
web3.storage's w3up service. The resulting upload CID is checked for
consistency and if everything is fine the raw data is also persisted on the
S3 bucket while the CAR is deleted from there.
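
The two endpoints can be sketched against an in-memory stand-in for the bucket.
Everything below is an illustrative assumption rather than the service's actual
code: the `Map` replaces S3, `placeholderCid` is a hex SHA-256 digest instead of
a real CID, `placeholderCar` skips real CAR encoding, and the `upload` callback
stands in for the w3up call (kept synchronous for brevity).

```typescript
import { createHash } from "node:crypto";

// In-memory stand-in for the S3 bucket (hypothetical).
const bucket = new Map<string, Uint8Array>();

// Placeholder identifier: a hex SHA-256 digest, NOT a real CID.
function placeholderCid(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Placeholder for CAR encoding: the real service wraps the data in an IPFS CAR.
function placeholderCar(data: Uint8Array): Uint8Array {
  return data.slice();
}

// Mirrors `/data/s3/json`: store raw content under `$CID` and the CAR under `$CID/car`.
function storeLimbo(json: unknown): string {
  const raw = new TextEncoder().encode(JSON.stringify(json));
  const cid = placeholderCid(raw);
  bucket.set(cid, raw);
  bucket.set(`${cid}/car`, placeholderCar(raw));
  return cid;
}

// Mirrors `/data/ipfs`: upload the stored CAR, check the resulting CID, drop the CAR.
function persist(cid: string, upload: (car: Uint8Array) => string): void {
  const car = bucket.get(`${cid}/car`);
  if (car === undefined) {
    throw new Error(`no limbo CAR found for ${cid}`);
  }
  const uploadedCid = upload(car);
  if (uploadedCid !== cid) {
    // Consistency check failed: leave the limbo data untouched.
    throw new Error(`CID mismatch: expected ${cid}, got ${uploadedCid}`);
  }
  // The raw content stays under `$CID`; how the real service marks it as
  // persistent is not modelled here.
  bucket.delete(`${cid}/car`);
}

const limboCid = storeLimbo({ title: "example specification" });
persist(limboCid, (car) => placeholderCid(car));
console.log(bucket.has(limboCid), bucket.has(`${limboCid}/car`));
```

Running the consistency check before deleting the CAR means a failed upload
leaves the limbo data intact and the call can be retried.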

### Benefits of this approach

This centralized approach where only this service manages Carrot data has a few
extremely important benefits.

#### Deterministic CIDs

IPFS can store data in different formats, and depending on the picked format,
the same starting data can result in different multihashes once uploaded to the
network, which in the end results in different CIDs. This is a problem for
Carrot because the on-chain CID references are immutable and we need some way to
guarantee that the on-chain CIDs reference some real data that is in fact stored
on IPFS.

Consider the following example:

1. A template author wants to add a template to Carrot to unlock some specific
functionality. He builds the template and ends up with the final template's
code, which he uploads to IPFS using a pinning service such as Pinata.
2. The output of step 1 is the template's code CID, which can be used to
create a proposal to add the template to Carrot on-chain. The proposal is
created.
3. After some time, the proposal is approved and the template is added to Carrot
on-chain. This results in the template code's CID being referenced on-chain,
which should make the data persistent in Carrot, as explained above.
4. The IPFS pinner daemon picks up this added reference and makes the template
code persistent on IPFS. In order to do that it downloads the template's code
from IPFS and uploads it to web3.storage through a dedicated library. This
library follows a different data encoding procedure, resulting in a different
multihash and CID at the end of the process. **So at this point the same
starting data has been added to IPFS in different ways, resulting in a CID
mismatch.**
5. After some time the author unpins the template's code from Pinata.

The end result? The template's code has been put in limbo and then persisted to
IPFS in 2 different ways, resulting in 2 different CIDs, and now the limbo data
is no more. We end up with a dangling CID: **the on-chain reference to the
template's code is referencing data that doesn't exist anywhere**.
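
The mismatch can be illustrated with a toy model. The sketch below hashes the
same bytes in two ways: once as a single block, and once as fixed-size chunks
whose digests are hashed together. This is only a stand-in for the real
difference between IPFS encodings (it is not UnixFS or any actual IPFS layout),
and all function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Encoding A: treat the content as one block and hash it directly.
function singleBlockDigest(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Encoding B: split the content into chunks, hash each chunk, then hash the
// concatenated chunk digests (a toy two-level Merkle structure).
function chunkedDigest(data: Uint8Array, chunkSize: number): string {
  const chunkDigests: Buffer[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const chunk = data.subarray(offset, offset + chunkSize);
    chunkDigests.push(createHash("sha256").update(chunk).digest());
  }
  return createHash("sha256").update(Buffer.concat(chunkDigests)).digest("hex");
}

const templateCode = new TextEncoder().encode("export const Template = () => null;");

console.log(singleBlockDigest(templateCode)); // identifier produced by tool A
console.log(chunkedDigest(templateCode, 8)); // identifier produced by tool B
console.log(chunkedDigest(templateCode, 16)); // same tool, different chunk size
```

The three digests differ even though the input bytes are identical, so an
identifier computed by one tool cannot be assumed to match the one computed by
another.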

The best solution to avoid this scenario is to handle both limbo data addition
and persistent data addition in the same place, and this place is the
`data-uploader` service. Adding data to limbo will cause the `data-uploader`
service to calculate this data's CID by creating an IPFS CAR containing the
data, and return this CID to the caller. **It's then the responsibility of the
caller to use that CID to reference the limbo data**. As long as the caller does
that, we have an extremely strong guarantee that when the data is persisted,
it is persisted under the same original CID. This is because the persistence
process is performed by storing the CAR file on IPFS/Filecoin, the same CAR file
that was originally used to determine the data's CID.

#### Performance and decentralization

Through the dual S3/IPFS storage mechanism we get the best properties of both
worlds. If a Carrot user doesn't need strong decentralization guarantees, he
will be able to access all Carrot data from the S3 bucket directly
through a distributed CloudFront CDN, as the bucket always contains all limbo
data + persisted data. The addition of the CDN also boosts data delivery
performance, resulting in a snappier and overall better experience.

For users that want the maximum amount of decentralization and trustlessness
it's also possible to access Carrot data directly from IPFS, as IPFS will
have all Carrot's persistent data at all times. In most cases this won't have
the same performance as a distributed CloudFront CDN though.

This setup is especially powerful (in both decentralization and trustlessness)
if coupled with a frontend that allows using a locally hosted IPFS node to
access the data.
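
As a rough sketch of how a frontend might choose between the two sources, the
helper below returns candidate URLs for a CID in order of preference. The CDN
host, the assumption that the CDN serves objects at `/<cid>`, and the function
names are all hypothetical; `/ipfs/<cid>` is the standard IPFS path gateway
layout, and `http://127.0.0.1:8080` is the default gateway address of a local
Kubo node.

```typescript
interface DataSources {
  cdnBaseUrl: string; // hypothetical CloudFront distribution in front of the bucket
  ipfsGatewayBaseUrl: string; // local node or public IPFS gateway
}

// Remove trailing slashes so that paths can be appended safely.
function stripTrailingSlashes(url: string): string {
  return url.replace(/\/+$/, "");
}

// Return the URLs to try for a CID, most preferred first.
function candidateUrls(cid: string, sources: DataSources, preferIpfs: boolean): string[] {
  const cdnUrl = `${stripTrailingSlashes(sources.cdnBaseUrl)}/${cid}`;
  const ipfsUrl = `${stripTrailingSlashes(sources.ipfsGatewayBaseUrl)}/ipfs/${cid}`;
  return preferIpfs ? [ipfsUrl, cdnUrl] : [cdnUrl, ipfsUrl];
}

const sources: DataSources = {
  cdnBaseUrl: "https://cdn.example.org/", // placeholder host
  ipfsGatewayBaseUrl: "http://127.0.0.1:8080",
};

console.log(candidateUrls("bafy-example-cid", sources, true));
```

Keeping the CDN as a fallback matters for limbo data, which is only available
from the bucket and not from IPFS.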

## Tech used

2 changes: 2 additions & 0 deletions package.json
@@ -18,6 +18,7 @@
"devDependencies": {
"@commitlint/cli": "^18.4.4",
"@commitlint/config-conventional": "^18.4.4",
"@smithy/types": "^2.9.1",
"@types/jsonwebtoken": "^9.0.5",
"@types/pg": "^8.10.9",
"dotenv": "^16.3.1",
@@ -48,6 +49,7 @@
"hapi-swagger": "^17.2.0",
"joi": "^17.11.1",
"jsonwebtoken": "^9.0.2",
"multiformats": "^13.0.1",
"pg": "^8.11.3",
"viem": "^2.2.0"
}