Bacalhau project report 20220414
The Compute-over-Data summit in Paris was a roaring success from our perspective: we got all the right minds in the room together and spent 3 days aligning on what we're building, deep-diving into specific design topics and hacking on the implementations. We also got insights from folks who had built things like this before, which will be super valuable in avoiding pitfalls moving forward.
The IPCS Master Plan is still solid, and we are proceeding on that basis (e.g. we are working on Phase 1 now). However, the summit was valuable because we got additional detail and planning on specific sub-topics and focus areas, which I'll outline here.
Juan brought up the concept of a low-level spec for execution of compute, and a high-level way to express a request for that compute.
For example, the low-level spec might define:
- A specific command, executed with specific CLI arguments, with a specific CID mounted at a specific location
- In a specific container image context
- Running in docker
- Running inside a firecracker VM for isolation
- With certain VM security properties (e.g. network disabled)
Whereas the user might just say `sed` :-)
Similarly, the low-level spec might define:
- A specific function, written in a specific version of Python, with a specific CID passed as a file-like function argument
- With a specific set of versioned PyPI packages installed
- Running in WASM
- With certain WASM security properties (e.g. randomness disabled, threading disabled)
- Running inside Docker (or not!)
- Running inside a firecracker VM (or not!)
Whereas the user might just say `def foo(data): return data[0]`!
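To make the split concrete, here is a rough Python sketch of the `sed` example. It is not the IPCS spec: every class name, field name, the `/input` mount path, the `ubuntu:20.04` image and the `expand` helper are assumptions made up for illustration.

```python
import shlex
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Mount:
    cid: str   # content identifier of the input data
    path: str  # where the CID is mounted inside the execution environment


@dataclass(frozen=True)
class LowLevelSpec:
    """Fully pinned-down description of one execution (hypothetical shape)."""
    command: Tuple[str, ...]   # specific command with specific CLI arguments
    mounts: Tuple[Mount, ...]  # specific CIDs at specific locations
    image: str                 # container image context
    runtime: str               # e.g. "docker" or "wasm"
    isolation: Optional[str]   # e.g. "firecracker", or None
    network_enabled: bool = False


@dataclass(frozen=True)
class HighLevelRequest:
    """What the user actually types (hypothetical shape)."""
    command: str
    input_cid: str


def expand(request: HighLevelRequest) -> LowLevelSpec:
    # Hard-coded defaults stand in for the layer that would really choose
    # the stack; the image name and mount path are illustrative only.
    return LowLevelSpec(
        command=tuple(shlex.split(request.command)),
        mounts=(Mount(cid=request.input_cid, path="/input"),),
        image="ubuntu:20.04",
        runtime="docker",
        isolation="firecracker",
        network_enabled=False,
    )


# "QmExampleCid" is a placeholder, not a real CID.
spec = expand(HighLevelRequest("sed -n 1p /input/data.txt", "QmExampleCid"))
```

The point is only that the user-facing request stays tiny while the spec the scheduler sees leaves nothing implicit.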
The operator might want to define certain constraints, such as "untrusted code not sandboxed must run in VM isolation".
The scheduler should operate on the low-level definitions, and there should be a layer in between to solve for the simplest stack that delivers the required capability (given the constraints provided by both the requestor and the operator). This might be implemented using a SAT solver (hat tip to Dhash).
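As a toy illustration of "solve for the simplest stack", the sketch below brute-forces all combinations of three layers instead of using a SAT solver, which is fine at this size but is a deliberate simplification. The layer names, the cost function (number of layers) and the decision to treat WASM as a sandbox are all assumptions.

```python
from itertools import product
from typing import Callable, List, NamedTuple, Optional


class Stack(NamedTuple):
    wasm: bool
    docker: bool
    firecracker: bool

    def cost(self) -> int:
        # "Simplest" is taken to mean "fewest layers" (an assumption).
        return int(self.wasm) + int(self.docker) + int(self.firecracker)


Constraint = Callable[[Stack], bool]


def simplest_stack(constraints: List[Constraint]) -> Optional[Stack]:
    """Return the cheapest stack satisfying every constraint, or None."""
    candidates = [Stack(*bits) for bits in product([False, True], repeat=3)]
    feasible = [s for s in candidates if all(c(s) for c in constraints)]
    return min(feasible, key=Stack.cost, default=None)


# Operator constraint: untrusted code that is not sandboxed must run in a VM.
def untrusted_needs_sandbox_or_vm(s: Stack) -> bool:
    return s.wasm or s.firecracker


# Requestor constraints for the two examples in this post.
def is_python_function(s: Stack) -> bool:
    return s.wasm


def is_container_job(s: Stack) -> bool:
    return s.docker and not s.wasm


python_stack = simplest_stack([is_python_function, untrusted_needs_sandbox_or_vm])
container_stack = simplest_stack([is_container_job, untrusted_needs_sandbox_or_vm])
```

Under these assumptions the Python function resolves to plain WASM, while the container job picks up a firecracker VM because Docker alone does not satisfy the operator. A real implementation would have far more layers and properties, which is where a proper solver starts to pay off.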
We had a useful session with Raul on FVM and his vision for IPVM. Long term, one idea is that the implementations of the IPCS interfaces might be WASM implementations that are hot-swappable at runtime. Initially though, everything will just go into the Go implementation.
We also spent time with WASM experts (hat tip to Thomas from Polyphene for pairing with me in the hackathon on this, and Dhash again) starting to answer the question "how fucked is the CPython runtime if you run it in WASM with all sources of entropy disabled?" - this is one of the key pieces of technical risk as to whether our deterministic hash-based verifier is viable for the first release.
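For context on why entropy matters here: a hash-based verifier only works if honest nodes running the same job produce byte-identical output. The sketch below is one guess at the simplest form of such a check, not the actual verifier design; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from typing import Dict


def result_hash(output: bytes) -> str:
    # SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(output).hexdigest()


def hashes_agree(outputs_by_node: Dict[str, bytes]) -> bool:
    """True only if every node produced byte-identical output.

    An empty result set is treated as "not verified".
    """
    digests = {result_hash(out) for out in outputs_by_node.values()}
    return len(digests) == 1


# Deterministic job: every node returns the same bytes.
deterministic = {"node-a": b"first line\n", "node-b": b"first line\n"}

# Any leaked nondeterminism (timestamps, random seeds, hash ordering)
# changes the bytes on one node and the check fails.
leaky = {"node-a": b"run at 12:00:01\n", "node-b": b"run at 12:00:02\n"}
```

Here `hashes_agree(deterministic)` holds and `hashes_agree(leaky)` does not, which is why every source of entropy in the runtime has to be closed off before comparing hashes means anything.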
In the future, our scheduler may achieve consensus and handle incentives and such using a smart contract on FVM. For now, it will just be libp2p, and we will roll our own WASM integration, but we'll continue to collaborate with Raul and the rest of the FVM team as things progress.
The broad sketch of the libp2p scheduler is that all compute nodes will broadcast their available compute and current usage of that compute, and the clients (requestor nodes) will do the scheduling/requesting.
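Here is an in-memory sketch of that flow, with no real libp2p involved: compute nodes emit capacity adverts, and a requestor keeps the latest advert per node and picks a target itself. The message fields, class names and the "most free CPUs wins" policy are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CapacityAdvert:
    """What a compute node might broadcast (hypothetical message shape)."""
    node_id: str
    total_cpus: int
    used_cpus: int

    @property
    def free_cpus(self) -> int:
        return self.total_cpus - self.used_cpus


class RequestorNode:
    """Client-side scheduler: listens to adverts and picks a node itself."""

    def __init__(self) -> None:
        self.latest: Dict[str, CapacityAdvert] = {}

    def on_advert(self, advert: CapacityAdvert) -> None:
        # A newer advert from the same node replaces the older one.
        self.latest[advert.node_id] = advert

    def pick_node(self, cpus_needed: int) -> Optional[str]:
        eligible = [a for a in self.latest.values() if a.free_cpus >= cpus_needed]
        if not eligible:
            return None
        # Most free capacity wins; ties go to the smallest node id.
        best = min(eligible, key=lambda a: (-a.free_cpus, a.node_id))
        return best.node_id


requestor = RequestorNode()
requestor.on_advert(CapacityAdvert("node-a", total_cpus=8, used_cpus=6))
requestor.on_advert(CapacityAdvert("node-b", total_cpus=4, used_cpus=0))
```

With these adverts, a 2-CPU job goes to `node-b` and a 5-CPU job finds no node until `node-a` broadcasts lower usage. Keeping the decision on the requestor side avoids a central scheduler, at the price of acting on capacity data that may already be stale.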
The interesting question then becomes how can we make this libp2p scheduler extremely fast?
Dhash got us thinking about benchmarking the system: how many thousands of function executions per second can be achieved? Where will the bottlenecks be in the system? Will IPFS perform at scale?
Dhash also pointed us to gg (demo), an insanely cool system that can compile a massive C codebase, which normally takes hours on the fastest computer you can buy, in 30 seconds by distributing the work massively across AWS Lambda. Can we make gg target IPCS and have it work just as efficiently as AWS Lambda? That means executing 5000+ functions (containers) with parallelism up to ~2000 on thousands of nodes all in less than 30 seconds. That would be quite the demo!
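A first benchmark harness for "function executions per second" could be as small as the sketch below. It runs a stand-in task on a local thread pool, so it measures nothing about a real network; the function name and parameters are assumptions, and the task would need to be replaced with an actual job submission.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def measure_throughput(
    task: Callable[[], object], n_tasks: int, parallelism: int
) -> Tuple[int, float, float]:
    """Run `task` n_tasks times with bounded parallelism.

    Returns (completed, elapsed_seconds, executions_per_second).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(lambda _: task(), range(n_tasks)))
    elapsed = time.perf_counter() - start
    rate = len(results) / elapsed if elapsed > 0 else float("inf")
    return len(results), elapsed, rate


# Stand-in workload; a real benchmark would submit a job to the network here.
completed, elapsed, rate = measure_throughput(lambda: None, n_tasks=50, parallelism=8)
```

For the gg-style target, the interesting settings would be `n_tasks` around 5000 and `parallelism` around 2000, with the goal of `elapsed` staying under 30 seconds.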
Dave got OpenTelemetry working in Bacalhau, which is awesome.
We propose naming the specific first implementation IPCS, the Interplanetary Compute System, to make it clear that, analogous to IPFS, it's a volunteer network without incentives. I suggest we do this under the continued umbrella name of the Bacalhau project. Thoughts?
I've registered ipcs.network and we can put a docs site there. But I want to give folks a chance to object to the naming proposal before we start renaming everything in the code etc.
We had useful discussions with the creators of Holium (Philippe and Thomas), and Juan from zondax.ch. We have also since had a useful discussion on Zoom with Brendan, the author of https://qri.io/.
All of these insights will feed into making IPCS as good as possible!
I'm sure I've missed other topics & key ideas that were discussed at the summit. Feel free to edit this wiki post and add your own.
This week we have started:
- refactoring the compute interface out to prepare for the addition of a WASM FaaS (functions-as-a-service) mode
- getting a stripped-down Python FaaS running in WASM with all sources of entropy/nondeterminism disabled in the VM.
Expect more updates on these topics next week!
Huge thanks to everyone who showed up and participated in the summit, especially to Carolyn Lee and David Aronchick for arranging the summit, and to Protocol Labs for funding it. It was highly motivating and energising to connect and share ideas with this awesome community in real life. 😄