Bacalhau project report 20220506
Work is underway in parallel on two streams: productionizing / "the big refactor", and investigating Python data science tools in WASM.
The goal is to launch a real network by the end of May (where we run ~10 nodes ourselves on GCP), with some real sample data pinned to those nodes and usable by the public.
Must have for the launch:
- Support for custom container images, so people can bring their own workloads (unverifiable, since arbitrary Docker containers are nondeterministic)
- A generally reliable, production-ready codebase, ready to start scaling up contributors as well
Nice to have for the launch:
- Technology preview of Python WASM (which in future will be verifiable because we should be able to make it deterministic)
Productionizing / "the big refactor"
The goal here is to add support for custom Docker images, while ripping out a lot of prototype code and upgrading it to production readiness with clean interfaces.
We have made great progress with this, and the code in the docker-image branch is in a much better state, with in-memory implementations to enable unit testing and a new IPFS Docker sidecar storage driver. We've had some battles with the stability of the FUSE-based IPFS sidecar, so we are also working on an alternative, simpler `ipfs get`-based implementation. (It turns out the ipfs daemon copies the data locally anyway, so just doing an `ipfs get` before starting the container is much simpler and just as expensive in terms of disk usage/copying. We can optimize away the copying in a later phase.)
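To make the `ipfs get` approach concrete, here is a minimal Node.js sketch of the sequence; the CID, image, and paths are illustrative placeholders, not the actual driver's interface:

```js
// Sketch: fetch the input CID out of the local IPFS daemon with
// `ipfs get`, then bind-mount the result into the job container,
// instead of mounting it via the FUSE sidecar.
const { execFileSync } = require("child_process");

function runJobWithInput(cid, image, command) {
  const inputDir = `/tmp/inputs/${cid}`;
  // The daemon already holds a local copy, so this is just a local
  // copy to a plain directory, no extra network traffic.
  execFileSync("ipfs", ["get", cid, "-o", inputDir]);
  // Mount the fetched data read-only into the container.
  execFileSync(
    "docker",
    ["run", "--rm", "-v", `${inputDir}:/inputs:ro`, image, ...command],
    { stdio: "inherit" }
  );
}

runJobWithInput("QmSomeExampleCid", "ubuntu:20.04", ["ls", "-l", "/inputs"]);
```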
Tasks completed:
- noop executor and in-process scheduler
- testing the scheduling logic with the noop and in-process mocks
- FUSE-mount IPFS Docker storage driver
- starting and managing the FUSE-mount Docker sidecar container
- debugging and testing for the FUSE-mount IPFS storage driver
- general tooling:
  - devstack with IPFS servers only
  - function waiters with early error cancellation (see the sketch after this list)
  - test utils for adding files and waiting for jobs
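The function waiter pattern looks roughly like this minimal JavaScript sketch (names are illustrative, not the actual implementation):

```js
// Sketch of a function waiter with early error cancellation: poll a
// set of condition functions until all pass, but abort the whole wait
// as soon as any one of them throws.
async function waitUntil(checks, { intervalMs = 500, timeoutMs = 30000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Promise.all rejects on the first thrown error, cancelling the
    // wait early instead of letting it run out the timeout.
    const results = await Promise.all(checks.map((check) => check()));
    if (results.every(Boolean)) return; // all conditions satisfied
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("timed out waiting for conditions");
}

// e.g. await waitUntil([() => jobSubmitted(id), () => outputsPinned(id)]);
```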
Python data science tools in WASM
The big news here is that the PyScript project launched in the PyCon keynote and was all over Twitter. This is a push to run Python in the browser via WASM. It's great news for us because PyScript is based on Pyodide, the Python WASM runtime we were already looking at, which means there will be a lot more attention on Pyodide and more folks making their libraries work with it if they don't already. Pyodide already supports quite a few packages: while PyTorch and TensorFlow are still quite a way off, numpy, pandas, scikit-learn and scikit-image are working, and those are great staples for Python data engineering and data science work.
We spent quite a lot of time scrabbling around in the dirt looking for ways to get Pyodide running under WASI (e.g. wasmtime) with library support, but concluded that's still a ways off because of Pyodide's dependencies on the Emscripten APIs (full details here). Instead, we are going to go with Node.js. Here is a POC of Pyodide running entirely outside the browser (on the CLI under Node.js), running pandas data engineering code:
```js
async function main() {
  // Load the Pyodide runtime from a local copy of the distribution.
  let pyodide_pkg = await import("./pyodide/pyodide.js");
  let pyodide = await pyodide_pkg.loadPyodide({ indexURL: "./pyodide/" });
  await pyodide.loadPackage("micropip");

  console.log(await pyodide.runPythonAsync("1+1"));

  // Install pandas (and its dependencies) into the WASM environment.
  console.log(await pyodide.runPythonAsync(`
import micropip
await micropip.install("pandas")
`));

  // Run pandas code entirely inside WASM.
  console.log(await pyodide.runPythonAsync(`
from io import StringIO
import pandas as pd

# CSV string without headers
csvString = """Spark,25000,50 Days,2000
Pandas,20000,35 Days,1000
Java,15000,,800
Python,15000,30 Days,500
PHP,18000,30 Days,800"""

# Convert the string into a file-like StringIO object
csvStringIO = StringIO(csvString)

df = pd.read_csv(csvStringIO, sep=",", header=None)
print(df)
`));
}

main();
```
```
root@903bf06785ea:/src# node n.js
warning: no blob constructor, cannot create blobs with mimetypes
warning: no BlobBuilder
Loading distutils
Loaded distutils
Python initialization complete
distutils already loaded from default channel
Loading micropip, pyparsing, packaging
Loaded micropip, packaging, pyparsing
2
distutils already loaded from default channel
pyparsing already loaded from default channel
Loading pandas, numpy, python-dateutil, six, pytz, setuptools
Loaded six, python-dateutil, pytz, setuptools, numpy, pandas
undefined
        0      1        2     3
0   Spark  25000  50 Days  2000
1  Pandas  20000  35 Days  1000
2    Java  15000      NaN   800
3  Python  15000  30 Days   500
4     PHP  18000  30 Days   800
undefined
```
Look, pandas dataframes in WASM!
So now the next questions/tasks are:
- Can we make it deterministic by turning off all sources of entropy (e.g. by hacking the pyodide npm package)? (A first experiment is sketched below.)
- How do we hook up data from IPFS? Can we just pass a file descriptor connected to a CID into WASM, maybe using https://github.com/ipfs/js-ipfs in the outer Node.js code, or our IPFS Docker sidecar? (See the second sketch below.)
- Implement a WASM compute driver and a CLI subcommand for submitting a Python function and a requirements.txt.
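On the determinism question, one plausible first experiment (an assumption on our part, not a tested recipe) is to stub out the entropy sources Pyodide sees on the JS side before it loads:

```js
// Hypothetical experiment: freeze the JS-side entropy sources before
// Pyodide loads, so Python code that reads clocks or random bytes gets
// the same answers on every run. A starting point, not a proof.
// Depending on the Node version, some of these globals may need
// Object.defineProperty to override.
Math.random = () => 0.5;            // constant PRNG
Date.now = () => 1651881600000;     // frozen wall clock
let tick = 0;
globalThis.performance = { now: () => ++tick };  // fake monotonic timer
globalThis.crypto = {
  getRandomValues: (buf) => (buf.fill(4), buf),  // fixed "random" bytes
};

async function main() {
  let pyodide_pkg = await import("./pyodide/pyodide.js");
  let pyodide = await pyodide_pkg.loadPyodide({ indexURL: "./pyodide/" });
  // With the stubs in place, time.time/random/os.urandom all bottom
  // out in the patched JS functions above.
  console.log(await pyodide.runPythonAsync(`
import random, time, os
(random.random(), time.time(), os.urandom(4).hex())
`));
}

main();
```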
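And on the IPFS question, a minimal sketch of the js-ipfs route (using the ipfs-core package; the CID and file names are placeholders): fetch the bytes in the outer Node.js process and write them into Pyodide's in-memory Emscripten filesystem, where Python sees an ordinary file:

```js
// Sketch: cat a CID with js-ipfs in the outer Node.js process and copy
// the bytes into Pyodide's in-memory filesystem for Python to read.
async function runWithIpfsInput(cid) {
  const { create } = await import("ipfs-core");
  const ipfs = await create();

  // ipfs.cat() streams the file back in chunks; concatenate them.
  const chunks = [];
  for await (const chunk of ipfs.cat(cid)) chunks.push(chunk);
  const data = Buffer.concat(chunks);

  const pyodide_pkg = await import("./pyodide/pyodide.js");
  const pyodide = await pyodide_pkg.loadPyodide({ indexURL: "./pyodide/" });
  await pyodide.loadPackage("pandas");

  // Write the fetched bytes into the WASM filesystem under /inputs.
  pyodide.FS.mkdir("/inputs");
  pyodide.FS.writeFile("/inputs/data.csv", data);

  console.log(await pyodide.runPythonAsync(`
import pandas as pd
print(pd.read_csv("/inputs/data.csv", header=None))
`));
}

runWithIpfsInput("QmSomeExampleCid");
```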