Bacalhau project report 20220506

Kai Davenport edited this page May 6, 2022 · 13 revisions

Work is underway in parallel on two fronts: productionizing the codebase (the "big refactor" branch) and investigating Python data science tools in WASM.

End of May launch! 🚀

The goal is to launch a real network by the end of May (we will run ~10 nodes ourselves on GCP), with some real sample data pinned to those nodes and usable by the public.

  • Custom container image support so people can bring their own workloads (unverifiable, since arbitrary Docker workloads are nondeterministic)
  • Generally reliable, production-ready codebase - ready to start scaling contributors as well

Nice to have for the launch:

  • Technology preview of Python WASM (which in future will be verifiable because we should be able to make it deterministic)

Big refactor

The goal here is to add support for custom Docker images while ripping out a lot of prototype code and replacing it with production-ready code behind clean interfaces.

We have made great progress with this, and the code in the docker-image branch is in a much better state, with in-memory implementations to enable unit testing and a new IPFS Docker sidecar storage driver. We've battled stability problems with the FUSE-based IPFS Docker sidecar, and are also working on a simpler alternative based on ipfs get. (It turns out the ipfs daemon copies the data locally anyway, so just doing an ipfs get before starting the container is much simpler and just as expensive in terms of disk usage and copying. We can optimize away the copying in a later phase.)
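
The ipfs get based approach can be sketched roughly as follows. This is an illustration in Node.js, not the project's implementation: planGetThenRun, execute, the /inputs mount point, and the scratch directory layout are all hypothetical names chosen for the example. It assumes only the ipfs get <cid> -o <path> CLI command and docker run with a -v bind mount.

```javascript
const path = require("path");
const { execFile } = require("child_process");
const { promisify } = require("util");

// Hypothetical helper: build the two commands needed to run a job whose
// input is an IPFS CID, without any FUSE mount.
function planGetThenRun(cid, image, jobCommand, scratchDir) {
  const localPath = path.join(scratchDir, cid);
  return [
    // Step 1: copy the data out of IPFS onto local disk.
    { cmd: "ipfs", args: ["get", cid, "-o", localPath] },
    // Step 2: bind-mount the local copy read-only into the container.
    {
      cmd: "docker",
      args: ["run", "--rm", "-v", `${localPath}:/inputs:ro`, image, ...jobCommand],
    },
  ];
}

// Hypothetical runner: executes each planned step in order and rejects on
// the first command that fails.
async function execute(plan) {
  const run = promisify(execFile);
  for (const step of plan) {
    await run(step.cmd, step.args);
  }
}

// Print the plan instead of executing it, so the example has no side effects.
// The CID and image below are placeholders.
const plan = planGetThenRun("QmExampleCid", "ubuntu:latest", ["ls", "/inputs"], "/tmp/job-data");
console.log(plan.map((step) => [step.cmd, ...step.args].join(" ")).join("\n"));
```

Keeping the command plan separate from its execution means the storage logic can be unit tested without a running ipfs daemon or Docker, in the same spirit as the in-memory implementations mentioned above.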

Tasks completed:

  • noop executor and inprocess scheduler
  • test the scheduling logic with the noop and inprocess mocks
  • fuse mount ipfs docker storage driver
  • start and manage the fuse mount docker sidecar container
  • debug and test for the fuse mount ipfs storage driver
  • general tooling:
    • devstack with ipfs servers only
    • function waiters with early error cancellation
    • test utils for adding files and waiting for jobs

WASM

The big news here is that the PyScript project was launched in the PyCon keynote and was all over Twitter. This is a push to run Python in the browser via WASM. This is great news for us because it's based on Pyodide, the Python WASM runtime we were already looking at, and it means there will be a lot more attention on Pyodide, with more folks making their libraries work with it if they don't already. Pyodide already supports quite a few packages, and while PyTorch and TensorFlow are quite a way off, numpy, pandas, scikit-learn and scikit-image are working, and those are great staples for Python data engineering and data science work.

We spent quite a lot of time scrabbling around in the dirt looking for ways to get Pyodide running in WASI (e.g. wasmtime) with library support, but concluded that's a ways off. This has to do with Pyodide's dependencies on the Emscripten APIs, full details here. Instead, we are going to go with Node.js support. Here is a POC of Pyodide running in Node.js entirely outside the browser (in the CLI), running pandas data engineering code:

async function main() {
    // Load Pyodide from a local copy of the distribution (no browser involved).
    let pyodide_pkg = await import("./pyodide/pyodide.js");
    let pyodide = await pyodide_pkg.loadPyodide({indexURL: "./pyodide/"});
    await pyodide.loadPackage("micropip");

    console.log(await pyodide.runPythonAsync("1+1"));
    // micropip.install is asynchronous, so await it before pandas is used.
    console.log(await pyodide.runPythonAsync(`
import micropip
await micropip.install("pandas")
`));
    console.log(await pyodide.runPythonAsync(`
import pandas as pd
from io import StringIO

# CSV string without headers
csvString = """Spark,25000,50 Days,2000
Pandas,20000,35 Days,1000
Java,15000,,800
Python,15000,30 Days,500
PHP,18000,30 Days,800"""

# Wrap the string in a file-like StringIO object
csvStringIO = StringIO(csvString)
df = pd.read_csv(csvStringIO, sep=",", header=None)
print(df)
`));
}

// Surface any failure instead of leaving an unhandled promise rejection.
main().catch(console.error);

root@903bf06785ea:/src# node n.js
warning: no blob constructor, cannot create blobs with mimetypes
warning: no BlobBuilder
Loading distutils
Loaded distutils
Python initialization complete
distutils already loaded from default channel
Loading micropip, pyparsing, packaging
Loaded micropip, packaging, pyparsing
2
distutils already loaded from default channel
pyparsing already loaded from default channel
Loading pandas, numpy, python-dateutil, six, pytz, setuptools
Loaded six, python-dateutil, pytz, setuptools, numpy, pandas
undefined
        0      1        2     3
0   Spark  25000  50 Days  2000
1  Pandas  20000  35 Days  1000
2    Java  15000      NaN   800
3  Python  15000  30 Days   500
4     PHP  18000  30 Days   800
undefined

Look, pandas dataframes in WASM!

So now the next questions/tasks are:

  • Can we make it deterministic by turning off all sources of entropy (perhaps by hacking the pyodide npm package)?
  • How do we hook up data from IPFS? Can we just pass a file descriptor connected to a CID into WASM, perhaps using https://github.com/ipfs/js-ipfs in the outer Node.js code, or our IPFS Docker sidecar?
  • Implement a WASM compute driver and a CLI subcommand for submitting a Python function and requirements.txt
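
As a starting point for the determinism question, one piece of "turning off entropy" on the Node.js side might look like the sketch below. It pins only two JavaScript-level sources, Math.random and Date.now, and the helper names (mulberry32, withPinnedEntropy) are hypothetical. Whether Pyodide actually draws its randomness and clock from these globals is an assumption that would need checking; other sources such as the crypto module, new Date(), and Python hash seeding are not covered here.

```javascript
// mulberry32: a small seeded PRNG that returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: run a synchronous `fn` with Math.random seeded and
// Date.now frozen, then restore the original functions.
function withPinnedEntropy(seed, fixedTimeMs, fn) {
  const realRandom = Math.random;
  const realNow = Date.now;
  Math.random = mulberry32(seed);
  Date.now = () => fixedTimeMs;
  try {
    return fn();
  } finally {
    Math.random = realRandom;
    Date.now = realNow;
  }
}

// Two runs with the same seed and clock observe identical "entropy".
const sample = () => [Math.random(), Math.random(), Date.now()];
const first = withPinnedEntropy(42, 0, sample);
const second = withPinnedEntropy(42, 0, sample);
console.log(JSON.stringify(first) === JSON.stringify(second));
```

For a real job the patched globals would probably need to stay in place for the whole lifetime of the process running Pyodide, since runPythonAsync is asynchronous and this helper only wraps synchronous calls.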