The BOINC out of box experience
Suppose a scientist (let’s call her Mary) needs lots of high-throughput computing and can’t afford the usual sources. Let’s assume that
- Mary’s programs are Linux/Intel executables or Python scripts. She normally runs them on a Linux laptop.
- Mary has access to a Linux server on the Internet. She doesn’t necessarily have root access, but can ask a sysadmin to install packages.
- Mary knows Linux as a user, but not Docker, databases, web servers, AWS, etc.
Mary hears about volunteer computing and BOINC, and decides to investigate it. Mary will use BOINC only if this initial “out-of-box experience” (OOBE) is positive; i.e. she quickly tries out BOINC and is convinced that it works, that it’s useful to her, and that she wants to use it going forward. The ideal scenario is something like:
- Mary hears about BOINC and goes to the web site.
- Within ~1 hour she successfully runs jobs, using existing applications, on ~100 volunteer computers.
- What she ends up with is something that she can continue to use in production, and to which she can add other applications, GPU apps, larger volumes of jobs and data, BOINC features like result validation, etc.
The current BOINC OOBE doesn’t achieve this. The main BOINC server documentation is a sprawling mess. Marius’ Docker work (https://github.com/marius311/boinc-server-docker/blob/master/docs/cookbook.md) is a big step in the right direction, but more is needed to complete the above scenario.
BOINC competes with systems like HTCondor and AWS. We should study the OOBEs of these systems, borrow their good ideas, and make sure that we’re competitive.
The following is a sketch of what I think the OOBE should be like. The target configuration involves:
- A “server host”. This runs a BOINC server, as a set of Docker containers. It must be on a machine visible to the outside Internet, possibly a cloud instance.
- One or more “job submission hosts”. Scientists log in to these to do their work. They may be behind a firewall.
Setting up the server host involves downloading a .gz file containing the BOINC server software and some VM and Docker images. You then run a script that asks one or two questions and creates and runs a server (as Docker processes). The script also creates a read-me file saying:
- How to make the server start on boot (edit /etc files).
- Where the config files are in case you need to change something later.
Admin functions (start/stop server, create accounts for job submitters) are done through a web interface. After the initial setup there should be no need to log in.
Setting up a job submission host involves installing a package that contains job submission scripts (see below) but not the BOINC server.
We should handle at least two cases:
- The scientist has an executable and the libraries it needs.
- The scientist has a Python script and the modules it needs.
In each case, let’s assume that all files for an app are stored in a directory.
To submit a job:
boinc_run --app app_dir_path
Run this in a directory containing input files. It makes a job with those input files, running the given app. The file “cmdline”, if present, contains command-line args.
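As a rough illustration of the behavior described above, here is a minimal Python sketch of how boinc_run might gather a job's inputs and arguments from the current directory. The function name describe_job, the returned dictionary layout, the use of shell-style quoting in “cmdline”, and the decision to exclude “cmdline” from the input files are all assumptions made for illustration; none of this is an existing BOINC interface.

```python
import shlex
from pathlib import Path


def describe_job(job_dir, app_dir):
    """Collect what one job needs: app path, input file names, command-line args.

    Hypothetical helper: the name and the returned layout are illustrative only.
    """
    job_dir = Path(job_dir)
    cmdline_file = job_dir / "cmdline"

    # The optional "cmdline" file holds the command-line args.
    # Assumption: it uses shell-style quoting, so shlex can split it.
    args = shlex.split(cmdline_file.read_text()) if cmdline_file.is_file() else []

    # Assumption: every other regular file in the directory is an input file.
    inputs = sorted(
        p.name for p in job_dir.iterdir()
        if p.is_file() and p.name != "cmdline"
    )
    return {"app": str(app_dir), "inputs": inputs, "args": args}
```

Calling describe_job(".", app_dir_path) in a job directory would then give the submission step everything it needs without the scientist writing any XML.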
To run multiple jobs, create a directory for each job and put its input files there. Then run:
boinc_run_jobs --app app_dir_path dir1 dir2 ...
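The command-line shape above maps directly onto Python's argparse. The following is only a sketch of the parsing step; the function name parse_run_jobs_args and the attribute name job_dirs are hypothetical, and the actual submission logic is left out.

```python
import argparse


def parse_run_jobs_args(argv):
    """Parse 'boinc_run_jobs --app app_dir_path dir1 dir2 ...' (illustrative only)."""
    parser = argparse.ArgumentParser(prog="boinc_run_jobs")
    # Directory holding all files for the application.
    parser.add_argument("--app", required=True, help="path to the app directory")
    # One directory per job, each containing that job's input files.
    parser.add_argument("job_dirs", nargs="+", help="one directory per job")
    return parser.parse_args(argv)


if __name__ == "__main__":
    import sys

    args = parse_run_jobs_args(sys.argv[1:])
    for job_dir in args.job_dirs:
        print(f"would submit a job for {job_dir} using app {args.app}")
```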
To see the status of the job(s) started in the current directory:
boinc_status
If a job failed, this shows information such as its stderr output.
To abort jobs started in the current directory:
boinc_abort
To fetch the output files of completed jobs started in the current directory:
boinc_fetch
Note: fancier features can be added to this, but the basic features are ultra-simple. No XML editing, estimating job sizes, etc.
The implementation shouldn’t be that hard. It’s based on technology we already have: boinc-server-docker and boinc2docker, and the remote job and file management mechanisms.
The server host setup script creates a BOINC project running in Docker containers, equipped with the VBox-based universal app, and some standard Docker containers, e.g. for Python apps.
On the submission host, each user has a directory ~/.boinc to contain various configuration and status files. A file ~/.boinc/apps contains a list of applications that have been used. Each one is identified by a directory path. We keep track of the mod time of the directory and the files in it; we maintain a Docker layer corresponding to the application.
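One way to decide whether an application's Docker layer is stale is to compare the newest modification time under the app directory with the value recorded when the layer was last built. The sketch below assumes that approach; the function names and the in-memory dictionary standing in for ~/.boinc/apps are hypothetical, and the on-disk format of that file is not specified here.

```python
import os
from pathlib import Path


def app_fingerprint(app_dir):
    """Return the newest modification time of the app directory or anything in it."""
    app_dir = Path(app_dir)
    newest = app_dir.stat().st_mtime
    for root, dirs, files in os.walk(app_dir):
        for name in dirs + files:
            newest = max(newest, (Path(root) / name).stat().st_mtime)
    return newest


def needs_rebuild(app_dir, recorded):
    """True if no layer was recorded for this app, or the app changed since then.

    'recorded' maps a resolved app directory path to the fingerprint taken when
    its layer was built (a stand-in for the contents of ~/.boinc/apps).
    """
    key = str(Path(app_dir).resolve())
    return recorded.get(key) != app_fingerprint(app_dir)
```

A modification time check is cheap but can miss edits that preserve timestamps; hashing file contents would be a stricter alternative.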
The boinc_run command (a Python script) does the following:
- Check ~/.boinc/apps to see whether we have a Docker layer for the app. If not, build one using boinc2docker.
- Use the remote file management mechanism to copy files (app and input) to the Apache container.
- Use the remote job submission mechanism to submit the job. Write its ID to a file.
boinc_status etc. use the remote job submission mechanism.
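The boinc_run steps listed above can be sketched as a small driver function. Because this page does not spell out the interfaces of boinc2docker or of the remote file and job mechanisms, they are passed in here as plain callables (build_layer, upload, submit); those names, the known_apps dictionary, and the job ID file name are all hypothetical.

```python
from pathlib import Path

JOB_ID_FILE = ".boinc_job_id"  # hypothetical name for the file holding the job ID


def run_job(app_dir, job_dir, known_apps, build_layer, upload, submit):
    """Sketch of boinc_run: ensure a layer exists, copy files, submit, record the ID."""
    app_key = str(Path(app_dir).resolve())

    # Step 1: build a Docker layer for the app only if we do not have one yet.
    if app_key not in known_apps:
        known_apps[app_key] = build_layer(app_key)
    layer = known_apps[app_key]

    # Step 2: copy the app layer and the input files to the server.
    inputs = sorted(
        p for p in Path(job_dir).iterdir()
        if p.is_file() and p.name != JOB_ID_FILE
    )
    upload(layer, inputs)

    # Step 3: submit the job and write its ID to a file in the job directory.
    job_id = submit(layer, [p.name for p in inputs])
    (Path(job_dir) / JOB_ID_FILE).write_text(f"{job_id}\n")
    return job_id


def read_job_id(job_dir):
    """What boinc_status and friends would read back; None if nothing was submitted."""
    id_file = Path(job_dir) / JOB_ID_FILE
    return id_file.read_text().strip() if id_file.is_file() else None
```

Keeping the job ID in the job directory is what lets boinc_status, boinc_abort and boinc_fetch operate on "jobs started in the current directory" without extra arguments.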
The scientist starts by running the BOINC client on one or more of their own computers (possibly Windows or Mac), and attaching to the project.
When things are working and they’re ready to scale up, they register with Science United, supplying their keywords. The vetting process may take a day or two. This will typically provide them with several hundred hosts.
Another possibility is to allow Science United users to register as “testers”, and to add a mechanism where projects can register as “test projects” on SU, with no vetting. Such projects would be allowed only to use VM apps with no network access (we’d need to add a mechanism for this). They’d get some number of hosts (50-100) for a few days.
Once we have this working, we need to reorganize the server docs in such a way that scientists are initially steered toward the OOBE described here, but can still access lower-level info.