The BOINC out of box experience

Suppose a scientist (let's call her Mary) needs lots of high-throughput computing and can't afford the usual sources. Let's assume that

  • Mary's programs are Linux/Intel executables or Python scripts. She normally runs them on a Linux laptop.
  • Mary has access to a Linux server on the Internet. She doesn't necessarily have root access, but can ask a sysadmin to install packages.
  • Mary knows Linux as a user, but not Docker, databases, web servers, AWS, etc.

Mary hears about volunteer computing and BOINC, and decides to investigate it. Mary will use BOINC only if this initial “out-of-box experience” (OOBE) is positive; i.e., she quickly tries out BOINC and is convinced that it works, that it's useful to her, and that she wants to use it going forward. The ideal scenario is something like:

  1. Mary hears about BOINC and goes to the web site.
  2. Within ~1 hour she successfully runs jobs, using existing applications, on ~100 volunteer computers.
  3. What she ends up with is something that she can continue to use in production, and to which she can add other applications, GPU apps, larger volumes of jobs and data, BOINC features like result validation, etc.

The current BOINC OOBE doesn't achieve this. The main BOINC server documentation is a sprawling mess. Marius's Docker work (https://github.com/marius311/boinc-server-docker/blob/master/docs/cookbook.md) is a big step in the right direction, but more is needed to complete the above scenario.

BOINC competes with systems like HTCondor and AWS. We should study the OOBEs of these systems, borrow their good ideas, and make sure that we're competitive.

The goal

The following is a sketch of what I think the OOBE should be like. The target configuration involves:

  • A “server host”. This runs a BOINC server, as a set of Docker containers. It must be on a machine visible to the outside Internet, possibly a cloud instance.
  • One or more “job submission hosts”. Scientists log in to these to do their work. They may be behind a firewall.

Setting up the server host

This involves downloading a .gz file containing the BOINC server software and some VM and Docker images. You then run a script that asks one or two questions, then creates and runs a server (as a set of Docker processes). The script also creates a read-me file saying:

  • How to make the server start on boot (edit /etc files).
  • Where the config files are in case you need to change something later.

Admin functions (start/stop the server, create accounts for job submitters) are done through a web interface. After the initial setup there should be no need to log in to the server host.

Setting up a job submission host

This involves installing a package that contains job submission scripts (see below) but not the BOINC server.

Running jobs

We should handle at least two cases:

  • The scientist has an executable and the libraries it needs.
  • The scientist has a Python script and the modules it needs.

In each case, let's assume that all files for an app are stored in a directory.
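
For instance, an app directory for the Python case might look like this (the file names are made up):

    my_app/
        analyze.py          # the script to run
        helpers.py          # modules it imports
        requirements.txt    # other Python modules it needs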

To submit a job:

boinc_run --app app_dir_path

Run this in a directory containing input files. It creates a job that runs the given app with those input files. If a file named “cmdline” is present, its contents are used as the job's command-line args.
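
For instance, just before submission a job directory might contain (file names made up; only “cmdline” has special meaning):

    input1.dat
    input2.dat
    cmdline        # optional; e.g. "--n_steps 1000 input1.dat"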

To run multiple jobs, create a directory for each job, and put input files there. Then do

boinc_run_jobs --app app_dir_path dir1 dir2 ...

To see the status of the job(s) started in the current directory:

boinc_status 

If a job failed, this shows info such as its stderr output.

To abort jobs started in the current directory:

boinc_abort

To fetch the output files of completed jobs started in the current directory:

boinc_fetch

Note: fancier features can be added to this, but the basic features are ultra-simple. No XML editing, estimating job sizes, etc.
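
For example, a complete session might look like this (the directory and file names are made up; the commands are the ones above):

    mkdir myjob
    cd myjob
    cp ~/data/input.dat .            # the job's input file(s)
    boinc_run --app ~/apps/my_sim    # submit one job running the app in ~/apps/my_sim
    boinc_status                     # later: check progress; shows stderr if the job failed
    boinc_fetch                      # once it's done: retrieve the output files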

Implementation

The implementation shouldn't be that hard. It's based on technology we already have: boinc-server-docker and boinc2docker, and the remote job and file management mechanisms.

The server host setup script creates a BOINC project running in Docker containers, equipped with the VBox-based universal app, and some standard Docker containers, e.g. for Python apps.

On the submission host, each user has a directory, ~/.boinc, containing various configuration and status files. A file ~/.boinc/apps contains a list of the applications that have been used, each identified by a directory path. We keep track of the mod time of the directory and of the files in it, and we maintain a Docker layer corresponding to each application.
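
Here is a rough sketch of how this app registry might work, assuming ~/.boinc/apps is a simple JSON map from app directory path to the mod time at which we last built its layer (the file format and function names are illustrative, not existing code):

    # Hypothetical sketch: record each app directory and the modification time at
    # which we last built a Docker layer for it, in a JSON file at ~/.boinc/apps.
    import json
    import os

    APPS_FILE = os.path.expanduser("~/.boinc/apps")

    def load_registry():
        """Return {app_dir_path: last_built_mtime}, or {} if no apps yet."""
        try:
            with open(APPS_FILE) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_registry(registry):
        os.makedirs(os.path.dirname(APPS_FILE), exist_ok=True)
        with open(APPS_FILE, "w") as f:
            json.dump(registry, f, indent=2)

    def latest_mtime(app_dir):
        """Most recent modification time of the app directory or any file in it."""
        times = [os.path.getmtime(app_dir)]
        for name in os.listdir(app_dir):
            times.append(os.path.getmtime(os.path.join(app_dir, name)))
        return max(times)

    def app_needs_rebuild(app_dir):
        """True if we have never built a layer for this app, or it has changed."""
        recorded = load_registry().get(app_dir)
        return recorded is None or latest_mtime(app_dir) > recorded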

The boinc_run command (a Python script) does the following (see the sketch after this list):

  1. Check ~/.boinc/apps to see whether we have a Docker layer for the app. If not, build one using boinc2docker.
  2. Use the remote file management mechanism to copy files (app and input) to the Apache container.
  3. Use the remote job submission mechanism to submit the job. Write its ID to a file.
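
Here is a rough sketch of that flow. The helpers build_app_layer, upload_files, and submit_job stand in for boinc2docker and the remote file management and job submission mechanisms; their names and signatures are assumptions, not existing interfaces (the registry helpers are from the sketch above):

    # Hypothetical sketch of boinc_run. build_app_layer, upload_files, and
    # submit_job are placeholders for boinc2docker and BOINC's remote file
    # management and job submission mechanisms; they are assumptions, not real APIs.
    import argparse
    import os

    def boinc_run(app_dir):
        # 1. Build (or reuse) the Docker layer for this app.
        if app_needs_rebuild(app_dir):
            build_app_layer(app_dir)                  # wraps boinc2docker (assumed)
            registry = load_registry()
            registry[app_dir] = latest_mtime(app_dir)
            save_registry(registry)

        # 2. Copy app and input files to the server's Apache container.
        input_files = [f for f in os.listdir(".")
                       if os.path.isfile(f) and f != "cmdline"
                       and not f.startswith(".")]
        upload_files(app_dir, input_files)            # remote file management (assumed)

        # Optional command-line args for the job.
        cmdline = ""
        if os.path.exists("cmdline"):
            with open("cmdline") as f:
                cmdline = f.read().strip()

        # 3. Submit the job, and record its ID for boinc_status/boinc_abort/boinc_fetch.
        job_id = submit_job(app_dir, input_files, cmdline)   # remote job submission (assumed)
        with open(".boinc_job_id", "a") as f:                 # file name is an assumption
            f.write(str(job_id) + "\n")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--app", required=True, help="path to the app directory")
        args = parser.parse_args()
        boinc_run(args.app)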

boinc_status, boinc_abort, and boinc_fetch likewise use the remote job submission mechanism.
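
A sketch of boinc_status in the same spirit (query_job is an assumed wrapper around the remote job submission mechanism, not an existing call):

    # Hypothetical sketch of boinc_status: report on jobs started in this directory.
    def boinc_status():
        with open(".boinc_job_id") as f:              # written by boinc_run above
            job_ids = f.read().split()
        for job_id in job_ids:
            info = query_job(job_id)                  # assumed wrapper, not a real API
            print(f"job {job_id}: {info['state']}")
            if info["state"] == "failed":
                print(info["stderr"])                 # show stderr output on failure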

Computing resources

The scientist starts by running the BOINC client on one or more of their own computers (possibly Windows or Mac), and attaching to the project.

When things are working and they're ready to scale up, they register with Science United, supplying their keywords. The vetting process may take a day or two. This will typically provide them with several hundred hosts.

Another possibility is to allow Science United users to register as “testers”, and to add a mechanism where projects can register as “test projects” on SU, with no vetting. Such projects would be allowed only to use VM apps with no network access (we'd need to add a mechanism for this). They'd get some number of hosts (50-100) for a few days.

Restructuring server documentation

Once we have this working, we need to reorganize the server docs in such a way that scientists are initially steered toward the OOBE described here, but can still access lower-level info.