
Computation Evaluator (CompEval)

CompEval is a tool for evaluating workflows composed of expensive computation tasks with shared sub-expressions. CompEval uses the Git version control system to represent both the workflow as a directory tree and each of its particular instances. This enables easy deployment in computation clusters.

Goals

CompEval is not the simplest possible implementation of the concept, but we hope it is the simplest possible implementation of our particular formulation of the design goals:

  • Version management. Be sure which version of your implementation computed the given result.

  • Easy deployment. Updating and recomputing should be easy and fast (without recomputing what didn't need to be).

  • Human-accessible storage. All stages of the computation shall be easily accessible for user inspection.

This mostly rules out using a script to guide the workflow execution, due to (i) the difficulty of eliminating common sub-expressions above the leaf level, and (ii) the difficulty of storing the contents of temporary variables (named or unnamed). We therefore opt for functional programming semantics and a very explicit representation of the expression tree.

Concepts

Computation

The computation is a named executable that takes one or more inputs and transforms them into a set of outputs. The inputs and outputs are stored in files and the filenames are passed as arguments to the executable; the format of the files is arbitrary from CompEval's perspective.

Conceptually, the computation will be a computationally intensive task that may take a long time to run, but it is pure in the sense that, when run on the same inputs, it will always produce the same outputs (up to an isomorphism, or with high probability in the case of probabilistic algorithms; the point is that we can safely reuse previously obtained results).

An instance of a computation on given inputs, producing given outputs, is a call (i.e., a call is a triplet (computation, inputs, outputs)). It is named based on the computation, its version, and the hashes of its inputs (see Global File Storage below).
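
As an illustration of the call interface (the concrete CompEval implementation may differ), here is a minimal Python sketch of issuing a call; run_call and its arguments are hypothetical names, not CompEval's API:

```python
# Minimal sketch of performing one call (computation, inputs, outputs).
# Names here are illustrative, not CompEval's actual API.
import subprocess

def run_call(exec_path, input_files, output_files):
    """Run one computation: the executable receives the input filenames, then the output filenames."""
    # Because the computation is pure, a call with identical inputs can reuse previous outputs.
    subprocess.run([exec_path, *input_files, *output_files], check=True)

# e.g. run_call("computations/wordcount/exec", ["storage/in.txt"], ["storage/out.txt"])
```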

Workflow

The workflow is a tree of computation expressions. Each sub-tree is also a workflow (or, if you will, a sub-workflow). The computation in each node of the expression tree obtains its inputs from any of the following:

  • outputs or output slices of sub-workflows,
  • global inputs or input slices of the whole task (see below), or
  • literals stored in the expression tree.

A particular workflow expression tree executed with a given set of global task inputs is called a run; it performs calls on the tree nodes leaf-first.
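
A sketch of what the expression tree and its leaf-first evaluation look like conceptually; the Node class and evaluate function are illustrative names only, not CompEval's actual data model:

```python
# Illustrative sketch of a workflow expression tree and its leaf-first
# (post-order) evaluation; not CompEval's actual data model.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                     # "computation", "input" or "literal"
    value: str                                    # computation name, input name or literal file name
    children: list = field(default_factory=list)  # sub-workflows feeding the computation's inputs

def evaluate(node, global_inputs, run_computation):
    """Evaluate a (sub-)workflow leaf-first and return its list of output files."""
    if node.kind == "input":
        return [global_inputs[node.value]]        # a global task input
    if node.kind == "literal":
        return [node.value]                       # a file already present in the storage
    # "computation": evaluate sub-workflows first, then perform this node's call
    inputs = [f for child in node.children
                for f in evaluate(child, global_inputs, run_computation)]
    return run_computation(node.value, inputs)    # returns the list of output files
```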

Task

The task represents a whole "program" we want to execute, processing a set of externally-supplied input data, obtaining the final output we desire from the system. The task therefore consists of a workflow with the whole expression tree and a set of inputs (that are globally available within the workflow tree), producing a set of outputs (corresponding to the outputs of the root node of the workflow tree).

An instance of a task on given inputs, producing given outputs, is a job (i.e., a job is a triplet (task (= workflow), inputs, outputs)).

Representation

Tasks

A task is a named Git repository with a directory structure like

computations/...
workflow/...
inputs

The computations/ subdirectory is usually a submodule and contains a library of computation executables; alternatively, each computation could reside in a submodule. The workflow/ subdirectory tree contains the computation expression tree.

The inputs text file shall contain a line for each input passed, assigning a symbolic name to each input for referral from workflows.
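
A minimal sketch of how such a file could be consumed, under the assumption (not fixed by this README) that it lists one symbolic name per line, bound positionally to the input files passed on the command line:

```python
def read_task_inputs(repo_root, input_files):
    # ASSUMPTION: one symbolic name per line, bound positionally to the
    # input files supplied on the ce-run command line.
    with open(f"{repo_root}/inputs") as f:
        names = [line.strip() for line in f if line.strip()]
    return dict(zip(names, input_files))          # symbolic name -> supplied input file
```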

Workflows

A workflow subdirectory for a computation is organized like:

kind
value
inputs/nn-text/...
inputs/nn_mm-text/...
outputs

The kind file contains a single line "computation". The value file contains a single line with the name of the computation to run.

The inputs/ subdirectory contains a directory for each input, where nn is the number of the input (two digits, zero-padded, numbered from zero) and text is a free-form human-readable description; each such directory contains another sub-workflow. In case multiple inputs are generated by a single sub-workflow, the nn_mm-text naming convention must be used, describing the range of inputs supplied by that sub-workflow.

The outputs file contains the slice of computation outputs to pass up through the expression tree, one line per workflow output, each line containing the number of a computation output. E.g. a single line 00 says that just the first computation output is to be used as the first workflow output, while two lines 01 and 00 say that the second computation output will be used as the first workflow output and the first computation output as the second workflow output.

A workflow subdirectory representing a global input or a literal is organized like

kind
value

with the kind file containing "input" or "literal", respectively, and the value file containing either the symbolic name of the global task input or the raw file name in the global file storage (see below), respectively.
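
A sketch of reading one workflow node directory under the layout described above; the read_node helper and its return structure are illustrative only:

```python
# Illustrative reader for one workflow node directory (kind/value/inputs/outputs);
# the helper is not part of CompEval itself.
import os, re

def read_node(node_dir):
    """Read one workflow node: kind, value, and (for computations) inputs and outputs."""
    kind = open(os.path.join(node_dir, "kind")).read().strip()
    value = open(os.path.join(node_dir, "value")).read().strip()
    node = {"kind": kind, "value": value, "inputs": [], "outputs": []}
    if kind == "computation":
        inputs_dir = os.path.join(node_dir, "inputs")
        names = sorted(os.listdir(inputs_dir)) if os.path.isdir(inputs_dir) else []
        for name in names:
            # nn-text or nn_mm-text: input number (or range) plus a free-form description
            m = re.match(r"(\d\d)(?:_(\d\d))?-(.+)$", name)
            if m:
                node["inputs"].append((m.group(1), m.group(2), os.path.join(inputs_dir, name)))
        # outputs: one computation output number per line, in workflow-output order
        with open(os.path.join(node_dir, "outputs")) as f:
            node["outputs"] = [line.strip() for line in f if line.strip()]
    return node
```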

Computations

The computations subdirectory of a task repository is usually a submodule, and the library of available computations is shared between most or all tasks. Each computation is referred to by a name that corresponds to a directory in the computations subdirectory or repository. The computation directory has a structure like this:

inputs
outputs
exec

The inputs file contains one line per required input, each line containing a symbolic name of the input; this information is used for debugging and error reporting. The outputs file contains one line per produced output in the same format.

The exec file shall be executable and will be executed when a call is issued. It receives #inputs + #outputs parameters with the names of files to read inputs from and to write outputs to.
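
For illustration, here is a hypothetical exec written in Python for a computation that declares one input and one output; the real exec can be any executable, and the trivial body below just stands in for the expensive computation:

```python
#!/usr/bin/env python3
# Hypothetical exec for a computation declaring one input and one output.
# CompEval only passes filenames as arguments: inputs first, then outputs.
import sys

def main():
    in_file, out_file = sys.argv[1], sys.argv[2]
    with open(in_file) as src, open(out_file, "w") as dst:
        # the actual expensive (but pure) computation would go here
        dst.write(src.read().upper())

if __name__ == "__main__":
    main()
```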

Possibly, exec would be a wrapper over the executable itself, e.g. making sure it's checked out and built on the current host. However, this is a different layer and entirely transparent to CompEval. In case this model is used, each exec file version should be tied to a single particular version of the main executable to ensure integrity of the whole evaluation and call reuse.

Global File Storage

All inputs and outputs encountered in the processing of a single job are stored in a "global file storage". This is simply a directory for now, possibly on a network filesystem; its semantics might be enriched in the future to work as a distributed file system.

Each output is assigned a filename based on the particular computation that produced it and the inputs that were used for its production:

c_cname/ccid/nn/input0_input1_...

where c_ is literal, cname is the symbolic name of the computation, ccid is the HEAD commit id of the computations/cname subdirectory, input0 etc. are SHA1 hashes of the inputs of the computation, and nn is the output number (00 for the first output, etc.). Only the first twelve digits of each hash are used in the filename.

Each task input is also stored in the global file storage for future reference, assigned a filename in the format

tinputs/hash2/hash10

where tinputs is literal and, taking the SHA1 hash of the contents of each task input, hash2 is its first two digits and hash10 the next ten digits.

The job outputs are stored in file names of the format

t_tname/tcid/nn/input0_input1_...

where t_ is literal, tname is the symbolic name of the task (determined from the name of the directory holding the Git repo), tcid is the HEAD commit id of the task repository, input0 etc. are SHA1 hashes of the global inputs of the task, and nn is the output number (00 for the first output, etc.). Only the first twelve digits of each hash are used in the filename.
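
The three naming schemes can be summarized in a short sketch. It assumes SHA1 over the raw file contents for the input hashes and applies the twelve-digit truncation to the commit ids as well as to the input hashes, which this README does not spell out explicitly:

```python
# Sketch of the storage-path conventions described above.  ASSUMPTIONS:
# input hashes are SHA1 over file contents, and the twelve-digit truncation
# also applies to the commit ids.
import hashlib

def sha1_of(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def call_output_path(cname, ccid, nn, input_hashes):
    # c_cname/ccid/nn/input0_input1_...
    return "c_%s/%s/%02d/%s" % (cname, ccid[:12], nn,
                                "_".join(h[:12] for h in input_hashes))

def task_input_path(content_hash):
    # tinputs/hash2/hash10 (first two digits, then the next ten)
    return "tinputs/%s/%s" % (content_hash[:2], content_hash[2:12])

def job_output_path(tname, tcid, nn, input_hashes):
    # t_tname/tcid/nn/input0_input1_... (same shape as call output paths)
    return "t_%s/%s/%02d/%s" % (tname, tcid[:12], nn,
                                "_".join(h[:12] for h in input_hashes))
```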

Usage

The CompEval tool is run on a computer where we wish to carry out the computation, at the root of the task repository. Its first argument is the path to the global file storage. The tool can be used for execution or inspection.

The basic command is ce-run [-n] STORAGE INPUTS.... This will start a job executing the current task's workflow tree, printing the tree in a friendly form as the execution proceeds, together with the filenames of inputs / outputs for further manual inspection.

The ce-run -n mode is a slight modification intended for inspection. It will also walk the current task's workflow tree, but it will only reuse already obtained results and will not start computations for missing results.
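
Conceptually, this inspection mode boils down to checking whether the expected result files are already present in the storage; a sketch, reusing the hypothetical call_output_path helper from the storage sketch above:

```python
# Sketch of the reuse check behind ce-run -n: report results already present
# in the storage, but never start a missing computation.
import os

def inspect_call(storage, cname, ccid, input_hashes, n_outputs):
    """Report which outputs of a call are already cached; never execute anything."""
    for nn in range(n_outputs):
        path = os.path.join(storage, call_output_path(cname, ccid, nn, input_hashes))
        status = "cached" if os.path.exists(path) else "missing (would be computed)"
        print("%s output %02d: %s  %s" % (cname, nn, status, path))
```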

Another command for inspection is ce-sym, which just prints the current task's workflow tree in a symbolic form using a LISP-like functional syntax.

A more visual alternative is ce-viz, which renders a graphical representation of the task's workflow, with inputs/outputs decorated with human-readable labels.