There is some interest in the bioinformatics community for using rake
as a workflow tool (see e.g. this blog post from BioinformaticsZen
).
Rake could be ideal for this type of work: a typical workflow will
take data and perform a first set of conversions on it (i.e. a task),
followed by a second set of conversions (that is dependent on the
first task), and so on. And obviously, bioinformaticians want to keep their data
in databases rather than files…
A typical Rakefile could look like this:
task :001_load_data do
end task :002_calculate_averages => [:001_load_data] do end task :003_make_histogram_of_averages => [:002_calculate_averages] do end
The trouble is that there is no way yet to check whether a task has to
be rerun or not, because there are no timestamps. Regular rake will
rerun all three tasks from the example above, regardless if some of
them have already been completed.
BioRake adds this task timestamp functionality to rake for working with
databases. The functionality needed is very similar to the one available for
FileTasks.
So if we had reloaded the data (001), the timestamp for that task in a
metadata would be later than the one for task 002. As a result, task
002 would automatically have to be rerun if we were to run task 003.
gem sources -a http://gems.github.com (you only have to do this once)
sudo gem install jandot-biorake
I’ve started to implement an additional type of task, called
event. The above snippet from a Rakefile would actually contain
event :001_load_data do ... end event :002_calculated_averages => [:001_load_data] do ... end
instead of using the task tag.
Similar to a FileTask, timestamps are used to check if certain tasks
have to be re-run or not. FileTasks have the advantage that every file
has a timestamp. To implement this the metadata of event completion
times is stored in the .rake directory inside the current directory.
A event task automatically:
- checks the metadata to see if the task has already been run
- if so: are there any prerequisites with timestamps that are newer than the task
itself? - (re)run the task if necessary
- update the metadata
To re-run all tasks from scratch issue a Rake::EventTask.clean or simply
rm -rf .rake
to reset the metadata to before any events have occured.
Even though the tests seem to run and I’ve tried some things out, I
can’t guarantee production-level stability (well: call it beta). Use
at your own risk.
The sample/ directory contains an example Rakefile. Suppose a
researcher has intensities for a group of individuals on a number of
probes. This information should be loaded into a database with the
tables individuals, probes and intensities.
As the intensities table contains foreign keys for individual and
probe, the individuals and probes tables have to be loaded
before the intensities can be loaded.
In rake-speak, this would look like:
event :load_probes do
load the actual data
endevent :load_individuals do
load the actual data
endevent :load_intensities => [:load_probes, :load_individuals] do
load the actual data
end
In a later step, the researcher might want to calculate the average
intensity per probe. This would be a new task that depends on the
intensities being loaded:
event :calculate_averages => [:load_intensities] do _calculate averages and store in probes table_ end
Here, we call the database that will contain the data sample.sqlite3. The
metadata about completed events is stored in the .rake directory.
Try a rake -T…