Documentation
gkno has been updated to make use of the NetworkX Python library for graphs. Pipelines are now constructed as a graph, as discussed later in the documentation.
Active integration of gkno and iobio, the web-based, real-time, visually driven analysis suite, is underway. Keep your eyes open for high-level integration later this year.
- Overview
- Terminology
- Installing gkno
- General gkno description
- Tool mode
- Pipeline mode
- Admin mode
- GNU make
- Logging
- Configuration files
- Tool configuration files
- Pipeline configuration files
- Handling filename stubs
- Instances
- Handling multiple data sets
- Resource management
- Add a resource
- Update a resource
- Remove a resource
- Available tools
- Mosaik
- Bamtools
- Freebayes
- Tutorials
- General tutorials
- Building and modifying pipelines
- Tools
- Pipelines
- A unified launcher bringing together software tools into a single environment,
- a general framework for generating linked lists of tools allowing integrated pipelines to be constructed,
- and a web environment providing easy access to documentation and tutorials, user forums, blogs and bug reporting.
The web environment and the Twitter feed @gknoProject keep people up to date on the work being performed with gkno and useful information that different users post in the forum. The documentation and tutorials provide clear instructions on how to download and execute gkno as well as more in depth information about the included tools, pipelines and configuration files.
A core goal of the package is to enable inexperienced users to simply download and execute predetermined analysis pipelines in order to generate sensible results for their research projects. The intricacies of the pipelines (including which processing tools and sensible parameter sets) are all hidden in configuration files and only advanced users need interrogate them.
##Terminology
Throughout this documentation, the terminology used is generally closely related to Python objects. In descriptions of json files, what in json terminology are termed objects are referred to as dictionaries and json arrays are termed lists. When Python libraries parse the json files, these are the Python objects in which the values are stored.
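The mapping described above can be seen with Python's standard json module. This is a minimal sketch; the json fragment is a made-up illustration and not an actual gkno configuration file.

```python
import json

# Hypothetical fragment written in the style of a configuration file.
text = '{"description": "sort a bam file", "tools": ["bamtools"]}'

config = json.loads(text)

# A json object is parsed into a dict and a json array into a list.
print(type(config).__name__)           # dict
print(type(config["tools"]).__name__)  # list
```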
The following tools are necessary to obtain and install gkno. They are probably already present on most Unix-style systems, but if not, are available via the links provided or through the system's package manager.
- Mac users can simply install Xcode. This is necessary anyway for other dependencies (gcc/g++), and already provides git "out of the box."
Type the following commands:
git clone https://github.com/gkno/gkno_launcher.git
cd gkno_launcher
./gkno build
The build step begins by checking the user's system for the following additional dependencies:
- ant
- cmake
- gcc / g++
- java / javac
- make
These mostly consist of support for building & running the component tools. If any of these are missing or older than the required minimum version, gkno will print a message stating which tool(s) need to be installed or updated. After all dependencies are satisfied, gkno will initialize all of its internal components by fetching & compiling software tools and then downloading default (tutorial) resource data.
Upon successful completion, the executable ./gkno
can now be used to run any of the tools and pipelines in gkno. This executable is also used to manage data resources for pipelines.
The following command:
./gkno run-test
will run a basic pipeline on tutorial data. In addition to checking that the internal components were built properly, this command provides the user a first look at gkno "in action" as it processes a pipeline.
The gkno launcher is designed to bring the wealth of next-generation DNA sequencing analysis software into a single, easy-to-use command line. The power of the launcher is the ability to bring together multiple tools into a single analysis pipeline with the minimum of required user input. A pipeline is defined in a configuration file that can be quickly and easily constructed and is then available for repeated use. When the command line is executed, gkno generates a makefile that is automatically executed (unless specified otherwise by the user) using the GNU make framework. This system ensures that each tool is aware of its file dependencies and includes rules that determine how all of the necessary files are to be created. If a tool fails, any files created in the failed step are deleted and the user is informed of where the problem occurred. This ensures that no partially constructed files are made available to the user, which could otherwise lead to analysis based on incomplete data. In addition, once the problem has been identified and fixed, rerunning the pipeline will resume from the latest possible point: files that were successfully generated in the first run will not be unnecessarily regenerated.
### Tool mode
_gkno_ provides the user access to all of its constituent tools. Each tool in _gkno_ is described by a configuration file in _json_ format. This file describes the executable commands, the tool location, all of the allowed command line arguments, the expected parameters, data types and default values. As far as possible, arguments that are common across tools are given the same names, which keeps the command lines consistent and makes it straightforward to switch between tools. In general, the user should have no need to deal with the configuration files, but a complete description of their format is given in the '_Configuration files_' section. A list of all the available tools can be seen by typing:
gkno --help
In order to run a tool, the user simply needs to specify the name of the tool to run. In order to get extra information (e.g. the available command line arguments), help can be displayed by typing:
gkno <tool> --help
gkno pipe --help
In order to see all of the available command line arguments for a particular pipeline, the following command line can be used:
gkno pipe <pipeline name> --help
Executing the command line above lists all of the arguments available as part of the specified pipeline. The pipeline arguments are not, however, the complete set of arguments available to all of the constituent tools. If the user wishes to set a parameter in one of the pipeline's tools, but this is not an available pipeline command line argument, the argument can still be accessed. To set arguments for a specific tool, the pipeline task is supplied as an argument and the task-specific arguments are enclosed in square brackets. For example, consider the pipeline build-moblist-reference. This pipeline uses the tool mosaik-jump for the task build-jump-database. The tool has the argument --iupac, but there is no pipeline argument to set this. The available arguments can be seen by typing:
gkno pipe build-moblist-reference --help
If this argument was required, the following command line would set it:
gkno pipe build-moblist-reference --build-jump-database [--iupac]
All of the arguments for the task (in this example, build-jump-database) are contained within the square brackets. The pipelines are designed so that the commonly used arguments for each of the constituent tools are accessible via the standard command line, but advanced options may require this syntax.
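To illustrate the bracket syntax, the sketch below separates pipeline-level tokens from task-specific ones. It is not gkno's actual parser: the function name split_task_arguments is hypothetical, and the logic only assumes the form shown above, where a task name given as an argument is followed by tokens enclosed in square brackets.

```python
def split_task_arguments(tokens):
    """Separate pipeline-level tokens from task-specific tokens in brackets.

    Hypothetical helper, written only to illustrate the syntax.
    """
    pipeline_args, task_args = [], {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if token.startswith("--") and following.startswith("["):
            task = token[2:]
            collected = []
            i += 1
            while True:
                if i >= len(tokens):
                    raise ValueError("unterminated '[' for task " + task)
                part = tokens[i]
                collected.append(part.strip("[]"))
                i += 1
                if part.endswith("]"):
                    break
            # Drop empty strings left behind by a lone '[' or ']'.
            task_args[task] = [c for c in collected if c]
        else:
            pipeline_args.append(token)
            i += 1
    return pipeline_args, task_args


pipeline_args, task_args = split_task_arguments(
    ["--build-jump-database", "[--iupac]"]
)
print(task_args)  # {'build-jump-database': ['--iupac']}
```

Keeping the task arguments grouped under the task name means they can later be passed to the tool that performs that task without being confused with pipeline arguments.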
###Admin mode
_gkno_ provides an "admin" mode with various features for updating _gkno_ and managing resources. The following commands are considered "admin" operations:
- gkno build: initialize gkno & build component tools. See installation section.
- gkno update: update component tools and check for available resource updates.
- gkno add-resource: add genome resource data.
- gkno remove-resource: remove genome resource data.
- gkno update-resource: update genome resource data.
See resources section for more info on the commands related to resource management.
###GNU make
The _gkno_ package uses the _GNU make_ system to execute tools and pipelines. On execution of a _gkno_ pipeline, a _makefile_ is generated. The general framework of the _makefile_ is a list of blocks, each describing the files that are required by a '_rule_' and the files that are output when the '_rule_' is executed. The _rule_ is itself one or more command lines. When executed (using the command ``make --file <makefile name>``), _make_ searches for the final required output files and all of the _dependencies_, i.e. the files that are required to make the output files. If the final files do not exist, or any of the dependencies are missing or were modified more recently than the output, _make_ will try to execute the rule. In the absence of some of the dependencies, _make_ will search for a _rule_ describing how to generate each missing dependency, and so on.

The important thing to note is that after the pipeline has been executed, it can be rerun at any point by using the ``make --file <makefile name>`` command. If all files generated by the pipeline exist and none of the input files are newer than the output files (i.e. none have been modified), no tools will be executed. If any files have been modified or deleted, the pipeline will begin execution at the point where these files are required. Already existing files will not be recreated.
If the same pipeline is being run multiple times, this can be important. Consider a Mosaik based alignment pipeline, whose first tasks prepare genome reference files. Once the reference files exist, the provided sequence reads are then aligned to the reference. If the pipeline is rerun for a different set of sequence reads, there is no need to regenerate all of the reference files, since these will be unchanged from the first run of the pipeline. So, when the pipeline is run for the second time, it will start with the read alignment tasks and use the already existing reference files.
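The timestamp comparison that drives this behaviour can be sketched in a few lines of Python. This is a simplified illustration of the rule described above, not a reimplementation of GNU make; the function name needs_rebuild is hypothetical.

```python
import os


def needs_rebuild(outputs, dependencies):
    """Return True if a rule would have to run under a make-style check.

    A rule runs when an output is missing, when a dependency is missing
    (it would first have to be generated by another rule), or when any
    dependency was modified more recently than the oldest output.
    """
    if not all(os.path.exists(path) for path in outputs):
        return True
    if not all(os.path.exists(path) for path in dependencies):
        return True
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    newest_dependency = max(
        (os.path.getmtime(path) for path in dependencies),
        default=float("-inf"),
    )
    return newest_dependency > oldest_output
```

In the alignment example, the reference files already exist and are newer than their own inputs, so a check of this kind reports that the reference-building tasks have nothing to do, while the alignment outputs for the new reads are missing and therefore have to be built.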
See the 'Using GNU make' tutorial for worked examples of using the GNU make framework.
###Logging
gkno usage is logged in order to keep track of which tools and pipelines are most commonly used in the community. Every time gkno is launched, an ID of the form tool/ or pipe/ is generated and sent back to the Marth lab. No information about the user/location etc. is tracked, just the tool or pipeline executed.
##Configuration files
The Python code describing the gkno launcher does not include any hard-coded information about any tools or pipelines. Instead, each tool and pipeline is described by a configuration file in json format.
This section of the documentation describes the format of the json configuration files in some detail and is not intended for the user just wanting to get started with the gkno package. For a more hands on description of how to use gkno or modify specific aspects of the configuration files, specific tutorials with worked examples have been developed. These are included in the documentation, but are also available on the gkno website under the Tutorials tab.
All of the configuration files are validated and processed using a separate Python library included with gkno. For the purposes of this documentation, when reference is made to the underlying Python code, this includes both that contained in gkno and that contained in configurationClass. For users wanting to interrogate the code base, note that all functions directly relating to the configuration files are handled by this separate class.
###Tool configuration files
The tool configuration files describe all of the information necessary to run each of the individual tools. There are many occasions where a single tool actually has multiple configuration files. Consider the tool _bamtools_; this tool comprises multiple modes and the command line arguments depend on the mode being used. For example, the command ``bamtools sort`` has only two possible arguments; the input _bam_ file and an optional flag. The command ``bamtools filter`` also has an argument for the input _bam_ file, but there are several other optional arguments. Instead of complicating the tool configuration files by building in logic that allows certain arguments depending on others, separate configuration files exist for each distinct mode of operation. Looking at the help (``gkno --help``) reveals that there are multiple different tools of the form _bamtools-_. Each of these configuration files contains the arguments relevant to the particular tool mode and no others.

The tool configuration file consists of a number of required and optional fields, summarised in the list below.
####Required fields
- arguments: a dictionary of all the valid command line arguments for this tool. See the 'Tool arguments' section for more details.
- description: a brief description of the tool and its role. This text appears in the pipeline help and so its inclusion is necessary in order to ensure clarity.
- executable: the name of the executable file.
- help: the help command for this tool (usually --help or -h).
- instances: define values to apply to the tool arguments. This is a very useful feature and is dealt with in detail in section 'Instances'.
- path: the location of the executable file within the gkno package.
- tools: a list of the names of the tools whose compilation is required for the tool to execute. The values in the list must be the names of tools in the gkno package.
####Optional fields
- argument delimiter: modifies the format of the argument/value pair on the command line. See the section 'Defining argument delimiters' for more details.
- argument order: the command lines for some tools do not use arguments, but the values on the command line are required to be in a specific order. For these tools, the argument order field lists all of the command line arguments in the order they must appear on the command line. See section 'Argument order' for more details.
- experimental: a flag that identifies the tool as experimental. This means that the tool is identified in the help as a tool that should be used with caution.
- hide tool: hide tools from the user.
- input is stream: some tools only operate on the stream and, as such, do not have command line arguments for the input files as the stream is assumed (ogap and bamleftalign are examples of such tools). By setting input is stream in the tool configuration file, gkno will ensure that files are piped to the tool.
- modifier: modifies the executable command with a suffix.
- precommand: modifies the executable command with a prefix.
Some of these fields can themselves include a number of options and require explanation. These are covered in more detail below.
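As a rough illustration of the required fields listed above, the sketch below reports which of them are absent from a parsed tool configuration. The field names come from the list; the helper name missing_required_fields and the example values are hypothetical, and the real validation is performed by the Python library included with gkno.

```python
# Required top-level fields of a tool configuration file, as listed above.
REQUIRED_TOOL_FIELDS = {
    "arguments",
    "description",
    "executable",
    "help",
    "instances",
    "path",
    "tools",
}


def missing_required_fields(config):
    """Return the required fields absent from a parsed configuration dict."""
    return sorted(REQUIRED_TOOL_FIELDS - set(config))


# Hypothetical, incomplete configuration used only for illustration.
example = {
    "description": "sort a bam file",
    "executable": "bamtools",
    "modifier": "sort",
}
print(missing_required_fields(example))
# ['arguments', 'help', 'instances', 'path', 'tools']
```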
####Tool arguments
The bulk of the tool configuration file is the definition of all the command line arguments available for the tool. The arguments for each tool are organised into different groups; each named group being a list of dictionaries. Each dictionary contains all of the required and optional information for a specific argument. All input files need to be in the _inputs_ group and all output files in the _outputs_ group. _gkno_ determines whether arguments point to files by their presence in these groups, so it is essential that this convention is followed. Outside of these groups, there can be as many groups with any name (although each group name can only be used once within the configuration file). Each dictionary within the group contains a combination of the following fields:
####Required fields
- command line argument: the argument that the tool expects to receive. The long form argument and short form argument fields define the argument for the gkno command line, but the argument expected by the tool is often different and so is defined here.
- data type: the expected data type associated with this argument. This can be one of the following: string, int, float, bool or flag. On the command line, all arguments will expect a value to be provided unless the data type is set to flag.
- description: a brief description of the command line argument used in the help messages.
- extensions: a list of the allowed extensions for the file associated with this argument (including the preceding '.'). If this argument is not associated with a file, this should be set to no extension.
- long form argument: a long form version of the command line argument.
- required: A Boolean indicating if the file associated with this argument is required for successful operation of the tool. If required is set to true and the file is not provided, gkno will terminate highlighting that this file is missing. If not present, it is assumed that this file is not required.
- short form argument: a short form version of the command line argument. For example, the argument could be --fastq and the short form would likely be -f.
####Optional fields
The optional fields are as follows:
- allow multiple values: a Boolean which, if set to true, instructs gkno that this argument can appear on the command line multiple times. For example, if multiple inputs can be defined and the input file argument is --in, using the command line
gkno tool --in a --in b
will result in a and b being stored in a list. The command line in the makefile will then include this argument multiple times, once for each supplied input. If the Boolean is not set to true (the default) and an argument is specified multiple times, gkno will terminate with a warning, rather than picking one of the supplied values or including all values on the command line.
- directory: instructs gkno that this argument points to a directory.
- filename extensions: lists the extensions of the filenames produced by this argument. This is only used for arguments that use filename stubs, so the is filename stub value should also be set. Handling filename stubs is dealt with in the 'Handling filename stubs' section.
- hide in help: is a Boolean, which if set ensures that this argument does not get displayed when help on the tool is requested. For occasions where this is useful, see section 'Defining additional input/output files'.
- if input is stream: applied to an argument for an input file, this field describes the behaviour of the command line argument if the input is a stream rather than a file. The value that accompanies this field can be either do not include or replace.
- If the value is do not include, the argument will be omitted from the command line. Internally, gkno will be keeping track of filenames in order to define filenames further along the pipeline, but the tool command line will no longer include this argument.
- If the value is replace, this argument must also include the field replace argument with. This is a dictionary that contains two keys, argument and value. argument is associated with a string, which is the text that the argument is replaced with. value is the value that accompanies the argument (if there is no value, this can be set to no value). For example, if the tool freebayes is fed a stream rather than an input bam file, the input argument --bam <file> needs to be replaced with --stdin. This is accomplished by including the following in the configuration file under arguments and the --bam section:
"if input is stream" : "replace",
"replace argument with" : {
  "argument" : "--stdin",
  "value" : "no value"
}
- is filename stub: identifies the argument as defining a filename stub, i.e. multiple files with this value are created with extensions defined by the filename extensions field. See the section 'Handling filename stubs' for more details on handling filename stubs in tools and pipelines.
- is stream: if set to true, identifies the argument as the one holding files to stream to the tool. If no stream is piped to the tool, and the tool expects a stream, the argument with the is stream field set will be used to generate a stream for the tool.
- if output to stream: allows modification of the argument (this is specifically for arguments defining output files) if the desired output is to a stream rather than a file. Currently, the only allowed value for this field is do not include. This is only of relevance if the tool is included in a pipeline as part of a set of piped tasks, and it isn't the final task in the pipe stream. In this situation, the command line will be modified to ensure that the pipes link the tools together successfully.
- list of input files: if true, the input is a list of files. If this is set, the following fields must also be present:
  - apply by repeating this argument: when constructing the command line, each value from the list of input files will be included on the command line using the argument defined by this field. This argument must be a valid argument for the tool and must have the allow multiple definitions field set to true, since it will be set multiple times. As an example, consider the freebayes arguments --bam and --bam-list. --bam is the argument for inputting bam files and allows multiple files. --bam-list allows the user to supply a file containing a list of bam files, and these will ultimately all appear on the freebayes command line with the --bam argument. The relevant fields are shown in the example below:
"--bam" : {
...
"input" : true,
"allow multiple definitions" : true
},
"--bam-list" : {
...
"input" : true,
"list of input files" : true,
"apply by repeating this argument" : "--bam"
}
- modify argument name on command line: allows modification of the argument before being written to the command line. Some tools have command line constructions that mean that there are no actual arguments (just values) on the command line, or instead of defining an output file, the output is sent to stdout etc. In order to standardise the gkno interface, all of the arguments are still defined in the configuration file, however, when it comes to constructing the command lines in the makefile, the individual tools' formats need to be respected. The modify argument name on command line can take one of the following forms:
  - hide: when constructing the command line, hide the argument and only write the value. For example, if a command line should have the form tool [input file] [output file], the configuration file may specify arguments --in and --out associated with the input and output files respectively. Other tasks in pipelines can then link to these arguments without any problems. If modify argument name on command line is left unset, the command line would take the form tool --in [input file] --out [output file], which would be inconsistent with that expected by the tool. By including "modify argument name on command line" : "hide" in the configuration file for both --in and --out, the command line would then take the required form.
  - stdout: as with hide, but instead of just omitting the argument, the argument is replaced with the stdout '>' operator.
  - stderr: as with stdout, but instead of using the operator '>', the stderr operator '2>' is used.
  - omit: nothing will be written to the command line for this argument. The argument is essentially a placeholder that allows linkage etc.
is filename stub: for some tools, the input or output is defined on the command line without any extensions. The tool itself takes this stub and determines the full filenames internally. Arguments of this type have the is filename stub field set to true. When set, the following field is also required:
-
filename extensions: a list of the output extensions that will be generated by the tool (including any preceding '.').
-
construct filename: instructions on how to construct the filename if it hasn't been explicitly set. This section is discussed in the 'Construct filenames' section.
In addition to the above required fields, the field modify text can also be included to define additional changes to be made. The modify text field is accompanied by a list of dictionaries, where each dictionary is permitted one and only one key/value pair describing one operation. When making changes to the filename, the instructions are executed in the order in which they appear in the list. The allowed instructions are:
- add argument values: accompanied by a list. This list consists of one or more valid arguments for the tool. The values associated with the arguments in the list will be appended to the filename prior to any extensions.
- add text: accompanied by a list of strings (usually only one since multiple strings in the list will just be concatenated). This string will be added at the end of the filename, but prior to any extensions.
- remove text: accompanied by a list of strings (again, usually only one). The defined text will be removed from the filename.
As an example, consider the hypothetical example illustrated below.
"extensions" : [".out", ".out.gz"],
"construct filename" : {
"method" : "from tool argument",
"use argument" : "--in",
"modify extension" : "replace",
"modify text" : [
{
"remove text" : ["_1"]
},
{
"add text" : ["_"]
},
{
"add argument values" : ["--value"]
},
{
"add text" : ["_test"]
}
]
}
Construction would proceed by checking that the argument --in is a valid argument for the tool. The extension associated with --in would also be determined. For this example, let's assume that the value associated with --in is input_1.in. The modify text instructions are then processed in order, starting with the remove text instruction. The filename is checked to ensure that it ends with _1 and then this is removed to give input.in. Next, the text _ is added giving input_.in and then the tool is checked to ensure that --value is a valid tool argument. Assuming that it is, the associated value (let's assume it is 10) is appended to the name to give input_10.in. Now the string associated with the final add text is added to give input_10_test.in. Finally, the instructions demand that the extension is replaced. In this case, the extension for --in is removed and replaced with the extension provided for the argument being constructed. In this case, the new extension can be .out or .out.gz. gkno chooses the first value in the list, so the final value associated with this argument is input_10_test.out.
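The walkthrough can be reproduced with a short Python sketch. It is an illustration of the steps described, not gkno's own code: the function construct_filename and its parameters are hypothetical, it uses add text for the final step as the walkthrough does, and it assumes the text modifications apply to the part of the name before the extension.

```python
import os


def construct_filename(source_value, instructions, argument_values, extensions):
    """Apply 'modify text' instructions in order, then replace the extension."""
    stem, _old_extension = os.path.splitext(source_value)
    for instruction in instructions:
        # Each dictionary is permitted exactly one key/value pair.
        ((operation, payload),) = instruction.items()
        if operation == "remove text":
            for text in payload:
                if not stem.endswith(text):
                    raise ValueError(f"{stem!r} does not end with {text!r}")
                stem = stem[: len(stem) - len(text)]
        elif operation == "add text":
            stem += "".join(payload)
        elif operation == "add argument values":
            stem += "".join(str(argument_values[arg]) for arg in payload)
        else:
            raise ValueError(f"unknown instruction: {operation}")
    # The first extension in the list is the one that is used.
    return stem + extensions[0]


instructions = [
    {"remove text": ["_1"]},
    {"add text": ["_"]},
    {"add argument values": ["--value"]},
    {"add text": ["_test"]},
]
name = construct_filename(
    "input_1.in", instructions, {"--value": 10}, [".out", ".out.gz"]
)
print(name)  # input_10_test.out
```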
####Define name
If method is set to define name, the filename is defined based on the contents of the construct filename block. When using this method, the following additional fields are required inside the construct filename block:
- add extension: is a Boolean, which if set to true will add the extension for this argument to the final value.
- filename: is the string to be used for the file, excluding the extension.
In addition to the required fields, the following optional fields can also be defined:
- directory argument: accompanied by a valid argument for the tool that defines a directory. If this is set, the final filename will be prepended with the value associated with the directory argument followed by a '/'.
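A minimal sketch of the define name method, based only on the three fields described above. The function name define_name, the argument --out-dir and the values used are hypothetical.

```python
def define_name(block, extension, argument_values):
    """Build a filename from a 'construct filename' block using 'define name'."""
    name = block["filename"]
    if block["add extension"]:
        name += extension
    directory_argument = block.get("directory argument")
    if directory_argument is not None:
        # Prepend the directory value followed by a '/'.
        name = argument_values[directory_argument] + "/" + name
    return name


block = {
    "method": "define name",
    "filename": "reference",
    "add extension": True,
    "directory argument": "--out-dir",  # hypothetical argument name
}
print(define_name(block, ".fa", {"--out-dir": "results"}))  # results/reference.fa
```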
{
"description" : "the index file.",
"long form argument" : "--out",
"short form argument" : "-o",
"command line argument" : "-out",
"input" : false,
"output" : true,
"required" : true,
"data type" : "string",
"extensions" : [".bai"],
"hide in help" : true,
"include on command line" : false,
"construct filename" : {
"method": "from tool argument",
"use argument" : "--in",
"modify extension" : "append"
}
}
This argument does not need to be manually set and, since the field hide in help is set to true, the user will not know that the argument exists at all. In addition, the include on command line field is set to false, so when the makefile is constructed, this argument will be ignored. However, the argument is required and its value is constructed as the value from --in, with the extension .bai appended as required. By including this argument in the configuration file, the outputs from this tool will be set. In addition, when working with pipelines, other tools can link to this index file. See the section 'Additional dependencies' for further information.
"argument delimiter" : "="
If the argument delimiter block is omitted, the default value is a single space.
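The effect of the delimiter on an argument/value pair can be sketched as follows. The helper format_argument is hypothetical; only the field name and its default of a single space come from the text above.

```python
def format_argument(argument, value, config):
    """Join an argument and its value with the configured delimiter."""
    delimiter = config.get("argument delimiter", " ")
    return f"{argument}{delimiter}{value}"


print(format_argument("--in", "a.bam", {}))                           # --in a.bam
print(format_argument("--in", "a.bam", {"argument delimiter": "="}))  # --in=a.bam
```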
####Hiding tools from the user
There are some tools included in the _gkno_ package that have peculiar command lines or are only intended for use in a piped stream. For example, the _ogap_ tool expects to have a _bam_ file piped into it and outputs a _bam_ file to the stream. It is not straightforward to use these tools from the _gkno_ command line and so it is desirable to hide these tools from view. For example, in the list of available tools (``gkno --help``), _ogap_ and _bamleftalign_ are not visible. While these can't be seen as available tools, they can still be used in constructing pipelines like any other tool. To hide a tool, the _hide tool_ block takes the form:
"hide tool" : true
If this block is omitted, the tool is assumed to be visible.
####Modifiers to the executable command
Some of the tools included in the _gkno_ package appear in a command line with modifiers before or after the actual executable file. Tools that use _java_ may require some additional text before the executable file, for example ``java -Xmx4g -jar``. The _executable_ block defines the name of the executable file and is used to check that the executable actually exists, so it cannot be modified to include this additional text. In order to ensure that the command line is correctly constructed, the _precommand_ block can be used to define this additional text. For the _java_ example, the tool configuration file would include the block:
"precommand" : "java -Xmx4g -jar"
The makefile would then correctly construct the executable command. There are also cases where text needs to be added after the executable. bamtools is a suite of tools that operates on bam files. The tool is constructed such that the executable file is called bamtools, but then the specific operation within bamtools needs to be defined. If the desired operation is the sorting of a bam file, the command line would have the form bamtools sort [arguments]. The text sort can be defined using the modifier section. For this example, the modifier block would have the form:
"modifier" : "sort"
If these sections are omitted from the configuration file, the default operation is to include the executable file only in the command line, followed by the defined arguments.
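The way the three blocks combine can be sketched as below. This is an assumption-laden illustration: the helper executable_command is hypothetical, the path field is ignored for brevity, and ExampleTool.jar is a made-up executable name.

```python
def executable_command(config):
    """Join the optional precommand, the executable and the optional modifier."""
    parts = [
        config.get("precommand"),
        config["executable"],
        config.get("modifier"),
    ]
    # Omitted blocks are skipped, leaving the executable on its own.
    return " ".join(part for part in parts if part)


print(executable_command({"executable": "bamtools", "modifier": "sort"}))
# bamtools sort
print(executable_command(
    {"precommand": "java -Xmx4g -jar", "executable": "ExampleTool.jar"}
))
# java -Xmx4g -jar ExampleTool.jar
```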
####Argument order
Some tools do not have arguments at all on the command line; instead, the values are supplied in a specific order. For example, a tool command line can be of the form:
tool <option 1> <option 2> [input file] [output file]
The tool configuration file will provide command line arguments for each of these options and files, so that the gkno command line is consistent with all other tools. However, when the command line is written out in the makefile, the above syntax must be replicated. Within the argument definitions, the field modify argument name on command line is set to hide, ensuring that the arguments themselves are not written to the command line, only their associated values. In order to ensure that the values are written out in the correct order, the argument order field defines a list that includes all of the arguments for this tool, in the order they should appear on the command line. So for the example command line above, the argument order would be defined as:
"argument order" : [
"--option1",
"--option2",
"--input",
"--output"
]
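A sketch of how argument order and hide might combine when the command line is written out. The helper ordered_command_line and the example values are hypothetical; only the ordering and hiding behaviour follow the description above.

```python
def ordered_command_line(executable, argument_order, values, hidden):
    """Write values in the configured order, omitting hidden argument names."""
    tokens = [executable]
    for argument in argument_order:
        if argument not in values:
            continue
        if argument not in hidden:
            tokens.append(argument)
        tokens.append(str(values[argument]))
    return " ".join(tokens)


order = ["--option1", "--option2", "--input", "--output"]
# Values are given in an arbitrary order, as they might be on the gkno command line.
values = {"--output": "out.txt", "--option2": 2, "--input": "in.txt", "--option1": 1}
print(ordered_command_line("tool", order, values, hidden=set(order)))
# tool 1 2 in.txt out.txt
```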
####Required fields
- description: a brief description of the pipeline.
- instances: definitions of values for the pipeline arguments. See section [Instances](#instances) for more details.
- nodes: describes the details of the pipeline, including pipeline arguments and the logical connection of tools. This is described in detail in section [Pipeline nodes](#pipeline_config_nodes).
- tasks: defines tasks in the pipeline with necessary information about each task. This is discussed in detail in section 'Pipeline tasks'.
####Optional fields
- experimental: a flag that identifies the pipeline as experimental. This means that the pipeline is identified in the help as one that should be used with caution.
Each of these components of the pipeline configuration file is discussed in detail in the following sections. The tutorials in 'Building and modifying pipelines' give worked examples of how to construct a basic pipeline configuration file and how to add the further options. Please refer to these for examples.
####Pipeline tasks
A pipeline is basically a set of tools that are executed in a specific order, passing files between them. The __tasks__ section of the configuration file defines a name for each task in the pipeline. Each task is an operation to be performed and must have a unique name. For each task, the tool used to perform the task is specified using the required __tool__ field. Additionally, the task can be identified as outputting to a stream rather than a file using the optional __output to stream__ field. If this is set, _gkno_ will check that the tool is able to output to a stream and that the next task in the pipeline is capable of accepting a streaming input. As an example, consider a simple pipeline that calls variants with _freebayes_, then streams the output _vcf_ file into _vcflib_ for filtering. The __tasks__ section would take the form:
"tasks" : {
"variant-call" : {
"tool" : "freebayes",
"output to stream" : true
},
"filter-variants" : {
"tool" : "vcflib-filter"
}
}
The order in which the tasks are defined is unimportant, since the order in which the tasks are executed is determined by the flow of files through the pipeline. However, it is typical to list the tasks in the order in which the user expects them to be run.
####Pipeline nodes
The __nodes__ section consists of a list of dictionaries. Each one of these dictionaries has a set of required and optional fields as outlined below. The nodes are used to define arguments that can be used on the command line for the pipeline and to which tasks/arguments the assigned values point. In addition, the nodes define which tasks share which arguments and define how the information passes through the pipeline. The allowed fields are listed below, and those that require further explanation have separate sections in this documentation.
#####Required fields
- description: describes the values associated with the node. If the node is assigned arguments, the description is what appears in the help message for this argument.
- ID: is a unique identifier for the node.
- tasks: is a dictionary of task/argument pairs. It is from this that the pipeline data flow is derived and so this is extremely important. The tasks section is described in more detail in section 'Pipeline task nodes'.
#####Optional fields
- delete files: a Boolean that, if set to true, instructs gkno to delete files associated with this node. gkno will determine when the files can be deleted. See section 'Deleting intermediate files' for more details on how to handle intermediate files.
- evaluate command: replaces the value for an argument with a command to evaluate at execution time. See 'Evaluating commands at execution time' for details.
- extensions: is only used in special cases where, for example, a task input is the output of a previous task, but the previous task output is a filename stub. This is looked at in more detail in the section 'Handling filename stubs'.
- greedy tasks: similar to the required tasks field above, but instructs gkno that the contained task arguments are greedy. This is only a concern if multiple sets of input files have been provided to the pipeline, and indicates that all of the sets of files passing through the pipeline should be used together for this task. This is covered in more detail in the section 'Handling multiple data sets'.
- long form argument: defines the long form of the command line argument. Other fields in the node will connect this argument to arguments associated with tasks within the pipeline. It is conventional to use arguments that mirror the arguments in the tools. For example, this node might allow the user to define a file, say file.fastq. This file might be used by several different tasks in the pipeline. An attempt has been made to standardise the command line arguments in all the tools, so ideally all of those tools use an argument like --fastq. In this case, the argument in the pipeline should also be set to --fastq.
- required: indicates if an argument is required. See the section 'Setting required pipeline arguments' for details and examples of when this is necessary.
- short form argument: the short form of the command line argument.
There are cases where there are no defined long or short form arguments. When this happens, the user doesn't have the opportunity to set these values (although the syntax described in 'Pipeline mode' to set the values of tasks in the pipeline is still valid). This is usually used to link the output of a task with the input of another task(s). In this case, the configuration file node would have the form:
{
"ID" : "link",
"description" : "Linking tasks",
"tasks" : {
"task-1" : "--out",
"task-2" : "--in"
}
}
In this case, the file output by task-1 would be used as the input for task-2. Since there is no argument associated with this node, nothing about it would be shown in the pipeline help.
The greedy tasks field has the same form as tasks, but is used for cases where there are multiple data sets being processed by the pipeline. This is dealt with in section 'Handling multiple data sets'.
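As a rough illustration of how a linking node implies data flow between tasks, the following Python sketch derives producer/consumer pairs from a node's tasks dictionary. It is not gkno's code: the function `edges_from_node` is hypothetical, and the knowledge of which arguments are outputs (which in practice would have to come from the tool configuration files) is passed in by hand as an assumption.

```python
def edges_from_node(node, output_arguments):
    """Return (producer, consumer) task pairs implied by one pipeline node.

    output_arguments: set of (task, argument) pairs known to be outputs.
    This lookup is an assumption made for the sketch.
    """
    producers = []
    consumers = []
    for task, argument in node["tasks"].items():
        if (task, argument) in output_arguments:
            producers.append(task)
        else:
            consumers.append(task)
    # Every consumer depends on every producer sharing this node.
    return [(p, c) for p in producers for c in consumers]


node = {
    "ID": "link",
    "description": "Linking tasks",
    "tasks": {"task-1": "--out", "task-2": "--in"},
}
edges = edges_from_node(node, output_arguments={("task-1", "--out")})
print(edges)  # [('task-1', 'task-2')]
```

Adding further consuming tasks to the tasks dictionary would simply yield one more pair per task.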
####Pipeline workflow
The pipeline workflow is the order in which the tools are executed. This is determined by _gkno_ by performing a topological sort on the pipeline graph. After all tasks have been assigned to the graph as task nodes, all arguments are given option nodes and files are assigned file nodes. All option nodes are joined to the relevant task node by an edge (all option nodes are predecessors to the tasks, since they are providing information to the task). File nodes can either precede or succeed the task node depending on whether they are input or output files for the task. Performing a topological sort provides a non-unique order in which the tasks are executed, but it is assured that tasks that depend on the output of other tasks will appear after them in the workflow. While this workflow is unimportant for the produced _makefile_, it is useful for giving a human readable flow of tasks that helps the user understand the role of the pipeline. As an example, the pipeline _build-moblist-reference_ takes two _fasta_ files as input, merges them and then generates a reference file in _Mosaik_ native format as well as a set of _Mosaik_ jump database files. The pipeline graph is illustrated in the following figure:
Performing a topological sort on this graph yields the workflow:
1. merge-fasta
2. build-reference
3. build-jump-database
4. create-sequence-dictionary
5. index-fasta
From the graph, it is clear that step 1 must occur first. After that, the only consideration is that step 3 must occur after step 2, but steps 2, 4 and 5 could be performed in any order. This is why the workflow is non-unique.
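The ordering constraints described above can be reproduced with a small topological sort. The sketch below uses Kahn's algorithm on a task-only graph, written with the Python standard library only; it is not taken from gkno (which works on a graph that also contains option and file nodes), and the dependency edges are assumptions inferred from the description of which steps must precede which.

```python
from collections import deque


def topological_sort(edges, tasks):
    """Kahn's algorithm: order tasks so each follows its predecessors."""
    successors = {task: [] for task in tasks}
    in_degree = {task: 0 for task in tasks}
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1

    # Tasks with no unmet dependencies are ready to be scheduled.
    ready = deque(task for task in tasks if in_degree[task] == 0)
    workflow = []
    while ready:
        task = ready.popleft()
        workflow.append(task)
        for successor in successors[task]:
            in_degree[successor] -= 1
            if in_degree[successor] == 0:
                ready.append(successor)

    if len(workflow) != len(tasks):
        raise ValueError("the pipeline graph contains a cycle")
    return workflow


tasks = [
    "merge-fasta",
    "build-reference",
    "build-jump-database",
    "create-sequence-dictionary",
    "index-fasta",
]
# Assumed dependencies, inferred from the text above.
edges = [
    ("merge-fasta", "build-reference"),
    ("build-reference", "build-jump-database"),
    ("merge-fasta", "create-sequence-dictionary"),
    ("merge-fasta", "index-fasta"),
]
workflow = topological_sort(edges, tasks)
print(workflow)
```

With these edges the sketch places build-jump-database last rather than third. Both orders satisfy every dependency, which is exactly the non-uniqueness described above.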
####Setting required pipeline arguments
If an argument is required by a tool within the pipeline, not setting the pipeline argument that points to the particular task argument will result in an error. This is because the pipeline will fail to execute if any of the constituent tools within it do not have all their required parameters. There are cases where an argument for a tool is optional, but in the context of a pipeline, the argument needs to be set. In this case, including the field ``"required" : true`` in the pipeline configuration node will ensure that _gkno_ terminates with an error if the argument isn't set.
As an example, consider a pipeline that processes paired end reads and expects two fastq files. Consider an aligner with a required argument --fastq. This is required since no alignment can take place without some reads. A second argument, --fastq2, is optional. If set, the aligner will work in a paired end mode; otherwise it will assume all reads are single ended. In a paired end read pipeline, it is necessary that both --fastq and --fastq2 are set. In this case, the pipeline configuration file will include nodes linking to each of these tool arguments as follows:
...
{
"ID" : "first mate",
"description" : "The file containing the sequence reads for the first mate",
"long form argument" : "--fastq",
"short form argument" : "-q",
"tasks" : {
"aligner" : "--fastq"
}
},
{
"ID" : "second mate",
"description" : "The file containing the sequence reads for the second mate",
"long form argument" : "--fastq2",
"short form argument" : "-q2",
"required" : true,
"tasks" : {
"aligner" : "--fastq2"
}
}
The node with the ID 'first mate' does not need to be specified as required, as it is already set as required in the aligner's own configuration file. Since the tool configuration file does not list --fastq2 as required, it needs to be identified as such for the purposes of this pipeline.
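The pipeline-level check described here can be sketched in a few lines of Python. This is an illustration only: the function `missing_required` is hypothetical, it looks solely at the "required" field of the pipeline nodes, and the separate check against each tool's own configuration file is deliberately left out.

```python
def missing_required(nodes, supplied):
    """Return the long form arguments of required nodes that were not set.

    nodes: list of pipeline node dictionaries.
    supplied: dict of long form argument -> value given by the user.
    """
    missing = []
    for node in nodes:
        argument = node.get("long form argument")
        if node.get("required", False) and argument not in supplied:
            missing.append(argument)
    return missing


nodes = [
    {
        "ID": "first mate",
        "long form argument": "--fastq",
        "tasks": {"aligner": "--fastq"},
    },
    {
        "ID": "second mate",
        "long form argument": "--fastq2",
        "required": True,
        "tasks": {"aligner": "--fastq2"},
    },
]
# Only the first mate is supplied, so --fastq2 is reported as missing.
missing = missing_required(nodes, {"--fastq": "reads_1.fq"})
print(missing)  # ['--fastq2']
```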
####Deleting intermediate files
{
"ID" : "example",
"description" : "delete files example",
"delete files" : true,
"tasks" : {
"task A" : "--out",
"task B" : "--in"
}
}
The tasks section above instructs the pipeline to link the output of task A to task B. By including the field "delete files" : true, gkno is instructed to remove the file when the pipeline no longer needs it. gkno does not wait until the pipeline has completed to delete the file, as this could allow the intermediate files to overwhelm the available disk space. The file is also listed at the top of the makefile in the .INTERMEDIATE section. This ensures that if the pipeline is rerun, this file will not be regenerated unless earlier files have been modified and task B needs to be rerun.
####Evaluating commands at execution time
{
"ID" : "example",
"description" : "evaluate command example",
"long form argument" : "--in",
"short form argument" : "-i",
"tasks" : {
"taskA" : "--in"
},
"evaluate command" : {
"command" : "shell head -1 FILE1",
"add values" : [
{
"ID" : "FILE1",
"task" : "taskB",
"argument" : "--out"
}
]
}
}
This node describes the pipeline command line argument --in. If a value for this is given on the command line, the value will be used. If there is no value supplied, the defined command will be used instead. The command can contain as many unique ID strings as required; each ID is then defined in the add values list. Each dictionary in that list must contain the three fields ID, task and argument, and the ID must be present in the command. In this example, the task taskB generates the file file.txt from the argument --out and so the generated makefile will have the following line in the command line for taskA:
--in $(shell head -1 file.txt)
When the command line for taskA is executed, file.txt will be interrogated and the resulting value used.
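The substitution of ID strings can be sketched as follows. The function `expand_evaluate_command` and the `produced_files` lookup are hypothetical and only mimic the behaviour described above; the real mapping from a task argument to a filename is resolved by gkno from the rest of the pipeline.

```python
def expand_evaluate_command(evaluate, produced_files):
    """Build the makefile text used when the argument has no value.

    evaluate: the "evaluate command" dictionary from the node.
    produced_files: maps (task, argument) -> filename. This lookup is
    an assumption made for the sketch.
    """
    command = evaluate["command"]
    for entry in evaluate["add values"]:
        if entry["ID"] not in command:
            raise ValueError("ID %s is not present in the command" % entry["ID"])
        filename = produced_files[(entry["task"], entry["argument"])]
        command = command.replace(entry["ID"], filename)
    return "$(" + command + ")"


evaluate = {
    "command": "shell head -1 FILE1",
    "add values": [
        {"ID": "FILE1", "task": "taskB", "argument": "--out"},
    ],
}
text = expand_evaluate_command(evaluate, {("taskB", "--out"): "file.txt"})
print("--in " + text)  # --in $(shell head -1 file.txt)
```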
- --debug (-db): prints out messages throughout operation detailing tasks that have been completed. This is useful for identifying sources of error.
- --do-not-execute (-dne): is a flag defining whether gkno should execute the scripts after creating them. If not specified, gkno will automatically execute the makefile. This behaviour is overridden if multiple makefiles are created or if required files/executables are missing.
- --do-not-log-usage (-dnl): is a flag that ensures that this usage isn't logged. This is usually used in development to avoid skewing usage stats.
- --draw-pipeline-graph (-dpg): defines the name of a file to output a .dot format file that can be plotted using graphviz.
- --export-instance (-ei): tells gkno to generate a new instance in the instances configuration file. See the specific tutorial for further information on this option.
- --input-path (-ip): the input path for all input files whose path is unspecified. If a path is given with a filename, that path is used; otherwise the assumption is that the files reside in the current working directory. Setting --input-path will force gkno to assume all input files without a path (except for resource files, see below) are available in the path specified by --input-path.
- --instance (-is): defines the instance to use. This sets a number of parameters for the pipeline.
- --internal-loop (-il): only selectable for pipelines with a defined internal loop. This argument names the json file defining multiple sets of input files/parameters.
- --multiple-runs (-mr): informs gkno that a json file is provided containing multiple sets of input files/parameters. See the specific tutorial for further information on this option.
- --no-hard-warnings (-nhw): is a flag which, if set, removes the requirement for the user to press 'Enter' to provide acknowledgement of an error/warning. There are very few cases where the user's response is required, so it is recommended that this is not turned off.
- --number-jobs (-nj): the number of parallel jobs to execute. This is only applicable when running a pipeline utilising an internal loop.
- --output-path (-op): similar to the input path. All output files are output to the --output-path unless a path is provided with the filename.
- --task-stdout (-ts): if set, each task will generate its own stdout and stderr file. The default behaviour is to produce a single stdout/stderr file for the pipeline.
- --timing (-tm): includes the time command for each task in the pipeline. On completion, the timing information for each task is output to stderr.
- --verbose (-vb): is a flag used to tell gkno whether to output verbose information to screen as gkno runs.
If included, this section is just a list of tasks that output to the stream. In the makefile, all tasks contained in this list output to the stream and so will be linked to the next task in the pipeline workflow by a pipe. If consecutive tasks appear in this list, then there will be multiple tasks linked by pipes in a single command line.
For further information on this feature and worked examples demonstrating its use, see the relevant pipeline tutorial.
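The way consecutive streaming tasks collapse into a single piped command line can be sketched as below. The function `group_streamed_tasks` and the task names are hypothetical, and each task name stands in for the full command that would appear in the makefile.

```python
def group_streamed_tasks(workflow, streaming):
    """Group tasks into command lines; a streaming task pipes into the next.

    workflow: task names in execution order.
    streaming: set of task names that output to the stream.
    """
    command_lines = []
    current = []
    for task in workflow:
        current.append(task)
        if task not in streaming:
            # A task writing to a file ends the current command line.
            command_lines.append(" | ".join(current))
            current = []
    if current:
        # A trailing streaming task has nothing to pipe into.
        command_lines.append(" | ".join(current))
    return command_lines


lines = group_streamed_tasks(
    ["align", "sort", "call", "filter"],
    streaming={"sort", "call"},
)
print(lines)  # ['align', 'sort | call | filter']
```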
####Internal loops (optional)
It is often the case that a user will want to run a pipeline multiple times for a set of input data. There are two distinct use cases for these pipelines. The first is that the entire pipeline needs to be run from start to finish for a specific input file(s) and the user has more than one such set of input files. _gkno_ can be set to accept a _json_ file containing the input parameters for each required pipeline execution and a _makefile_ will be created for each set of files. These files can be executed serially or sent to a cluster environment for execution. This use case is covered in more detail in the [_Performing multiple runs of a pipeline_](#tutorial_multiple_runs) section.
The second use case uses what we term internal loops. The following figure demonstrates a possible pipeline configuration:
'Task A' is the first task to be executed and its output is used as input to 'Task B', which in turn is executed and produces output files. 'Task C' requires as input, the output from 'Task B', but also some files defined by the user. The internal loop refers to the set of tasks (in this case, tasks C and D) that are run multiple times for different input files, but are all independent of each other. As soon as 'Task B' is complete, as many jobs as required can be spawned in parallel to execute tasks C and D for all defined input files. 'Task E' requires as input the outputs of all of the 'Task Ds' in the internal loop and so is not an independent task and has to wait until all of the 'Task Ds' have been completed.
A real example of a pipeline with this format is fastq-vcf. The first tasks in the pipeline are concerned with preparation of the reference sequence and must be completed prior to any read alignments. The internal loop consists of aligning fastq files and some post-processing (e.g. sorting and indexing) of the bam files. Finally, variant calling is performed using all of the bam files and thus is dependent on all of the tasks in the internal loop being complete. If the user has, for example, three pairs of fastq files, using the internal loop allows the three alignments to be performed in parallel (the number of parallel jobs is controlled using the --number-jobs (-nj) command line parameter).
To include an internal loop in a pipeline, the "internal loop" section needs to be included in the pipeline configuration file. This section is simply a list of the tools (in the order that they appear in the "workflow") that should be included in the internal loop. With this section defined, the functionality can be accessed using the --internal-loop (-il) command line argument and providing a json file with the input files/parameters for each iteration of the internal loop. For details on using the internal loop, see the 'Running a pipeline using the internal loop' section.
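The dependency structure of an internal loop can be illustrated with a short Python sketch. gkno itself generates a makefile rather than running tasks from Python, so this shows only the shape of the computation: the loop tasks run independently for each input set, with at most `number_jobs` sets in flight, and the final task waits for all of them. All function names and the placeholder tasks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_internal_loop(input_sets, loop_tasks, number_jobs):
    """Run every loop task, in order, for each input set.

    Different input sets are processed in parallel, limited to
    number_jobs at a time (analogous to --number-jobs).
    """

    def run_one(input_set):
        result = input_set
        for task in loop_tasks:
            result = task(result)
        return result

    with ThreadPoolExecutor(max_workers=number_jobs) as pool:
        # map() returns results in the same order as input_sets.
        return list(pool.map(run_one, input_sets))


# Placeholder tasks standing in for 'Task C', 'Task D' and 'Task E'.
def task_c(name):
    return name + ".aligned"


def task_d(name):
    return name + ".sorted"


def task_e(results):
    return "+".join(results)


loop_outputs = run_internal_loop(["s1", "s2", "s3"], [task_c, task_d], number_jobs=2)
final = task_e(loop_outputs)  # runs only after every loop iteration is done
print(final)  # s1.aligned.sorted+s2.aligned.sorted+s3.aligned.sorted
```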
In addition, the resource data for any particular organism is not static; it goes through modifications as its reference is corrected, new variants are catalogued, and so on. Snapshots of this process are often released over time (e.g. human build 35, 36, 37, ...). The gkno project refers to such a snapshot as a release.
gkno's resource-management commands allow the user to manage multiple resources as well as multiple releases for each, if needed. For example:
- Resource A
- Release-1
- Release-2
- Resource B
- Release-1
- Resource C
- Release-4
Note - Any resources & releases referred to in this section are only those that have been bundled and made available by the gkno project. Users may certainly provide their own data files and run analysis using them, but will not be able to manage those files using gkno's resource commands.
### Add a resource
gkno add-resource
will display a list of all genome resources that gkno is hosting and can fetch. Any resources preceded by a '*' have already been added.
gkno add-resource <organism>
will download that organism's current release. The files will be stored under the organism's directory, in a subdirectory matching the release name (e.g. resources/homo_sapiens/build_37). In addition, a symlink (shortcut) named "current" will be created in the organism's directory that points to the current release. This allows the user to refer to "resources/<organism>/current" as the resource path in pipeline scripts and always use the most up-to-date release data. This organism is also considered "tracked" for later updates, see below.
gkno add-resource <organism> --release
will display a list of all available releases for that genome. Any releases preceded by a '*' have already been added.
gkno add-resource <organism> --release <release_name>
will download a particular genome release. The files will be stored under the organism's directory, in a subdirectory matching the release name (e.g. resources/homo_sapiens/build_36.2). The "current" release symlink is not created or moved.
### Update a resource
Running gkno update will check all tracked resources for new releases. The gkno team realizes that a user may not always wish to automatically update data. Therefore, if any updates are found, a summary message is printed to the screen. The user may type the following command later, at their discretion:
gkno update-resource <organism>
This actually performs the update - downloading the new release files and moving that organism's "current" release symlink to point to it.
The new release may be fetched without moving the "current" symlink, by using the named-release version of gkno add-resource described above.
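The directory layout and the role of the "current" symlink can be illustrated with the following Python sketch. It is not gkno's code: the function `set_current_release` is hypothetical, the release directories are created empty in a temporary location, and a POSIX file system with symlink support is assumed.

```python
import os
import tempfile


def set_current_release(organism_dir, release_name):
    """Point the 'current' symlink in organism_dir at the given release."""
    link = os.path.join(organism_dir, "current")
    temporary = link + ".tmp"
    if os.path.lexists(temporary):
        os.remove(temporary)
    # A relative target keeps the link valid if the tree is moved.
    os.symlink(release_name, temporary)
    # Renaming over the old link swaps it in a single step.
    os.replace(temporary, link)
    return link


root = tempfile.mkdtemp()
organism = os.path.join(root, "resources", "homo_sapiens")
for release in ("build_36.2", "build_37"):
    os.makedirs(os.path.join(organism, release))

set_current_release(organism, "build_36.2")
link = set_current_release(organism, "build_37")  # simulates an update
resolved = os.path.basename(os.path.realpath(link))
print(resolved)  # build_37
```

A script that refers to resources/<organism>/current therefore picks up the new release without being edited, while the older release directory remains in place.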
### Remove a resource
gkno remove-resource
will display a list of all organisms with resource files that can be removed.
gkno remove-resource <organism>
will remove all releases for an organism.
gkno remove-resource <organism> --release
will display a list of releases for that organism that can be removed.
gkno remove-resource <organism> --release <release_name>
will remove a particular genome release.
## Available tools
The toolkit is dynamic and extra tools can be added by the Marthlab or others (in collaboration with the Marthlab). A list of currently available tools, along with a brief description and links to references, is included below:
### Mosaik
Mosaik is the Marthlab's sequence read alignment software and comprises multiple elements, each of which is described below.
**MosaikBuild**
MosaikBuild is used to convert a fasta format reference file into a native format used by the alignment software. Sequence reads themselves also require conversion into a format that the aligner can read. This is also achieved using MosaikBuild.
**MosaikJump**
A hash-based algorithm is used to perform alignments within Mosaik. To facilitate this, a jump database is required. This database is generated using the MosaikJump utility.
**MosaikAligner**
MosaikAligner description.
### Bamtools
Bamtools description.
### Freebayes
Freebayes description.