Create and activate the environment specified in environment.yml, then install the package:
conda env create --file environment.yml
conda activate learn2thermDB
pip install .
Ensure that the following environment variables are set in order to access full functionality (an example export snippet follows the list):
- ENV_EMAIL - The email will be used to access the NCBI FTP in s0.0 and BacDive in s0.1
- NCBI_API_KEY - API key from NCBI. Needed to get 16S sequences in s0.0. See here.
- LOGLEVEL (optional) - Logging level used to run the package, e.g. 'INFO' or 'DEBUG'
- FATCAT_EXEC - Necessary for the structural alignment steps of validation, s2.11 and s2.12
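As a minimal sketch, these variables could be exported in your shell profile or in cluster job scripts before running the pipeline; all values below are placeholders:

```bash
# Placeholder values -- replace with your own credentials and paths
export ENV_EMAIL="you@example.com"          # used for NCBI FTP and BacDive access
export NCBI_API_KEY="<your NCBI API key>"   # needed to retrieve 16S sequences
export LOGLEVEL="INFO"                      # optional; e.g. INFO or DEBUG
export FATCAT_EXEC="/path/to/FATCAT"        # required for structural alignment (s2.11, s2.12)
```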
Configure DVC as desired. This is not required; however, here are a few recommendations (example commands are sketched below):
- Add a remote data repository
- Use hardlinking instead of copying to the DVC cache to increase speed. See here
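For instance, a hedged sketch of both recommendations using standard DVC commands; the remote name and URL are placeholders, and any DVC-supported remote type (S3, GCS, SSH, etc.) works:

```bash
# add a default remote for dvc push/pull (placeholder URL)
dvc remote add -d storage s3://<bucket>/<path>

# link files from the cache instead of copying them
dvc config cache.type hardlink,symlink
dvc checkout --relink   # re-link any outputs already in the workspace
```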
params.yaml contains all of the tunable parameters used to run the pipeline. Modifying these parameters will produce a different result than in the presented paper. Note that DVC tracks input and output states, so by changing a parameter, DVC will know which stages need to be rerun. The parameters are discussed in the context of their pipeline stages in ./docs/compspec/pipeline_components.md. The params file itself has comments indicating what each parameter does.
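For example, after editing a value in params.yaml, standard DVC commands can preview what changed and which stages will be re-run:

```bash
dvc params diff   # show parameter values that differ from the last commit
dvc status        # list pipeline stages whose dependencies have changed
```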
Dask configuration is required for s1.3. This script stands up a Dask cluster of workers to conduct protein alignment. Each worker runs alignment for one taxa pair at a time. In order to conduct this step, .config/dask/jobqueue.yaml must be updated. The pipeline was initially run using SLURM, and that config file is provided; however, names and accounts will necessarily need to be changed for your cluster. If using a distributed scheduler other than SLURM, it must be supported by Dask and the appropriate configuration made. See here.
If the pipeline is erroring out at this stage, it is likely an issue with the cluster configuration. A common issue is workers not being able to find executables. Ensure that the workers have appropriate environment setups. If using SLURM, the existing config file is a working example: note the sourcing of bashrc, environment activation, and environment variable exports in the job preludes/directives.
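A minimal sketch of a SLURM entry in .config/dask/jobqueue.yaml is shown below. The queue, account, resource values, and prologue commands are placeholders, and some key names differ across dask-jobqueue versions (older releases use project and env-extra instead of account and job-script-prologue):

```yaml
jobqueue:
  slurm:
    queue: compute              # SLURM partition (placeholder)
    account: my-allocation      # SLURM account/allocation (placeholder)
    cores: 4                    # cores per worker job
    processes: 1                # worker processes per job
    memory: 16GB                # memory per worker job
    walltime: '04:00:00'
    job-script-prologue:        # commands run before each worker starts
      - source ~/.bashrc
      - conda activate learn2thermDB
      - export FATCAT_EXEC=/path/to/FATCAT
```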
Data Version Control (DVC) is used to track data, parameters, metrics, and execution pipelines.
To use a DVC remote, see the documentation.
DVC-tracked data, metrics, and models are found in ./data, while scripts and parameters can be found in ./pipeline. To execute pipeline steps, run dvc repro <stage-name>, where stages are listed below; details on stages can be found in ./docs/compspec/pipeline_components.md. An example invocation is shown after the list:
- get_raw_data_taxa
- get_raw_data_proteins
- get_proteome_mdata
- parse_proteins
- label_taxa
- get_16s_blast_scores
- label_all_pairs
- protein_alignment
- make_database
- get_hait_pairs
- compare_to_Tm
- run_hait_alignment
- compare_hait_alignment
- get_HMM_profiles
- hmmer_hait
- run_hmmer
- parse_hmmer_result
- compare_hait_hmmer
- sample_data_for_structure
- structure_hait
- structure_l2t
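As an example, reproducing a single stage from the list above (DVC will also run any upstream stages whose outputs are out of date):

```bash
dvc repro protein_alignment   # run one stage plus any stale upstream stages
dvc dag                       # print the stage dependency graph
```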
Dependencies between stages are shown below:
flowchart TD
node1["../../analysis_pipeline/dvc.yaml:chosen_protein_search_space"]
node2["../../analysis_pipeline/dvc.yaml:filtered_protein_search_space"]
node3["../../analysis_pipeline/dvc.yaml:full_protein_search_space"]
node4["../../dvc.yaml:compare_hait_alignment"]
node5["../../dvc.yaml:compare_hait_hmmer"]
node6["../../dvc.yaml:compare_to_Tm"]
node7["../../dvc.yaml:get_16s_blast_scores"]
node8["../../dvc.yaml:get_HMM_profiles"]
node9["../../dvc.yaml:get_hait_pairs"]
node10["../../dvc.yaml:get_proteome_mdata"]
node11["../../dvc.yaml:get_raw_data_proteins"]
node12["../../dvc.yaml:get_raw_data_taxa"]
node13["../../dvc.yaml:hmmer_hait"]
node14["../../dvc.yaml:label_all_pairs"]
node15["../../dvc.yaml:label_taxa"]
node16["../../dvc.yaml:make_database"]
node17["../../dvc.yaml:parse_hmmer_result"]
node18["../../dvc.yaml:parse_proteins"]
node19["../../dvc.yaml:protein_alignment"]
node20["../../dvc.yaml:run_hait_alignment"]
node21["../../dvc.yaml:run_hmmer"]
node22["../../dvc.yaml:sample_data_for_structure"]
node23["../../dvc.yaml:structure_hait"]
node24["../../dvc.yaml:structure_l2t"]
node7-->node2
node7-->node14
node7-->node16
node7-->node19
node8-->node13
node8-->node21
node9-->node13
node9-->node20
node9-->node23
node10-->node18
node11-->node18
node12-->node3
node12-->node7
node12-->node15
node12-->node16
node12-->node18
node14-->node2
node14-->node16
node14-->node19
node15-->node1
node15-->node7
node15-->node16
node16-->node4
node16-->node5
node16-->node6
node16-->node21
node16-->node22
node17-->node5
node18-->node1
node18-->node2
node18-->node3
node18-->node7
node18-->node16
node18-->node19
node19-->node16
node19-->node17
node20-->node4
node21-->node17
node22-->node24
Note that script execution is expected to occur with the top level as the current working directory, and paths are specified with respect to the repo top level.
The entire pipeline can be run in a single command using dvc exp run. Note, however, that many steps are resource intensive: at least 8 cores and 60 GB of RAM are recommended. For s1.3, Dask must be configured; this stage requires a distributed computing cluster.
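A minimal invocation, assuming Dask and the environment variables above are configured:

```bash
dvc exp run    # reproduce the full pipeline as a tracked experiment
dvc exp show   # compare metrics across experiments afterwards
```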
Installable, importable code used in the pipeline is found in learn2therm and should be installed given the above steps in the Environment section.
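A quick sanity check that the package is importable from the active environment (assuming the import name matches the learn2therm directory):

```bash
python -c "import learn2therm; print(learn2therm.__file__)"
```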