- Set up Avere vFXT
  - Launch the Avere vFXT using ARM or in the Azure portal
  - Download https://aka.ms/genomicsworkshop/sample_data as genomics_workshop_data.tar
  - The default export mount on the Avere vFXT is /msazure
  - Make a subdirectory in the export point: /msazure/data
  - Expand genomics_workshop_data.tar into /msazure/data so that the data shows up as /msazure/data/genomics_workshop_data (see the sketch after this list)
  - As an alternative to Avere, you can use the NFS server at /shared in the cluster below, but the paths in the Nextflow scripts and in these instructions then have to be modified accordingly.
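A minimal sketch of staging the sample data onto the export, assuming the /msazure export is already NFS-mounted on a Linux VM at /mnt/avere (the mount point is a placeholder; adjust to your setup):

```
# Hypothetical mount point: the /msazure export mounted at /mnt/avere
wget -O genomics_workshop_data.tar https://aka.ms/genomicsworkshop/sample_data
mkdir -p /mnt/avere/data
tar -xf genomics_workshop_data.tar -C /mnt/avere/data
# The data should now appear as /mnt/avere/data/genomics_workshop_data
ls /mnt/avere/data/genomics_workshop_data
```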
- A CycleCloud installation
  - Set up CycleCloud in the Avere vFXT subnet
  - Clone or download the hls-workshop CycleCloud project (https://github.com/jermth/hls-workshop)
  - Upload the hls-workshop project in CycleCloud
  - Import the cluster template: `cyclecloud import_template Genomics-Workshop -f Genomics-Workshop.cluster.txt`
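Taken together, those steps look roughly like this from a machine with the CycleCloud CLI configured (a sketch; `cyclecloud project upload` assumes a default locker has been set up):

```
# Fetch the workshop project
git clone https://github.com/jermth/hls-workshop
cd hls-workshop
# Upload the project to your CycleCloud locker (assumes the CLI is initialized)
cyclecloud project upload
# Import the cluster template
cyclecloud import_template Genomics-Workshop -f Genomics-Workshop.cluster.txt
```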
- Launch the Genomics-Workshop cluster
  - The cluster template sets up an SGE cluster
  - You'll need to change the Avere IP address in the cluster setup; the other settings can be left at their defaults
  - Leave the Avere mount point as /avere -- the example scripts below assume that path
  - This cluster has a couple of cluster-init projects defined that do the following:
    - Install the Docker runtime on each node
    - Pull a couple of BioContainers images for the examples
    - Create an /etc/profile.d script that stages the example Nextflow files and symlinks the data (a sketch of such a script appears after this list)
  - The exercises illustrate the following:
    - How to install packages using Conda. The install lives entirely in the user's home directory and is available throughout the cluster.
    - How to run Docker images.
    - How to do the above through Nextflow.
    - How to use Nextflow on the SGE cluster.
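For reference, the staging script created by cluster-init might look something like this (a sketch only; the file name, paths, and exact contents are assumptions, not the project's actual script):

```
# /etc/profile.d/genomics-workshop.sh (hypothetical name and contents)
# On first login, stage the example Nextflow files and symlink the shared data.
if [ ! -d "$HOME/genomics-workshop" ]; then
    mkdir -p "$HOME/genomics-workshop"
    cp /avere/data/genomics_workshop_data/*.nf "$HOME/genomics-workshop/"
    ln -s /avere/data/genomics_workshop_data/bowtie2_index "$HOME/genomics-workshop/bowtie2_index"
    ln -s /avere/data/genomics_workshop_data/NA12787 "$HOME/genomics-workshop/NA12787"
fi
```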
- Log into the cluster headnode
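For example (the user and address are placeholders; get the actual connection details from the CycleCloud UI, or use the `cyclecloud connect` CLI command):

```
# Placeholder user and address -- substitute your cluster's values
ssh myuser@<headnode-ip>
```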
- On first login, the ~/genomics-workshop/ directory is staged for the user
- That directory has symlinks to the human genome index and the sample FASTQs:
  - bowtie2_index/grch38/
  - NA12787/
- Verify the Avere mount there:

```
ls /avere/data/genomics_workshop_data
```
- The ~/genomics-workshop/ directory also carries two example Nextflow files.
- Install Conda. This is done for the user account, so run the following as the user:

```
# Get the installer
wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh
# Run the installer
bash Miniconda2-latest-Linux-x86_64.sh -b
# Set up the path and environment (single quotes so $PATH is expanded at login, not now)
echo "source ~/miniconda2/etc/profile.d/conda.sh" >> ~/.bashrc
echo 'export PATH=~/miniconda2/bin:$PATH' >> ~/.bashrc
source ~/miniconda2/etc/profile.d/conda.sh
export PATH=~/miniconda2/bin:$PATH
```
- Install tools:

```
# Set up the conda "channels"
conda config --add channels bioconda
conda config --add channels conda-forge
# Install bioinformatics tools
conda install -y bowtie2 samtools bcftools htop glances nextflow
```
- Run a local bowtie2 job with the Conda tools:

```
cd ~/genomics-workshop
mkdir exercise2
bowtie2 -x bowtie2_index/grch38/grch38_1kgmaj -1 NA12787/SRR622461_1.fastq.gz -2 NA12787/SRR622461_2.fastq.gz -u 100000 -S exercise2/exercise2.sam
# View the alignments
head -n 50 exercise2/exercise2.sam
# Convert the SAM file into BAM and sort it
samtools view -bS exercise2/exercise2.sam | samtools sort > exercise2/exercise2.bam
# Generate a VCF from the BAM
bcftools mpileup --no-reference exercise2/exercise2.bam > exercise2/exercise2.vcf
# View the VCF
head -n 50 exercise2/exercise2.vcf
```
- Run the same set of bowtie2 and samtools commands, but using Docker containers:

```
# View the Docker images on this VM
docker image list

# Run bowtie2 again, now using the container instead of the conda-installed binary.
# BioContainers images use /data as the working directory, which is why the
# relative index and FASTQ paths resolve inside the mounted volume.
cd ~/genomics-workshop
mkdir docker_results
docker run -u $(id -u ${USER}):$(id -g ${USER}) \
    -v ~/genomics-workshop/docker_results:/results \
    -v /avere/data/genomics_workshop_data:/data \
    biocontainers/bowtie2:v2.3.1_cv1 \
    bowtie2 -x bowtie2_index/grch38/grch38_1kgmaj -1 NA12787/SRR622461_1.fastq.gz -2 NA12787/SRR622461_2.fastq.gz -u 100000 -S /results/exercise3.sam

# If you have another window open, run "docker ps" to see the container running
docker run -u $(id -u ${USER}):$(id -g ${USER}) \
    -v ~/genomics-workshop/docker_results:/results \
    -v /avere/data/genomics_workshop_data:/data \
    biocontainers/samtools:v1.7.0_cv4 \
    /bin/bash -c "samtools view -bS /results/exercise3.sam | samtools sort -o /results/exercise3.bam"
```
- Open and look at the bowtie2.local.nf file. This .nf file describes a workflow: the processes it defines run the same set of commands as exercise 2, but now the Nextflow workflow manager chains the outputs together. A sketch of the overall shape is shown below.
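A minimal sketch of what such a workflow looks like (illustrative only, not the workshop's actual file; the process names and the DSL1-style channel wiring are assumptions):

```
// Illustrative DSL1-style sketch -- see ~/genomics-workshop/bowtie2.local.nf for the real file
params.index  = '/avere/data/genomics_workshop_data/bowtie2_index/grch38/grch38_1kgmaj'
params.reads1 = '/avere/data/genomics_workshop_data/NA12787/SRR622461_1.fastq.gz'
params.reads2 = '/avere/data/genomics_workshop_data/NA12787/SRR622461_2.fastq.gz'

// Align the reads; the SAM output feeds the next process through a channel
process align {
    output:
    file 'out.sam' into sam_ch

    """
    bowtie2 -x ${params.index} -1 ${params.reads1} -2 ${params.reads2} -u 100000 -S out.sam
    """
}

// Convert to sorted BAM
process sortBam {
    input:
    file sam from sam_ch

    output:
    file 'out.bam' into bam_ch

    """
    samtools view -bS ${sam} | samtools sort -o out.bam
    """
}

// Call variants from the sorted BAM
process callVariants {
    input:
    file bam from bam_ch

    """
    bcftools mpileup --no-reference ${bam} > out.vcf
    """
}
```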
- Run the same set of applications locally using Nextflow:

```
nextflow run bowtie2.local.nf
```
- Nextflow works with Docker containers as well. Look at the differences between bowtie2.container.nf and bowtie2.local.nf, then run:

```
nextflow run bowtie2.container.nf
```
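The essential difference is typically small: each process declares the image it should run in. A sketch (the `container` directive is standard Nextflow; the exact contents of bowtie2.container.nf may differ):

```
// Sketch of the key difference in bowtie2.container.nf: a container directive
// per process tells Nextflow which image to run the task in.
process align {
    container 'biocontainers/bowtie2:v2.3.1_cv1'

    """
    bowtie2 --version
    """
}
```

For the directive to take effect, Docker must be enabled, either with `docker.enabled = true` in the configuration or with the `-with-docker` command-line option.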
- Finally, use Nextflow with the SGE scheduler by specifying the execution engine in the config file, as sketched below.
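A sketch of what nextflow.sge.config might contain (`process.executor` and `docker.enabled` are standard Nextflow settings; the workshop's actual file may set additional options):

```
// nextflow.sge.config (sketch)
process {
    executor = 'sge'   // submit each task through qsub instead of running locally
    // queue = 'all.q' // optional: target a specific SGE queue
}
docker.enabled = true  // keep running the tasks inside their containers
```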
- Using the same .nf workflow file, the Docker tasks are now submitted to the SGE compute nodes:

```
nextflow -c nextflow.sge.config run bowtie2.container.nf
```
- Verify that the job is in the queue with the `qstat -f` command
- Notice also that the autoscaler kicks in and scales up a compute node.