
Minerva Docs


MINERVA BASICS

TUTORIAL

Minerva Quickstart

Slides:

  • (please upload slides to GitHub)

LOGIN:

  • Default: ssh <username>@minerva.hpc.mssm.edu
  • To specify a particular login node: ssh <username>@minerva12.hpc.mssm.edu
  • Offsite: first connect to the VPN, then log in to Minerva with ssh <username>@minerva.hpc.mssm.edu

COPY FILES

NOTE: Use the "data4" partition to transfer data (it has much faster transfer speeds). Alternatively, you can use the web-based file system interface via Globus Online.

Minerva to Local: from your LOCAL terminal, run:

```
scp -r <username>@minerva.hpc.mssm.edu:/sc/path/to/file/or/folder path/to/destination/
```

Local to Minerva: from your LOCAL terminal, run:

```
scp -r PD_scRNAseq/Results/scRNAseq_results.RData <username>@data4.hpc.mssm.edu:/sc/orga/projects/pd-omics/brian/PD_scRNAseq/Results/
```

USE GIT

Make sure you're on a LOGIN node (NOT a requested or interactive node, which have no internet connection):

```
ml git
git clone https://github.com/RajLabMSSM/PD_scRNAseq.git
```

Store your username/password so you don't have to type them in each time:

```
git config credential.helper store
git pull    # enter username/password this time
git pull    # no prompt from now on
```

If you want to automatically accept all changes to the repo that others have made, run this:

```
git merge --strategy-option theirs
```

X11 VIRTUAL MACHINE

Do you find it cumbersome using the terminal all the time? Looking for something closer to how you interact with apps on your desktop (e.g. RStudio)? Interact with Minerva using a virtual machine.

Open a new Terminal in X11 (Cmd+N, or Applications => Terminal in the top bar).

X11 on Chimera (interactive session):

```
ssh -XY <username>@chimera.hpc.mssm.edu
bsub -q interactive -n 4 -W 12:00 -R "rusage[mem=32000]" -P acc_ad-omics -XF -Ip /bin/bash
```

Combine the above into an alias to be lazy:

```
alias rstudio_chimera='ssh -XY <username>@minerva13.hpc.mssm.edu -t bsub -q interactive -n 4 -W 12:00 -R "rusage[mem=32000]" -P acc_ad-omics -XF -Ip /bin/bash'
```

RStudio interface (using X11 from chimera or minerva13):

```
module load rstudio    # or: ml rstudio
rstudio
```

JOBS

Interactive session:

Do NOT use the -J flag for an interactive node (it is not accepted).

Example: request 1 core for 24 hours, with 60GB of memory:

```
bsub -P acc_ad-omics -q premium -n 1 -W 24:00 -R span[hosts=1] -R rusage[mem=60000] -Ip /bin/bash
```

Job Commands

Submit: bsub < run.lsf (see the run.lsf sketch below)
Check your jobs: bjobs -u <username>
Detailed check: bjobs -l <jobID>
Kill: bkill <jobID>
Modify a running job (sneaky), e.g. set the wall time to two hours: bmod -W 120 <jobID>
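The run.lsf above is your own job script. A minimal sketch, assuming the acc_ad-omics account and the premium queue (adjust everything to your project):

```
#!/bin/bash
#BSUB -J my_job                    # job name (assumed)
#BSUB -P acc_ad-omics              # billing account (assumed)
#BSUB -q premium                   # queue
#BSUB -n 1                         # number of cores
#BSUB -W 1:00                      # wall time (1 hour)
#BSUB -R rusage[mem=3000]          # memory per core, in MB
#BSUB -oo my_job.%J.stdout         # stdout log
#BSUB -eo my_job.%J.stderr         # stderr log

echo "Running on $(hostname)"
```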

Job submission recommendations

The smallest unit of processing and memory on the nodes is 1 core and 3GB of memory. If you need parallel processing or more than 3GB of memory, you can request multiples of this base value. Remember that the memory parameter is per core. Therefore, to get a job with 4 cores and 16GB of memory total, you request 4 cores and 4000MB of memory per core.
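For example, a sketch of that request (the account name is an assumption):

```
# 4 cores x 4000MB per core = 16GB total
bsub -P acc_ad-omics -q premium -n 4 -W 1:00 -R span[hosts=1] -R rusage[mem=4000] -Ip /bin/bash
```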

BODE2 nodes go up to 24 cores and 192GB of memory, so you can in theory ask for 24 cores * 8GB memory per core. If you need a lot of memory, consider using the himem nodes, which go up to 1.5TB of memory. To request a himem node, add the following to your submission request: -R himem

SCREEN SESSIONS

If you’re working on an interactive node and your connection is unstable, you can use SCREEN to keep your session running even if you log out of the cluster.

Start a screen session: screen -S mysession

Detach from your screen session: Ctrl + A, then D

Reattach to your session: screen -r mysession

Close your screen session for good, once reattached: Ctrl + D (or just type exit)

SSH CONFIG FILES

Save literally seconds of time every day with SSH config files.

  1. On your local machine, edit ~/.ssh/config and add this to the file:

```
Host minerva
    Hostname minerva.hpc.mssm.edu
    ForwardAgent yes
    ServerAliveInterval 60
    User <username>
```

  2. Enjoy the sweet timesaving power of replacing "ssh <username>@minerva.hpc.mssm.edu" with "ssh minerva". The added benefit is that the ServerAliveInterval setting keeps your SSH connection going by pinging the server every 60 seconds. You can do this with all the different Minerva login nodes and give them fun and memorable names, as in the sketch below.
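For instance, a sketch of a multi-host config (the short names m12/m13 are made up):

```
Host minerva
    Hostname minerva.hpc.mssm.edu
Host m12
    Hostname minerva12.hpc.mssm.edu
Host m13
    Hostname minerva13.hpc.mssm.edu
Host minerva m12 m13
    ForwardAgent yes
    ServerAliveInterval 60
    User <username>
```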

Downloading files from Minerva becomes easy because you just type:

```
scp minerva:/sc/arion/projects/als-omics/<file> .
```

CONTROLMASTER

Only type your password in once per day!

Create this directory:

```
mkdir ~/.ssh/cm_socket
chmod 700 ~/.ssh/cm_socket
```

Add this to your ~/.ssh/config file (the indenting is important). Create/edit the file with an editor like nano, and add the following:

```
Host *
    ControlPath ~/.ssh/cm_socket/%r@%h:%p
    ControlMaster auto
    ControlPersist 5m
```

Now you only type your password in the first time you log in. As long as you keep that session open, you can open more terminals and log on to Minerva without typing in your password again. Everything is now routed through the original connection (the control master). The ControlPersist parameter means that if you have to close the original terminal for whatever reason, the connection will stay open for 5 minutes before closing completely.

SSHFS

SSHFS: mount Minerva to access files as if they're on your local machine. Shamelessly copied from the Goate lab:

This allows you to access minerva from your work or home Macintosh computer without copying files locally. It will only work if you are an administrator. Otherwise, you will need to change some things and log in from the administrator account.

Installation of SSHFS on a Mac: download OSXFUSE and SSHFS from osxfuse.github.io. Install OSXFUSE (disconnect your MacBook from everything first, and don't forget to allow the KEXT), then install SSHFS 2.5.0. Then run the following commands:

```
sudo mkdir /sc
sudo mkdir /hpc
sudo chown ${USER}:staff /sc
sudo chown ${USER}:staff /hpc
```

Test with this command:

```
sshfs <username>@minerva.hpc.mssm.edu:/sc /sc/ -o allow_other -o noappledouble -o volname=minerva -o follow_symlinks
```

If you want to have these folders appear automatically on your Desktop, you can then create a symlink that will appear when you’ve mounted the volume:

```
ln -s /sc ~/Desktop
```

If you can't seem to access the project folders you're using, you may need to add the equivalent group on Minerva to your local machine. For example, the rajt01a group has the group id (gid) 31473, which you can find by typing "id" on Minerva.

```
sudo dscl . -create /Groups/rajt01a
sudo dscl . -create /Groups/rajt01a gid 31473
sudo dscl . -create /Groups/rajt01a GroupMembership $USER
```

Towfique has assigned pd-omics, ad-omics, and als-omics access to specific groups. So I had to do this to access als-omics:

```
sudo dscl . -create /Groups/als-omics
sudo dscl . -create /Groups/als-omics gid 31783
sudo dscl . -create /Groups/als-omics GroupMembership $USER
```

Again for accessing ad-omics:

```
sudo dscl . -create /Groups/ad-omics
sudo dscl . -create /Groups/ad-omics gid 31498
sudo dscl . -create /Groups/ad-omics GroupMembership $USER
```

Again for accessing pd-omics:

```
sudo dscl . -create /Groups/pd-omics
sudo dscl . -create /Groups/pd-omics gid 31713
sudo dscl . -create /Groups/pd-omics GroupMembership $USER
```

So to set this up each time you log in, put this function in your ~/.bash_profile or ~/.bashrc. This assumes you’ve set an ssh config above with the hostname “minerva”:

```
minervaMount(){
    diskutil unmount force /sc
    diskutil unmount force /hpc
    sshfs minerva:/sc /sc -o allow_other -o noappledouble -o volname=minerva_sc -o follow_symlinks
    sshfs minerva:/hpc /hpc -o allow_other -o noappledouble -o volname=minerva_home -o follow_symlinks
}
```

If you don’t already have your hostname set up, you can instead use the full ssh path. You can also simultaneously create symlinks to specific subfolders so you don’t have to navigate all the way through the directories to get to them. To do this, you will first need to repeat the first steps for these subfolders:

```
sudo mkdir /pd-omics
sudo mkdir /ad-omics
sudo chown ${USER}:staff /pd-omics
sudo chown ${USER}:staff /ad-omics
```

Then, in your ~/.bash_profile:

```
minervaMount(){
    diskutil unmount force /sc
    diskutil unmount force /pd-omics
    diskutil unmount force /ad-omics
    sshfs <username>@minerva.hpc.mssm.edu:/sc /sc -o allow_other -o noappledouble -o volname=minerva_sc -o follow_symlinks
    sshfs <username>@minerva.hpc.mssm.edu:/sc/orga/projects/pd-omics /pd-omics -o allow_other -o noappledouble -o volname=pd-omics -o follow_symlinks
    sshfs <username>@minerva.hpc.mssm.edu:/sc/orga/projects/ad-omics /ad-omics -o allow_other -o noappledouble -o volname=ad-omics -o follow_symlinks
}
```

CREATE ALIASES

Tired of writing out the long path name to your favorite folder? Add an alias (shortcut) to your .bash_profile in Minerva (or really any computer)

  1. Log in to Minerva: ssh <username>@minerva.hpc.mssm.edu

  2. Open your .bash_profile with your favorite text editor: nano ~/.bash_profile

  3. Add the following text:

```
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH
```

  4. Create your aliases:

```
# Customization
alias pd-omics='cd /sc/orga/projects/pd-omics'
alias ad-omics='cd /sc/orga/projects/ad-omics'
alias wong='cd /hpc/users/wongg05'
alias work='cd /sc/orga/work/schilb03'
alias scratch='cd /sc/orga/scratch/schilb03'
alias PD_scRNAseq='cd /sc/orga/projects/pd-omics/brian/PD_scRNAseq'
alias Fine_Mapping='cd /sc/orga/projects/pd-omics/brian/Fine_Mapping'
```

  5. Save and close your .bash_profile

  6. Load your aliases into the environment: source ~/.bash_profile

CONDA ENVIRONMENTS

Instructions to run conda environments in scripts on Chimera. Functions are not exported to subshells by default, so you will need to add the following lines to your shell script:

```
ml anaconda3                     # load anaconda
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate my_conda_env      # load your environment
```
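Put together, a minimal sketch of an LSF job script that activates a conda environment (the job name, account, and environment name are assumptions):

```
#!/bin/bash
#BSUB -J conda_job               # job name (assumed)
#BSUB -P acc_ad-omics            # billing account (assumed)
#BSUB -q premium
#BSUB -n 1
#BSUB -W 1:00
#BSUB -R rusage[mem=3000]

ml anaconda3                     # load anaconda
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate my_conda_env      # hypothetical environment name

python --version                 # runs inside the activated environment
```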

SNAKEMAKE - pipeline like a pro

Currently set up on Minerva using the instructions below.

Edit your ~/.bashrc to load anaconda:

```
ml anaconda3
```

Create a new file ~/.condarc and add this:

```
channels:
  - anaconda
  - defaults
  - bioconda
  - conda-forge

envs_dirs:
  - /sc/orga/work/${USER}/conda/envs

pkgs_dirs:
  - /sc/orga/work/${USER}/conda/pkgs
```

In the terminal, type:

```
conda create -n default python=3 snakemake
conda init bash
```

Running snakemake on Chimera:

  1. Create a cluster.yaml which describes the resources required by your pipeline.

Example:

```
default:
    queue: premium
    cores: 1
    mem: 3750
    time: '60'
    name: $(basename $(pwd)):{rule}:{wildcards}
    output: logs/{rule}:{wildcards}.stdout
    error: logs/{rule}:{wildcards}.stderr
bigRule:
    time: '120:00'
    mem: 3750
    cores: 4
```

So here every rule (the default) will be queued on a node with 1 core and 3750MB of memory (asking for 4000 won't get you on a 4GB node). The rule "bigRule", however, gets 4 cores with ~4GB per core and a longer wall time. A minimal Snakefile matching these entries is sketched below.
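For context, a sketch of a Snakefile with a rule named to match the bigRule entry above (the file names and shell commands are made up):

```
rule all:
    input: "results/big_output.txt"

# No matching entry in cluster.yaml, so this uses the 'default' resources
rule smallRule:
    output: "results/small_output.txt"
    shell: "echo small > {output}"

# Matched by name to the 'bigRule' entry (4 cores, ~4GB per core)
rule bigRule:
    input: "results/small_output.txt"
    output: "results/big_output.txt"
    threads: 4
    shell: "cat {input} > {output}"
```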

  2. Create an executable with this in it:

```
#!/bin/bash
set -e

if [ ! -d "cluster" ]; then
    mkdir cluster
fi

curdir="$(pwd)"
jname="$(basename $curdir)"

bsub=("bsub -K -J $jname:{rule}:{wildcards}"
      "-q {cluster.queue}"
      "-n {cluster.cores}"
      "-R \"span[hosts=1] select[mem>{cluster.mem}] rusage[mem={cluster.mem}]\""
      "-W {cluster.time} -L /bin/bash"
      "-oo cluster/{rule}:{wildcards}.stdout"
      "-eo cluster/{rule}:{wildcards}.stderr < ")

snakemake -u ../cluster.yaml --cluster-sync "${bsub[*]}" \
    --local-cores 4 --max-jobs-per-second 5 "$@" --jobs 100 \
    -s ../Snakefile --configfile config.yaml
```

This creates a folder $PWD/cluster where the log files will go. It then executes your Snakemake workflow with the correct parameters for Chimera. Each rule will be submitted to a different node using the parameters set in cluster.yaml.

ARCHIVING DATA

Basically there are two options here: Backup and Archiving.

Backups protect against file damage or loss through accidental deletion, corruption, or disk crashes. The server maintains one or more backup versions of each file that you back up; older versions are deleted as newer versions are made, and backups can be run incrementally. You can set up a cron job to connect to the backup node: it validates the files between the tape and arion and keeps copies of the updated files. In short, it keeps the most recent version of each file, plus 14 older versions for 90 days, and 2 versions of deleted files for 30 days. So if you delete the original files on Minerva, the corresponding files on tape will also be deleted once they are 30 days old, the next time you connect to TSM.

The backup policy is as follows: if a file exists on Minerva, 14 versions are retained. If a file is deleted from Minerva, 2 versions are retained. If more than one version is on backup, the older versions are deleted after 30 days. If only one version remains on backup, it is kept for 90 days and then deleted.

Archive copies are saved for long-term storage. Archives are useful if you must go back to a particular version of your files, or if you want to delete a file from your workstation and retrieve it later if necessary. You can safely remove your files from Minerva after you have archived them on the TSM; the data stays on tape for 6 years.

You might need to request authorization for your user to be able to use the archive. Send an email to HPC.

Some recommendations before using it:

  • Run in a screen session
  • Reduce the number and size of archived files by compressing them into tar.gz files (see the sketch below)
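For example, a minimal compression sketch (the folder name is made up):

```
# Bundle and compress a project folder into a single tar.gz before archiving
tar -czvf my_project.tar.gz my_project/
```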

First load java:

```
ml java
```

To archive a single file:

```
dsmc archive -se=viallr01 -description="Some description 1" /sc/arion/projects/ad-omics/ricardo/MyArchives/ensembl_list.txt -sub=no
```

To archive a folder non-recursively (remember the trailing /):

```
dsmc archive -se=viallr01 -description="Some description 2" /sc/arion/projects/ad-omics/ricardo/MyArchives/fastqtl/ -sub=no
```

To archive a folder recursively:

```
dsmc archive -se=viallr01 -description="Some description 3" /sc/arion/projects/ad-omics/ricardo/MyArchives/fastqtl/ -sub=yes
```

To query archived files:

```
dsmc q archive -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ -sub=yes
```

To query the filespaces for a user (e.g. viallr01), which can make it easier to find archived files across multiple partitions:

```
dsmc query filespace -se=viallr01
```

Restoring archived files is similar to archiving. To restore to the original path (you are asked whether you want to replace existing files):

```
dsmc retrieve -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ensembl_list.txt
```

To restore to a different path (remember the trailing / if it's a folder):

```
dsmc retrieve -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ensembl_list.txt restored/
```

It is also possible to restore files based on their description:

```
dsmc retrieve -sub=yes -se=viallr01 -description="Some description 2" "*" restored/
```

To delete archived files (-sub=yes deletes all files below a specified directory):

```
dsmc delete archive -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ -sub=yes
```

To create a backup ("incr" stands for "incremental backup"):

```
dsmc incr -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ -sub=yes
```

To query backed-up files:

```
dsmc q backup -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ -sub=yes
```

To restore backed-up files:

```
dsmc restore -sub=yes -se=viallr01 /sc/arion/projects/ad-omics/ricardo/MyArchives/ restored/
```

It's also possible to use the GUI instead of the command line for these functions (not recommended):

```
dsmj -se=viallr01
```

SRA

Downloading sequence data from the Sequence Read Archive (SRA)

Any paper that publishes results from sequencing data has to make the data publicly available in a repository, unless it's human patient data, in which case it gets tricky. However, lots of human data is available. The most popular repository is the Sequence Read Archive (SRA), hosted by the NIH NCBI. Any paper that gives an accession of the form SRR, SRA, or GEO has data stored there.

Step 1. Finding the files

This is a paper whose data I'm interested in. First I Ctrl + F for "GEO":

Then I click the link to the Gene Expression Omnibus and look for a further link to the SRA:

The Run Selector gives accession numbers and metadata for all sequencing files deposited for a particular study.

Here I’m only interested in the RNA-seq files here, not the ATAC-seq. I can then select the files I want and download the RunInfo table (the metadata), and the Accession List (a single column of file IDs).

Now log into Minerva and then ssh into minerva2:

```
ssh minerva2
```

For some reason, this node is the only one that works with sratoolkit.

Copy both the accession list and the run table to a folder on Chimera. Make sure you have enough space to download the files you want.

In my folder on chimera I create two subfolders, fastq/ and tmp/.

The NCBI provide a software toolkit called sratoolkit to aid in downloading of raw data.

The command-line tool fasterq-dump downloads FASTQs from the SRA much more quickly than its predecessor, fastq-dump, partly due to multithreading.

I write a script that loads the sratoolkit and iterates through the accessions file to download each file separately. It asks for 4 cores.

```
module load sratoolkit/2.9.2

for i in $(cat accessions.txt); do
    echo $i
    fasterq-dump $i -t tmp/ -O fastq/ -e 4
done
```

This will then download the FASTQs one at a time. It may complain and throw errors but it should work.
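If you save the loop above as an LSF job script, a sketch of the submission (the account, queue, and file name are assumptions):

```
# Submit the SRA download script on 4 cores for 12 hours
bsub -P acc_ad-omics -q premium -n 4 -W 12:00 -R "rusage[mem=3000]" < download_sra.lsf
```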

The files should then be gzipped to save space. pigz is a parallel version of gzip:

```
ml pigz
pigz --processes 4 *fastq
```

R stuff

KNIT RMARKDOWN FILES

  1. Load necessary modules:

```
module load R pandoc/1.19.2.1
```

2a) Without extra R script (specify file names/subfolder manually)

```
Rscript -e "rmarkdown::render(input='run_seurat.Rmd', output_file='PD_scRNAseq_data-Full_resolution-2.0.html', params = list(subsetGenes='protein_coding', subsetCells=500, resolution=0.6))"
```

2b) With extra R script (automatically name file/subfolder based on parameters)

```
Rscript Run_scRNAseq.R 'protein_coding' F 0.001 30 4
```

Installing R packages from Github

To do this you’ll need to install to a local R library as we can’t install packages system-wide.

install.packages() will find your local library automatically but remotes::install_github() does not.

Hint - remotes is much quicker to install than devtools

To get around this, you have to tell R to try installing to your local R library first, rather than the Minerva library.

Create ~/.Renviron and add:

```
R_LIBS=${R_LIBS_USER}:${R_LIBS}
```

Then in R:

```
install.packages("remotes")
remotes::install_github("user/repoName")   # put the repo name in here!
```

Done

GPU WIZARDRY

Using GPUs can speed up some analyses 100-fold+!

As of March 2021, HPC has 20 GPU nodes: 12 nodes with a total of 48 V100 GPUs and 8 nodes with a total of 32 A100 GPUs (A100s are more powerful, with higher clock speeds and more RAM per GPU).

Each node has 4 GPUs, but the number of CPU cores per node varies between 20 and 48 standard cores, and the amount of RAM ranges from 128 to 384GB. Each GPU has thousands of cores (A100 = 6912 CUDA cores, V100 = 5120).

So it's possible to submit up to 4 GPU jobs per node simultaneously using the flag -R rusage[ngpus_excl_p=1].

GPU queues have a wall time limit of 144 hours (same as premium queues).

Request an interactive GPU session:

```
bsub -q interactive -P acc_ad-omics -W 12:00 -n 1 -R v100 -R "rusage[ngpus_excl_p=1]" -Ip /bin/bash
```

To use X11 forwarding, add -XF to the above command.
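A sketch of the equivalent batch submission script; the job name, queue name, and module are assumptions, so check what's available with bqueues and ml avail:

```
#!/bin/bash
#BSUB -J gpu_job                        # job name (assumed)
#BSUB -P acc_ad-omics                   # billing account
#BSUB -q gpu                            # GPU queue (assumed name)
#BSUB -n 1
#BSUB -W 12:00
#BSUB -R v100                           # request a V100 node
#BSUB -R "rusage[ngpus_excl_p=1]"       # one GPU per job

ml cuda                                 # assumed module name
nvidia-smi                              # show the allocated GPU
```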

Google Cloud

Downloading data from Google Cloud

Big datasets like GTEx/AMP-PD are hosted on Google Cloud, where you have to pay for the data you download. You can get a free trial from Google Cloud.

Set up a project to bill to.

Install the Google Cloud SDK (software development kit) to your home directory or a project directory:

```
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-319.0.0-linux-x86_64.tar.gz
tar -xzf google-cloud-sdk-319.0.0-linux-x86_64.tar.gz
```

Set up the tools and log in to your Google Cloud account through the command line:

```
./bin/gcloud init
```

Alternatively, use the Minerva module:

```
ml google_cloud_sdk
```

Downloading data: make sure to include your project name (cp flags work as usual):

```
gsutil -u ${BILLING_PROJECT_ID} cp gs://bucket-name/folder/file .
```
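A hypothetical end-to-end example (the project ID and bucket path are made up):

```
# Set your billing project once, then browse and download
BILLING_PROJECT_ID=my-gcp-project-123                        # hypothetical project ID
gsutil -u ${BILLING_PROJECT_ID} ls gs://bucket-name/folder/
gsutil -u ${BILLING_PROJECT_ID} cp gs://bucket-name/folder/file .
```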

Rstudio server

Running RStudio on the web through Minerva

At last the day has come! We can now use the web version of RStudio where the backend computations are run on Minerva, allowing us to program interactively with much larger resources than our own laptops.

This assumes you’re using R 4.0.3.

Logging in (run from a Minerva login node):

```
minerva-rstudio-web-r4.sh -n 1 -M 32000 -W 12:00 -P acc_ad-omics -q express -i /sc/arion/projects/ad-omics/rstudio_server/singularity-rstudio_r403_centos7.simg
```

Here you can adjust the number of cores (-n), the memory per core (-M), the length of time for your job (-W), the billing account (-P), and the queue (-q).

Note that the first time you run the script you need to set up a password that will be used to access RStudio from the browser. The username is the same as on Minerva; the password can be different.

Installing packages:

Tip: The image they use doesn’t contain all the packages that the R module does, so you have to install them yourself. You’ll need to create a file called ~/.Renviron with the following lines:

```
R_LIBS_USER=<path/to/libfolder>
R_LIBS=${R_LIBS}:${R_LIBS_USER}
```

You can create an R library folder in your home directory or in one of the project directories (recommended); R will then install packages into that folder. A sketch follows below.
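For example, assuming a made-up project path:

```
# Create a personal R library inside a project directory
mkdir -p /sc/arion/projects/ad-omics/$USER/R/library
# Point R at it (matches the ~/.Renviron lines above)
echo "R_LIBS_USER=/sc/arion/projects/ad-omics/$USER/R/library" >> ~/.Renviron
```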

Within the RStudio web browser, go to the shell terminal and type:

```
export http_proxy=http://172.28.7.1:3128
export https_proxy=http://172.28.7.1:3128
export all_proxy=http://172.28.7.1:3128
export no_proxy=localhost,.hpc.mssm.edu,.chimera.hpc.mssm.edu,172.28.0.0/16
R
```

Then, within the R session inside the shell session (confusing, right?), type:

```
install.packages("name_of_package")
```

Or however you install the package. You may need to restart R (from the RStudio interface: Session > Restart R) to get the installed package working.

More information can be found here: https://labs.icahn.mssm.edu/minervalab/rstudio-web/