RCC-Utilities

Useful scripts and examples for The University of Chicago's RCC resource (https://rcc.uchicago.edu).

Files and Directories

.bashrc file
- A starter .bashrc file to put in your shell so that you don't accidentally rm *
array_job/
- Run SLURM array jobs against the refseq database
fragment_array_job
- A single Python script to fragment, run array job, and collate a BLASTp job
mpcs56420.yml
- Conda .yml file to install all the modules necessary for the course (and then some)
- Create environment: conda env create --file mpcs56420.yml
- Conda Cheat Sheet
sample_data
- Fasta files for examples
single_node_job
- Run blast on a single node
multiprocessing/
- Python multiprocessing job

Working on the RCC

Logging In

You need to have 2-factor authentication to sign in.

ssh [email protected]

# Useful Commands 
accounts balance

# List files by size
du -mahd 1

Locations

/home - You user home directory
/scratch/midway2/ - Large space for temporary storage
/project2/mpcs56420 - Class project directory that we all have permissions for

Sequence Databases (for reference)

/project2/mpcs56420/db/pdbaa
/project2/mpcs56420/db/swiss_prot
/project2/mpcs56420/db/nr

Setup Ananconda

Warning, the Anaconda on RCC is outdated, but still works.

module load python/anaconda-2019.03    

# May have to add this to your .bashrc
. /software/Anaconda3-5.1.0-el6-x86_64/etc/profile.d/conda.sh

# Create (if needed)
conda create --name mpcs56420

# Activate the env
conda activate mpcs56420

Setup BLAST

Install BLAST (if needed)

conda install -c bioconda blast

Create PDBaa BLAST database (small database)

# Download the database
wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/pdbaa.gz
gunzip pdbaa.gz
# Make the database
makeblastdb -in pdbaa -input_type fasta -dbtype prot -out pdbaa

Run BLAST job on the Login Node

Use protein1.fasta as the query on the login node. Only do this for testing. Your account will be suspended if you do too much work on the login node.

blastp -query protein1.fasta -db pdbaa -out test.out

Run BLAST job on Node as Interactive Job

Start an interactive session.

sinteractive -A mpcs56420

Run the BLAST job.

blastp -query protein1.fasta -db pdbaa -out test.out

While you are on a worker node, test that you can read/write to /scratch and /project2/mpcs56420. Work you do on nodes should be using these directories.

touch  /scratch/midway2/cnetid/test.txt
touch /project2/mpcs56420/test.txt

Creating NR and Refseq BLAST db (large database)

Download from NCBI.

# NR database
wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

# Refseq Database (https://www.biostars.org/p/130274/)
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.*.protein.faa.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein.*.protein.faa.gz

Unzip and renumber sequentially

gunzip *gz
ls -v | cat -n | while read n f; do mv -n "$f" "refseq.$n"; done 
ls refseq.* | awk '{print "makeblastdb -in "$1" -input_type fasta -dbtype prot -out "$1}' | sh

Create a blast compatible db

# makeblastdb -in pdbaa -input_type fasta -dbtype prot -out blast_pdb

ls nr.* | awk '{print "makeblastdb -in "$1" -input_type fasta -dbtype prot -out "$1}' | sh

Run a blast job

blastp -query protein1.fasta -db /project2/abinkowski/db/blast_pdb -out test3

Split a FASTA database

Useful command to split up any FASTA format database into multiple files.

pyfasta split -n 6 swissprot

Sample Commands

# sinteractive
sinteractive -A mpcs56420
DATABASE="/project2/mpcs56420/db/pdbaa"
OUTFILE="/scratch/midway2/abinkowski/out.txt"
blastp -query spike.fasta -db $DATABASE -out $OUTFILE  -num_threads 1

# Single Node Sleep
sbatch sleep_test.sbatch 

# Single Node BLASTP 
cd RCC-Utilities/single_node_job
sbatch single-node-blastp-simple.sbatch 
more out.txt

# Single Node with 
sbatch single-node-blastp.sbatch testjob

# Multiprocessing
sbatch multi.py

MPI Blast (Deprecated)

Check out this repositories wiki for instructions on running the scripts.

Git Tips and Tricks

git config --global user.name "My Name @RCC" git config --global user.email [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
archive		archive
array_job		array_job
blast		blast
dependency_workflows		dependency_workflows
fragment_array_job		fragment_array_job
hello_world		hello_world
multiprocessing		multiprocessing
sample_data		sample_data
single_node_job		single_node_job
.DS_Store		.DS_Store
.bashrc		.bashrc
.gitignore		.gitignore
.ncbirc		.ncbirc
README.md		README.md
mpcs56430.yml		mpcs56430.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RCC-Utilities

Files and Directories

Working on the RCC

Logging In

Locations

Setup Ananconda

Setup BLAST

Create PDBaa BLAST database (small database)

Run BLAST job on the Login Node

Run BLAST job on Node as Interactive Job

Creating NR and Refseq BLAST db (large database)

Split a FASTA database

Sample Commands

MPI Blast (Deprecated)

Git Tips and Tricks

About

Releases

Packages

Languages

uchicago-bio/RCC-Utilities

Folders and files

Latest commit

History

Repository files navigation

RCC-Utilities

Files and Directories

Working on the RCC

Logging In

Locations

Setup Ananconda

Setup BLAST

Create PDBaa BLAST database (small database)

Run BLAST job on the Login Node

Run BLAST job on Node as Interactive Job

Creating NR and Refseq BLAST db (large database)

Split a FASTA database

Sample Commands

MPI Blast (Deprecated)

Git Tips and Tricks

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages