Useful scripts and examples for The University of Chicago's RCC resource (https://rcc.uchicago.edu).
.bashrc
file- A starter .bashrc file to put in your shell so that you don't accidentally
rm *
- A starter .bashrc file to put in your shell so that you don't accidentally
- array_job/
- Run SLURM array jobs against the refseq database
- fragment_array_job
- A single Python script to fragment, run array job, and collate a BLASTp job
- mpcs56420.yml
- Conda .yml file to install all the modules necessary for the course (and then some)
- Create environment:
conda env create --file mpcs56420.yml
- Conda Cheat Sheet
- sample_data
- Fasta files for examples
- single_node_job
- Run blast on a single node
- multiprocessing/
- Python multiprocessing job
You need to have 2-factor authentication to sign in.
ssh [email protected]
# Useful Commands
accounts balance
# List files by size
du -mahd 1
- /home - You user home directory
- /scratch/midway2/ - Large space for temporary storage
- /project2/mpcs56420 - Class project directory that we all have permissions for
Sequence Databases (for reference)
- /project2/mpcs56420/db/pdbaa
- /project2/mpcs56420/db/swiss_prot
- /project2/mpcs56420/db/nr
Warning, the Anaconda on RCC is outdated, but still works.
module load python/anaconda-2019.03
# May have to add this to your .bashrc
. /software/Anaconda3-5.1.0-el6-x86_64/etc/profile.d/conda.sh
# Create (if needed)
conda create --name mpcs56420
# Activate the env
conda activate mpcs56420
Install BLAST (if needed)
conda install -c bioconda blast
# Download the database
wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/pdbaa.gz
gunzip pdbaa.gz
# Make the database
makeblastdb -in pdbaa -input_type fasta -dbtype prot -out pdbaa
Use protein1.fasta
as the query on the login node. Only do this for testing. Your account will be suspended if you do too much work on the login node.
blastp -query protein1.fasta -db pdbaa -out test.out
Start an interactive session.
sinteractive -A mpcs56420
Run the BLAST job.
blastp -query protein1.fasta -db pdbaa -out test.out
While you are on a worker node, test that you can read/write to /scratch
and /project2/mpcs56420
. Work you do on nodes should be using these directories.
touch /scratch/midway2/cnetid/test.txt
touch /project2/mpcs56420/test.txt
Download from NCBI.
# NR database
wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
# Refseq Database (https://www.biostars.org/p/130274/)
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.*.protein.faa.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/complete.nonredundant_protein.*.protein.faa.gz
Unzip and renumber sequentially
gunzip *gz
ls -v | cat -n | while read n f; do mv -n "$f" "refseq.$n"; done
ls refseq.* | awk '{print "makeblastdb -in "$1" -input_type fasta -dbtype prot -out "$1}' | sh
Create a blast compatible db
# makeblastdb -in pdbaa -input_type fasta -dbtype prot -out blast_pdb
ls nr.* | awk '{print "makeblastdb -in "$1" -input_type fasta -dbtype prot -out "$1}' | sh
Run a blast job
blastp -query protein1.fasta -db /project2/abinkowski/db/blast_pdb -out test3
Useful command to split up any FASTA format database into multiple files.
pyfasta split -n 6 swissprot
# sinteractive
sinteractive -A mpcs56420
DATABASE="/project2/mpcs56420/db/pdbaa"
OUTFILE="/scratch/midway2/abinkowski/out.txt"
blastp -query spike.fasta -db $DATABASE -out $OUTFILE -num_threads 1
# Single Node Sleep
sbatch sleep_test.sbatch
# Single Node BLASTP
cd RCC-Utilities/single_node_job
sbatch single-node-blastp-simple.sbatch
more out.txt
# Single Node with
sbatch single-node-blastp.sbatch testjob
# Multiprocessing
sbatch multi.py
Check out this repositories wiki for instructions on running the scripts.
git config --global user.name "My Name @RCC" git config --global user.email [email protected]