GitHub - gifford-lab/kmm-launcher: kmer model launch scripts

A set of EC2 support scripts for the kmer-model (SCM). The launcher takes a BAM file and automatically generates fitted output, which is placed in Amazon's S3 storage system.

##Example

Run these commands from the root of the git repository:

docker pull thashim/kmm-launcher
docker run --rm -w `pwd` -v /cluster:/cluster -i thashim/kmm-launcher /kmm/run.onestrand.r  example/nrf.list /cluster/ec2/auth.txt

##Configuring the SCM

###auth.txt

realm:us-east-1
price:3.0
region:us-east-1d
rsa_key:/cluster/ec2/starcluster.rsa
access_key:REDACTED
secret_key:REDACTED
key_name:starcluster
bucket_name:kmer_rc2
mailaddr:[email protected]

Useful options: price sets the maximum bid price; $3 is reasonable. If it is set too low, your jobs will be killed before they complete.

region sets where jobs are submitted. Check the spot prices of a c3.8xlarge and pick a cheap region.

mailaddr sets the email address that is notified at the end of a job. The emails will probably land in your spam folder at first, so check there.

Optional options: itype sets the instance type. cc2.8xlarge is a valid alternative, but to use it you must also change the AMI.

ami sets the AMI: you will want to use the HVM image (ami-864d84ee) if you use any other instance type, such as cc2.8xlarge or r3.8xlarge.
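If you want to inspect or sanity-check these settings before launching, the key:value layout shown above is easy to read from a script. The following is a minimal sketch, not the launcher's own parser (the launcher is an R script); the function name `parse_auth` and the rule of splitting on the first colon only are assumptions made for illustration.

```python
def parse_auth(text):
    """Parse 'key:value' lines into a dict, splitting on the first colon only."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {line!r}")
        settings[key.strip()] = value.strip()
    return settings


# A subset of the auth.txt fields shown above (credentials omitted).
example = """\
realm:us-east-1
price:3.0
region:us-east-1d
rsa_key:/cluster/ec2/starcluster.rsa
key_name:starcluster
bucket_name:kmer_rc2
"""

auth = parse_auth(example)
max_bid = float(auth["price"])  # price is the maximum bid
print(auth["region"], max_bid)
```

Splitting on the first colon only keeps any value that itself contains a colon intact.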

###*.list

An example is available in the git repository; more are available at /cluster/projects/wordfinder/paper/rlist

In a *.list file, the only valid delimiter between a variable name and its value is a space, not a tab, comma, or anything else (experiment lines separate their entries with commas, as in the example below). This also means you cannot use directories or parameters that contain spaces.

Example:

#bam.prefix /cluster/projects/wordfinder/bams/
#gbase /cluster/projects/wordfinder/data/genome/
#quality 0
#postfix .nrf_rc1
#bucket_name batch_runs
#maxk 8
#k 1000
#resol 4
#read.max 5
#smooth.window 20
#genome mm10

nrf_wt,nrf_round1/sherwood nodox Nrf1 test 1.bwa.mm10.bam,nrf_round1/sherwood plusdox Nrf1 test 1.bwa.mm10.bam,nrf_round1/sherwood sr3 Nrf1 test 1.bwa.mm10.bam,nrf_round1/sherwood sr8 Nrf1 test 1.bwa.mm10.bam

#smooth.window 10
#quality 20
dnase_1,mes_dnase/D0_175-400_130801.bwa.mm10.bam,mes_dnase/D0_50-100_130801.bwa.mm10.bam

Nearly all options can be overridden in a .list file.

The general format of a .list file is

#variable_name1 value
#variable_name2 value
[...]

experiment_name,bam_1.bam,bam_2.bam [..]

#variable_name1 value
[...]
experiment_name_2,bam_1 [...]

The launcher parses the file from top to bottom, setting each variable_name to value. Whenever it encounters a line without a # character, it launches an SCM job, treating the first entry as the experiment name and any following entries as BAM files.

Later variable assignment lines starting with # override earlier ones. In the example above, nrf_wt launches with a quality parameter of 0, but dnase_1 launches with a quality of 20 because of the later override line.
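The top-to-bottom override behaviour described above can be sketched in a few lines of Python. This is an illustration of the described semantics only, not the launcher's actual code (which is written in R); `parse_list` is a hypothetical name, the BAM names in the sample are placeholders, and skipping blank lines is an assumption.

```python
def parse_list(text):
    """Return one (experiment_name, bams, settings) tuple per job line."""
    settings = {}
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("#"):
            # '#name value': a space separates the name from the value.
            name, _, value = line[1:].partition(" ")
            settings[name] = value
        else:
            # 'experiment_name,bam_1.bam,bam_2.bam': first entry names the job.
            experiment, *bams = line.split(",")
            # Snapshot the settings so later overrides do not affect this job.
            jobs.append((experiment, bams, dict(settings)))
    return jobs


example = """\
#quality 0
#smooth.window 20
nrf_wt,a.bam,b.bam

#smooth.window 10
#quality 20
dnase_1,c.bam
"""

jobs = parse_list(example)
for experiment, bams, settings in jobs:
    print(experiment, len(bams), settings["quality"])
```

Each job keeps a copy of the settings in effect when its line was read, which is why nrf_wt sees quality 0 while dnase_1 sees quality 20.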

###Common arguments

bam_prefix: the path to where BAM files are stored (written as bam.prefix in the example above). The launcher looks for bam_prefix+bam_name, where bam_name is a BAM name listed on the experiment line.

gbase: path to where genome files are stored. Do not change this if running within the Gifford lab.

quality: mapper quality cutoff. Use q=0 by default, or q=20 to avoid repeat regions and other hard-to-map regions.

postfix: postfix applied to jobs. Each job goes into an S3 bucket, where jobs are separated into folders named experiment_name+postfix.

bucket_name: S3 bucket name. This should generally be your username or project name to avoid mixing multiple people's jobs.

genome: set to the organism genome. Currently only hg19 and mm10 are supported.
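To make the naming rules above concrete, here is a small sketch of how the input paths and the output folder follow from the settings. It only mirrors the concatenation rules stated above (bam_prefix+bam_name and experiment_name+postfix); `resolve_job` is a hypothetical helper, the dictionary keys follow the spelling used in the example .list file, and the genome check simply encodes the two supported values.

```python
SUPPORTED_GENOMES = {"hg19", "mm10"}


def resolve_job(experiment, bams, settings):
    """Return (bam_paths, bucket, folder) for one experiment line."""
    if settings["genome"] not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported genome: {settings['genome']}")
    # Plain string concatenation, so the prefix needs its trailing slash.
    bam_paths = [settings["bam.prefix"] + bam for bam in bams]
    # Results are grouped in a folder named experiment_name + postfix.
    folder = experiment + settings["postfix"]
    return bam_paths, settings["bucket_name"], folder


settings = {
    "bam.prefix": "/cluster/projects/wordfinder/bams/",
    "postfix": ".nrf_rc1",
    "bucket_name": "batch_runs",
    "genome": "mm10",
}
bams = ["mes_dnase/D0_175-400_130801.bwa.mm10.bam"]
bam_paths, bucket, folder = resolve_job("dnase_1", bams, settings)
print(bam_paths[0])
print(bucket, folder)
```

Because the prefix is concatenated rather than joined as a path, a missing trailing slash would silently produce a wrong file name.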

###Tweakable parameters

maxk: maximum kmer length to consider. 8 is generally good enough and marks the start of diminishing returns.

k: the window size. The model looks within a [-k,+k] region around each kmer match. Should be a multiple of resol.

read.max: truncate input at read.max to keep giant read spikes from affecting the model. Generally 5-10 is good for experiments in the < 1 billion read range.

resol: the resolution at which parameters are stored. For example, if k=1000 and resol=5, then the model fits 200 parameters, each representing 5 bases. resol must divide k evenly.
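The divisibility requirement is easy to check before submitting a job. The sketch below follows the arithmetic of the example above (k divided by resol); `parameter_bins` is a hypothetical helper, not part of the launcher.

```python
def parameter_bins(k, resol):
    """Return k / resol, refusing values where resol does not divide k."""
    if k % resol != 0:
        raise ValueError(f"resol={resol} does not divide k={k}")
    return k // resol


print(parameter_bins(1000, 5))  # the k=1000, resol=5 case from the text
print(parameter_bins(1000, 4))  # the values used in the example .list file
```

A combination such as k=1000 with resol=3 raises a ValueError instead of silently producing a fractional number of bins.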

smooth.window: smooth the input by this many bases before feeding it into the model. Useful for low-coverage experiments. The default of 10-20 is fine for all but extremely high- or low-coverage experiments.

mbsize: optimizer minibatch size; smaller is faster but less stable. Generally set between 10240000 and 40960000. It is most likely best not to touch this.
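To give a feel for what read.max and smooth.window do to the input, here is a rough sketch on a toy list of per-base read counts. Both functions are assumptions for illustration: the README does not say how the model smooths its input, so a centered moving average is used as a stand-in, and the names `truncate_counts` and `smooth_counts` are hypothetical.

```python
def truncate_counts(counts, read_max):
    """Cap every per-position read count at read_max."""
    return [min(c, read_max) for c in counts]


def smooth_counts(counts, window):
    """Centered moving average (an assumed smoothing method).

    The window is clipped at both ends of the track. With an even
    window, window // 2 positions are taken on each side, so the
    average covers window + 1 positions.
    """
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


counts = [0, 2, 150, 3, 7]
capped = truncate_counts(counts, read_max=5)   # the 150-read spike becomes 5
smoothed = smooth_counts([0, 0, 9, 0, 0], window=3)
print(capped)
print(smoothed)
```

The two steps are shown separately because the order in which the model applies them is not stated here.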
