SNP calling #11

hollygene · 2021-05-03T20:51:24Z

Using GATK Best Practices

hollygene · 2021-05-03T20:51:56Z

https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/variantGATK.sh

hollygene · 2021-05-03T20:58:54Z

Step 1: Produce file of accession numbers from all of the sequences we have phenotypic data for

Line 11 in 9dfb027

    
           awk 'NR==FNR {end[$1]; next} ($2 in end)' phenoDataSamples.txt  allSamples.txt > phenoDataSamplesAcc.txt
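To see what the two-file `awk` idiom above does, here is a minimal sketch on toy data. The file contents and column layout (sample ID in column 1 of the first file, accession then sample ID in the second) are assumptions for illustration only, as are the made-up accession numbers.

```shell
# Toy inputs; the column layout is an assumption, not the real files.
workdir=$(mktemp -d)
printf 'sampleA\nsampleC\n' > "$workdir/phenoDataSamples.txt"
printf 'SRR0000001 sampleA\nSRR0000002 sampleB\nSRR0000003 sampleC\n' \
    > "$workdir/allSamples.txt"

# While reading the first file (NR==FNR), store column 1 as array keys.
# While reading the second file, print lines whose column 2 is a stored key.
awk 'NR==FNR {keep[$1]; next} ($2 in keep)' \
    "$workdir/phenoDataSamples.txt" "$workdir/allSamples.txt" \
    > "$workdir/phenoDataSamplesAcc.txt"

matched=$(cat "$workdir/phenoDataSamplesAcc.txt")
echo "$matched"
```

Note that the whole matching line is printed, not just the accession. If only the accession column is wanted downstream, the second pattern could print that column alone.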

hollygene · 2021-05-03T21:01:50Z

Step 2: Download the sequences from NCBI

Using:

CornellPostdoc/get_SRR_data.sh

Lines 4 to 9 in 9dfb027

    
    cd /workdir/hcm59/Ecoli/SNPs/GATK_SNP_calling
    fastq-dump --split-3 "$1"
    # to run:
    # cat /workdir/hcm59/Ecoli/phenoDataSamplesAcc_short.txt | xargs -n 1 bash /workdir/hcm59/CornellPostdoc/get_SRR_data.sh
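Before launching the real downloads, it can help to preview what `xargs -n 1` will run. The sketch below is a dry run: it only prints the commands, and the accession list is a hypothetical stand-in for the real file.

```shell
# Hypothetical accession list, one run accession per line.
acc_list=$(mktemp)
printf 'SRR0000001\nSRR0000002\n' > "$acc_list"

# Dry run: xargs -n 1 passes one accession per invocation, and "echo"
# prints the command instead of running it. Dropping "echo" would start
# the actual download (which needs the SRA Toolkit on the PATH).
planned=$(xargs -n 1 echo fastq-dump --split-3 < "$acc_list")
echo "$planned"
```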

hollygene · 2021-05-03T21:03:37Z

Step 3: Create unmapped bams from fastqs

Using:

CornellPostdoc/variantGATK.sh

Lines 12 to 28 in 9dfb027

    
    # create a uBAM file
    #######################################################################################
    for file in "${raw_data}"/*_1.fastq
    do
        # Strip the directory and the _1.fastq suffix to get the sample name
        BASE=$(basename "$file" _1.fastq)
        # FASTQ2 must be the mate file (_2.fastq), not a second copy of _1.fastq
        java -jar /programs/picard-tools-2.19.2/picard.jar FastqToSam \
            FASTQ="${raw_data}/${BASE}_1.fastq" \
            FASTQ2="${raw_data}/${BASE}_2.fastq" \
            OUTPUT="${unmapped_bams}/${BASE}_fastqtosam.bam" \
            READ_GROUP_NAME="${BASE}" \
            SAMPLE_NAME="${BASE}"
    done
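Since FastqToSam needs both mates of each pair, a quick pre-flight check can catch samples where `fastq-dump --split-3` produced only one file. This is a sketch, not part of the original script; the directory and file names are toy placeholders.

```shell
# Hypothetical directory with toy FASTQ names; the files are empty placeholders.
raw_data=$(mktemp -d)
touch "$raw_data/SRR0000001_1.fastq" "$raw_data/SRR0000001_2.fastq" \
      "$raw_data/SRR0000002_1.fastq"

paired=""
unpaired=""
for file in "$raw_data"/*_1.fastq; do
    # basename with a suffix argument strips the directory and "_1.fastq".
    base=$(basename "$file" _1.fastq)
    if [ -e "$raw_data/${base}_2.fastq" ]; then
        paired="$paired $base"
    else
        unpaired="$unpaired $base"
    fi
done
echo "paired:$paired"
echo "missing mate:$unpaired"
```

Samples listed under "missing mate" would need to be re-downloaded or handled as single-end before the loop above is run.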

hollygene · 2021-05-04T21:14:28Z

Step 4: Mark Illumina adapters

CornellPostdoc/variantGATK.sh

Lines 36 to 53 in 41c55b7

    
    # mkdir ${unmapped_bams}/TMP
    #
    # for file in ${unmapped_bams}/*_fastqtosam.bam
    # do
    #     FBASE=$(basename $file _fastqtosam.bam)
    #     BASE=${FBASE%_fastqtosam.bam}
    #
    #     java -jar /programs/picard-tools-2.19.2/picard.jar MarkIlluminaAdapters \
    #         I=${unmapped_bams}/${BASE}_fastqtosam.bam \
    #         O=${unmapped_bams}/${BASE}_markilluminaadapters.bam \
    #         M=${unmapped_bams}/${BASE}_markilluminaadapters_metrics.txt \
    #         TMP_DIR=${unmapped_bams}/TMP \
    #         USE_JDK_DEFLATER=true \
    #         USE_JDK_INFLATER=true
    # done

hollygene · 2021-05-04T21:15:26Z

Step 5: Validate Sam File

CornellPostdoc/variantGATK.sh

Lines 61 to 72 in 41c55b7

    
    # for file in ${unmapped_bams}/*_markilluminaadapters.bam
    # do
    #     FBASE=$(basename $file _markilluminaadapters.bam)
    #     BASE=${FBASE%_markilluminaadapters.bam}
    #
    #     java -jar /programs/picard-tools-2.19.2/picard.jar ValidateSamFile \
    #         I=${unmapped_bams}/${BASE}_markilluminaadapters.bam \
    #         MODE=VERBOSE
    # done

hollygene · 2021-05-04T21:15:56Z

Step 6: Convert from Sam to Fastq

CornellPostdoc/variantGATK.sh

Lines 78 to 94 in 41c55b7

    
    # for file in ${unmapped_bams}/*_markilluminaadapters.bam
    # do
    #     FBASE=$(basename $file _markilluminaadapters.bam)
    #     BASE=${FBASE%_markilluminaadapters.bam}
    #
    #     java -jar /programs/picard-tools-2.19.2/picard.jar SamToFastq \
    #         I=${unmapped_bams}/${BASE}_markilluminaadapters.bam \
    #         FASTQ=${unmapped_bams}/${BASE}_samtofastq_interleaved.fq \
    #         CLIPPING_ATTRIBUTE=XT \
    #         CLIPPING_ACTION=2 \
    #         INTERLEAVE=true \
    #         NON_PF=true \
    #         TMP_DIR=${unmapped_bams}/TMP
    # done

hollygene · 2021-05-05T13:52:00Z

Need a reference genome for next step - asking Stanhope/lab slack for recommendations on which version/strain to use

hollygene · 2021-05-05T15:52:43Z

Stanhope recommends using the PANTHER database reference genome

Escherichia coli | E. coli | ECOLI | EnsemblGenome | Reference Proteome 2020_04

https://www.ebi.ac.uk/reference_proteomes/

the E coli reference:

ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/Bacteria/UP000000625_83333.fasta.gz

hollygene · 2021-05-13T16:29:07Z

^ That file is actually a proteome (protein sequences), so GATK didn't work with it

Need a GENOME:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/

This opens a folder in Finder
Click through Bacteria to get to UP000000625_83333 (E. coli)
Choose the "DNA" fasta to download
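A quick way to catch the proteome/genome mix-up before handing a FASTA to GATK is to look at the residue alphabet. The function below is a heuristic sketch: `fasta_type` is a hypothetical helper name, and the toy sequences are made up. It expects an uncompressed FASTA, so a `.fasta.gz` download would need decompressing first.

```shell
# Classify a FASTA file as protein or nucleotide from its residue letters.
# Heuristic: E, F, I, L, P and Q are amino-acid codes that are not IUPAC
# nucleotide codes, so any of them in a sequence line suggests protein.
fasta_type() {
    if grep -v '^>' "$1" | grep -qi '[EFILPQ]'; then
        echo protein
    else
        echo nucleotide
    fi
}

# Toy files for illustration only.
tmp=$(mktemp -d)
printf '>prot1\nMKRISTTITTTITITTGNGAG\n' > "$tmp/proteome.fasta"
printf '>chr\nACGTNNACGTRY\n' > "$tmp/genome.fasta"

fasta_type "$tmp/proteome.fasta"
fasta_type "$tmp/genome.fasta"
```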

hollygene · 2021-06-10T21:05:31Z

How to pick the best reference genome?

We want to find SNPs that are associated with particular resistance phenotypes
So ideally the reference would not be resistant to any antibiotics
A true "wild type" genome

However, we could also build a consensus sequence from our samples and call SNPs against that
Pros: avoids using a reference that may be highly diverged from our isolates
Cons: Our dataset is biased because a lot of the isolates are resistant

Could maybe use one of the dog isolates that ISN'T resistant?

hollygene · 2021-06-10T21:10:17Z

WDL notes
WDL (Workflow Description Language): script that describes the workflow
Cromwell: Java-based workflow execution engine that can use various backend environments
Cromwell has two modes: run mode & server mode

hollygene · 2021-06-10T21:11:55Z

Dockstore
info page
Descriptor file: script in wdl that tells the program what to do, basically
tools: more info on each task + Docker container it is using for that task
test parameters file: file with the input files (can make this manually)
Launch tab: actual commands for running the file

Run locally with Dockstore CLI (can run on your local machine command line)

hollygene · 2021-06-16T19:39:31Z

Decided to use a reference genome from a canine isolate

Genome chosen:
https://www.ncbi.nlm.nih.gov/assembly/GCA_002310695.1#/def
From: https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/167/ (filtered by host organism + complete genome)
Strain: 1428
1 chromosome, 4 plasmids

AMR Genotypes:
complete: acrF, blaCMY-2, blaEC, mdtM, tet(B)
point: cyaA_S352T
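Since the chosen assembly is described as 1 chromosome plus 4 plasmids, the downloaded reference FASTA should contain 5 records. A minimal sanity check, sketched here on a toy file with made-up record names standing in for the real download:

```shell
# Toy reference with five records; the real assembly FASTA would be used instead.
ref=$(mktemp)
printf '>chromosome\nACGT\n>plasmid1\nAC\n>plasmid2\nGT\n>plasmid3\nAA\n>plasmid4\nCC\n' > "$ref"

# Each FASTA record starts with a ">" header line, so counting those
# lines gives the number of sequences (chromosome + plasmids).
n_records=$(grep -c '^>' "$ref")
echo "$n_records"
```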

hollygene · 2021-06-16T19:40:12Z

Creating unmapped BAMs from FASTQ files

    for file in ${raw_data}/*_1.fastq
    do
        FBASE=$(basename $file _1.fastq)
        BASE=${FBASE%_1.fastq}
        java -jar /programs/picard-tools-2.19.2/picard.jar FastqToSam \
            FASTQ=${raw_data}/${BASE}_1.fastq \
            FASTQ2=${raw_data}/${BASE}_2.fastq \
            OUTPUT=${unmapped_bams}/${BASE}_fastqtosam.bam \
            READ_GROUP_NAME=${BASE} \
            SAMPLE_NAME=${BASE}
    done

hollygene created this issue from a note in E_coli_AMR (In progress) May 3, 2021
