Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNP calling #11

Open
hollygene opened this issue May 3, 2021 · 15 comments
Open

SNP calling #11

hollygene opened this issue May 3, 2021 · 15 comments
Projects

Comments

@hollygene
Copy link
Owner

Using GATK Best Practices

@hollygene hollygene created this issue from a note in E_coli_AMR (In progress) May 3, 2021
@hollygene
Copy link
Owner Author

hollygene commented May 3, 2021

Step 1: Produce file of accession numbers from all of the sequences we have phenotypic data for

awk 'NR==FNR {end[$1]; next} ($2 in end)' phenoDataSamples.txt allSamples.txt > phenoDataSamplesAcc.txt

@hollygene
Copy link
Owner Author

hollygene commented May 3, 2021

Step 2: Download the sequences from NCBI

Using:

cd /workdir/hcm59/Ecoli/SNPs/GATK_SNP_calling
fastq-dump --split-3 $1
# to run:
# cat /workdir/hcm59/Ecoli/phenoDataSamplesAcc_short.txt| xargs -n 1 bash /workdir/hcm59/CornellPostdoc/get_SRR_data.sh

@hollygene
Copy link
Owner Author

Step 3: Create unmapped bams from fastqs

Using:

# create a uBAM file
#######################################################################################
for file in ${raw_data}/*_1.fastq
do
FBASE=$(basename $file _1.fastq)
BASE=${FBASE%_1.fastq}
java -jar /programs/picard-tools-2.19.2/picard.jar FastqToSam \
FASTQ=${raw_data}/${BASE}_1.fastq \
FASTQ2=${raw_data}/${BASE}_1.fastq \
OUTPUT=${unmapped_bams}/${BASE}_fastqtosam.bam \
READ_GROUP_NAME=${BASE} \
SAMPLE_NAME=${BASE}
done

@hollygene
Copy link
Owner Author

hollygene commented May 4, 2021

Step 4: Mark Illumina adapters

# mkdir ${unmapped_bams}/TMP
#
# for file in ${unmapped_bams}/*_fastqtosam.bam
#
# do
#
# FBASE=$(basename $file _fastqtosam.bam)
# BASE=${FBASE%_fastqtosam.bam}
#
# java -jar /programs/picard-tools-2.19.2/picard.jar MarkIlluminaAdapters \
# I=${unmapped_bams}/${BASE}_fastqtosam.bam \
# O=${unmapped_bams}/${BASE}_markilluminaadapters.bam \
# M=${unmapped_bams}/${BASE}_markilluminaadapters_metrics.txt \
# TMP_DIR=${unmapped_bams}/TMP \
# USE_JDK_DEFLATER=true \
# USE_JDK_INFLATER=true
#
# done

@hollygene
Copy link
Owner Author

Step 5: Validate Sam File

# for file in ${unmapped_bams}/*_markilluminaadapters.bam
#
# do
#
# FBASE=$(basename $file _markilluminaadapters.bam)
# BASE=${FBASE%_markilluminaadapters.bam}
#
# java -jar /programs/picard-tools-2.19.2/picard.jar ValidateSamFile \
# I=${unmapped_bams}/${BASE}_markilluminaadapters.bam \
# MODE=VERBOSE
#
# done

@hollygene
Copy link
Owner Author

Step 6: Convert from Sam to Fastq

# for file in ${unmapped_bams}/*_markilluminaadapters.bam
#
# do
#
# FBASE=$(basename $file _markilluminaadapters.bam)
# BASE=${FBASE%_markilluminaadapters.bam}
#
# java -jar /programs/picard-tools-2.19.2/picard.jar SamToFastq \
# I=${unmapped_bams}/${BASE}_markilluminaadapters.bam \
# FASTQ=${unmapped_bams}/${BASE}_samtofastq_interleaved.fq \
# CLIPPING_ATTRIBUTE=XT \
# CLIPPING_ACTION=2 \
# INTERLEAVE=true \
# NON_PF=true \
# TMP_DIR=${unmapped_bams}/TMP
#
# done

@hollygene
Copy link
Owner Author

hollygene commented May 5, 2021

Need a reference genome for next step - asking Stanhope/lab slack for recommendations on which version/strain to use

@hollygene
Copy link
Owner Author

Stanhope recommends using the PANTHER database reference genome

Escherichia coli | E. coli | ECOLI | EnsemblGenome | Reference Proteome 2020_04

https://www.ebi.ac.uk/reference_proteomes/

the E coli reference:

ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/Bacteria/UP000000625_83333.fasta.gz

@hollygene
Copy link
Owner Author

^ that is actually a proteome so gatk didn't work

Need a GENOME:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/

  • this opens a folder in Finder
  • click through Bacteria to get to UDP 83333 (E coli)
  • choose the "DNA" fasta to download

@hollygene
Copy link
Owner Author

hollygene commented Jun 10, 2021

How to pick the best reference genome?

We are wanting to find SNPs that are associated with particular resistance phenotypes
So ideally the reference would not be resistant to any abx
A true "wild type" genome

However, we could also use a consensus sequence and call SNPs in samples from that
Pros: no need for a possibly very diverged reference sequence
Cons: Our dataset is biased because a lot of them are resistant

  • could maybe use one of the dog isolates that ISN'T resistant?

@hollygene
Copy link
Owner Author

WDL notes
WDL: script that describes the workflow
Cromwell: Java-based job scheduler that can use various backend environments
Run mode & server mode

@hollygene
Copy link
Owner Author

Dockstore
info page
Descriptor file: script in wdl that tells the program what to do, basically
tools: more info on each task + Docker container it is using for that task
test parameters file: file with the input files (can make this manually)
Launch tab: actual commands for running the file

  • Run locally with Dockstore CLI (can run on your local machine command line)

@hollygene
Copy link
Owner Author

Decided to choose a reference genome that is from a canine

Genome chosen:
https://www.ncbi.nlm.nih.gov/assembly/GCA_002310695.1#/def
From: https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/167/ (filtered by host organism + complete genome)
Strain: 1428
1 chromosome, 4 plasmids

AMR Genotypes:
complete: acrF, blaCMY-2, blaEC,mdtM,tet(B)
point: cyA_S352T

@hollygene
Copy link
Owner Author

Creating unmapped bams from fastq files

for file in ${raw_data}/*_1.fastq

do

FBASE=$(basename $file _1.fastq)
BASE=${FBASE%_1.fastq}
java -jar /programs/picard-tools-2.19.2/picard.jar FastqToSam \
    FASTQ=${raw_data}/${BASE}_1.fastq \
    FASTQ2=${raw_data}/${BASE}_2.fastq  \
    OUTPUT=${unmapped_bams}/${BASE}_fastqtosam.bam \
    READ_GROUP_NAME=${BASE} \
    SAMPLE_NAME=${BASE}

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
E_coli_AMR
  
In progress
Development

No branches or pull requests

1 participant