Microbial GWAS - want to correlate resistance phenotypes with genotypes/SNPs/other mutations #9

hollygene · 2021-04-08T15:24:37Z

No description provided.

hollygene · 2021-04-21T20:39:15Z

With help from Kristina:
~~Roary~~ Panaroo was run on the 601 sequences we have phenotypic data for
Scoary was then run on the ~~Roary~~ Panaroo output + a tree generated by IQTree

Scoary outputs a separate .csv file for each antibiotic (phenotypic class)

hollygene · 2021-04-21T20:39:52Z

Example of what this .csv file looks like

hollygene · 2021-04-21T20:44:23Z

So what do we want to know, and how can we answer these questions?

Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa?
How many genes are significantly associated with antibiotic resistance, per antibiotic class tested?
How many genes are associated with more than one antibiotic/what are they?
What are the functional annotations of the significantly associated genes?
*this will likely need to be combined with treeWAS results - Allele frequency assoc with each resistance gene?
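
For the "more than one antibiotic" question, one possible approach is sketched below. It assumes the significant genes for each antibiotic have first been written to one plain-text file per antibiotic, one gene name per line with no spaces in the names. The function name `shared_genes` and the file names in the usage comment are hypothetical.

```shell
# List genes that are significant for more than one antibiotic.
# Each argument is a file of gene names (one per line) for a single antibiotic.
shared_genes() {
  for f in "$@"; do
    sort -u "$f"    # de-duplicate within a file so each antibiotic counts once
  done | sort | uniq -c | awk '$1 > 1 { print $2 "\t" $1 }'
}

# Hypothetical usage: prints "gene<TAB>number of antibiotics"
# shared_genes ampicillin_genes.txt tetracycline_genes.txt gentamicin_genes.txt
```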

hollygene · 2021-04-21T20:45:54Z

For question 1:

Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa?
Need: scoary output + AMRFinder results + maybe other databases?

~~- using our AMRFinder results as a "reference," grep the column Non_unique_gene_name and get non-matches~~

Get list of gene IDs, and search AMRFinder/other databases for these IDs

hollygene · 2021-04-28T19:12:09Z

Need to get gene IDs of accessory genes:

find all protein sequences for acc genes, select one representative for each gene
write multi fasta file for the protein sequences
use blast-p to assign gene IDs to each protein seq
use blast2go or PANTHER for functional annotation

hollygene · 2021-04-28T19:14:22Z

odds ratio: whether gene presence is correlated with 1 (resistant) or 0 (susceptible)
greater than 1: associated with resistance
less than 1: associated with susceptibility
(significance is judged from the p-value, not from the odds ratio itself)
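
As a worked example of how to read the odds ratio, here is a minimal sketch that computes it from a 2x2 table of gene presence against phenotype. The counts and the function name `odds_ratio` are made up for illustration and are not from our dataset; Scoary reports this value directly in its Odds_ratio column.

```shell
# Hypothetical 2x2 counts for one gene:
#                 resistant  susceptible
# gene present        30          10
# gene absent         20          40
odds_ratio() {
  # args: present_resistant present_susceptible absent_resistant absent_susceptible
  # (no guard against zero counts in this sketch)
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" \
    'BEGIN { printf "%.2f\n", (a * d) / (b * c) }'
}

odds_ratio 30 10 20 40   # prints 6.00: presence goes with resistance
```

Swapping the rows or the columns inverts the ratio, so the same gene read against susceptibility would give a value below 1.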

hollygene · 2021-05-03T20:46:13Z

get protein sequences for all accessory genes
write multi fasta file for the protein sequences

Used script from Kristina:

CornellPostdoc/panaroo_protein_fasta_out_kristina.R

Line 1 in 9dfb027

# file usage in R:

Input:

"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_data.csv"
"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_presence_absence_roary.csv"

Output: /Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/dogEcoli_acc_proteins_out.fasta

hollygene · 2021-05-03T20:49:23Z

use blast-p to assign gene IDs to each protein sequence

CornellPostdoc/blastp.sh

Line 1 in 9dfb027

#blastp on server

Input: dogEcoli_acc_proteins_out.fasta (from R script above)
Output: dog_verified_host_prots_tab.out (in tab-delimited format)

hollygene · 2021-05-04T21:11:10Z

blastp specific command:

CornellPostdoc/blastp.sh

Lines 11 to 12 in 41c55b7

    
           blastp -outfmt "6 qseqid sseqid sallseqid qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
           -query ./dogEcoli_acc_proteins_out.fasta -db swissprot -out ./dog_verified_host_prots_tab_oneSeq.out -num_threads 36 -max_target_seqs 1

CornellPostdoc/blastp.sh

Lines 14 to 29 in 41c55b7

    
           # Outputs:
           # qseqid means Query Seq-id
           # sseqid means Subject Seq-id
           # sallseqid means All subject Seq-id(s), separated by a ';'
           # qaccver means Query accession.version
           # saccver means Subject accession.version
           # pident means Percentage of identical matches
           # length means Alignment length
           # mismatch means Number of mismatches
           # gapopen means Number of gap openings
           # qstart means Start of alignment in query
           # qend means End of alignment in query
           # sstart means Start of alignment in subject
           # send means End of alignment in subject
           # evalue means Expect value
           # bitscore means Bit score

Then filter in bash

hollygene · 2021-05-04T21:11:24Z

Bash filtering:
The goal was to get the best hit for each unique gene.
Sort by field one, then by field two (numeric), so that the minimum for each key is at the top of its group;
the second sort then picks the first line for each key.

CornellPostdoc/blastp.sh

Line 33 in 41c55b7

    
           sort -k1,1 -k2,2n file | sort -u -k1,1 dog_verified_host_prots_tab_more.out > test.txt

hollygene · 2021-05-04T21:13:02Z

From this file, I took the first two columns (qseqid and sseqid) and pasted them into Excel. I used text to columns to separate sseqid by _, then deleted everything except gene ID and species. I then renamed the columns "PanGene" "GeneID" and "Org"
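
The Excel step could also be scripted. Below is a minimal sketch, assuming the swissprot sseqid values have the form `sp|ACCESSION|GENE_ORG` (for example `sp|P0A7G6|RECA_ECOLI`) and that the input is the tab-delimited best-hit file. The function name `make_gene_key` is hypothetical, and hits whose sseqid does not follow this pattern would come out with empty GeneID/Org fields.

```shell
# Build a PanGene / GeneID / Org key from the first two BLAST columns.
make_gene_key() {
  awk -F'\t' 'BEGIN { OFS = "\t"; print "PanGene", "GeneID", "Org" }
  {
    n = split($2, parts, "[|]")    # e.g. sp, accession, entry name
    split(parts[n], entry, "_")    # entry name -> gene ID, organism code
    print $1, entry[1], entry[2]
  }' "$1"
}

# Usage: make_gene_key test.txt > gene_key.tsv
```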

hollygene · 2021-05-10T20:52:02Z

I actually redid this a different way because I wasn't confident that the first way I did it was correct

so I did this:

sort -k1,1 -k15,15nr -k14,14n dog_verified_host_prots_tab_more.out > test1.txt
sort -u -k1,1 test1.txt > test.txt

The first sort orders the blast output by query name, then by the 15th column in descending order (bit score, assuming the same 15-field output format as the blastp command above), then by the 14th column ascending (evalue in that format).
The second sort picks the first line from each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.
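
A slightly more defensive variant of the same idea is sketched here. It assumes the 15-column output format from blastp.sh (column 14 = evalue, column 15 = bitscore), sets the tab delimiter explicitly, and uses `-g` (general numeric, a GNU/BSD sort extension) because `-n` does not parse scientific notation such as 1e-50. The function name `best_hits` is hypothetical.

```shell
# Keep the best hit per query from tabular BLAST output.
best_hits() {
  # Sort by query, then bit score (descending), then evalue (ascending).
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k15,15gr -k14,14g "$1" |
    awk -F'\t' '!seen[$1]++'    # print only the first line seen for each query
}

# Usage: best_hits dog_verified_host_prots_tab_more.out > test.txt
```

Using awk for the second step keeps the first line of each group explicitly, rather than relying on how `sort -u` resolves lines with equal keys.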

hollygene · 2021-05-10T20:57:05Z

To analyze Scoary output, I'm using R

I first loaded in all of the Scoary output .csv files into one list in R

CornellPostdoc/scoaryViz.Rmd

Line 22 in c1e8c44

    
           dataFiles <- sapply(Sys.glob("./*.csv"), read.csv, simplify = FALSE, USE.NAMES = TRUE)

I then filtered on the empirical p-value with a cutoff of < 0.05 (indicating the gene is significantly associated with the phenotype)

CornellPostdoc/scoaryViz.Rmd

Line 54 in c1e8c44

    
           sigGenes <- sapply(dataFiles,FUN = function(x) subset(x, Empirical_p<0.05 ),simplify = FALSE,USE.NAMES = TRUE)

Then I filtered based on Odds ratio > 1 (indicating the gene is associated with resistance)

CornellPostdoc/scoaryViz.Rmd

Line 55 in c1e8c44

    
           sigResistAssoc <- sapply(sigGenes,FUN = function(x) subset(x, Odds_ratio>1 ),simplify = FALSE,USE.NAMES = TRUE)

hollygene · 2021-05-10T20:58:06Z

I created a function to take an antibiotic as input and spit out a fasta file of all of the nucleotide sequences of the genes that are significantly associated with resistance

CornellPostdoc/scoaryViz.Rmd

Lines 148 to 152 in c1e8c44

    
           get_nt_fasta <- function(sig_df, antibiotic_name) {
             # sig_df: Scoary rows for one antibiotic; antibiotic_name: label used for the output file
             out_dir <- "/Users/hcm59/Box/Goodman Lab/Projects/bacterial genomics/Ecoli_dog_AMR_results/dog_verified_host/"
             x <- inner_join(gene_pres_abs, sig_df, by = "Gene")
             y <- get_prot_seq(paste0(out_dir, "gene_data.csv"), x, T)
             write_fasta_out(y, paste0(out_dir, antibiotic_name, ".fasta"))
           }

hollygene · 2021-05-11T17:38:36Z

We found that several of the antibiotics tested had no significantly associated genes from Scoary, and the reason was that their sample sizes were too low.

We quantified the sample sizes for each antibiotic:

Antibiotic	Phenotypic Datapoints	# Sig Assoc Genes from Scoary
Oxacillin.INT	1	-
Polymyxin.B.INT	1	-
Amoxicillin.INT	9	0
Penicillin.G.INT	9	0
Oxacillin...2..NaCl.INT	10	0
Penicillin.INT	11	0
Neomycin.INT	16	0
Piperacillin.INT	16	117
Tobramycin.INT	17	34
Nitrofurantoin.INT	25	0
Clindamycin.INT	27	0
Erythromycin.INT	34	0
Ceftiofur.INT	37	278
Cephalexin.INT	37	2
Ticarcillin.Clavulanic.Acid.INT	48	220
Ticarcillin.INT	51	160
Cephalothin.INT	63	76
Cefoxitin.INT	96	216
Pradofloxacin.INT	150	427
Cefovecin.INT	272	680
Cefalexin.INT	277	619
Ceftazidime.INT	323	424
Piperacillin.Tazobactam.INT	346	157
Orbifloxacin.INT	387	649
Doxycycline.INT	416	895
Marbofloxacin.INT	445	781
Cefpodoxime.INT	448	780
Chloramphenicol.INT	452	434
Cefazolin.INT	461	816
Imipenem.INT	474	235
Amikacin	497	312
Enrofloxacin.INT	503	610
Ampicillin.INT	509	894
Tetracycline.INT	527	1117
Amoxicillin.Clavulanic.Acid.INT	531	750
Gentamicin.INT	560	573
Trimethoprim.Sulfamethoxazole.INT	586	704

hollygene · 2021-05-11T17:39:20Z

From this, we decided on a cutoff of at least 100 samples per antibiotic, which leaves us with 19 antibiotics
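
The cutoff can be applied to the table above with a one-line filter. This is a minimal sketch, assuming the table is saved as a tab-delimited file with the header row shown; the function name `filter_by_sample_size` and the file name in the usage comment are hypothetical.

```shell
# Print antibiotics with at least MIN phenotypic datapoints (default 100).
# Input columns: antibiotic, phenotypic datapoints, significant genes.
filter_by_sample_size() {
  awk -F'\t' -v min="${2:-100}" 'NR > 1 && $2 + 0 >= min + 0 { print $1 }' "$1"
}

# Usage: filter_by_sample_size sample_sizes.tsv 100
```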

hollygene · 2021-05-19T17:51:17Z

Cefalexin and Cephalexin both appear in the list of antibiotics, but they are the same drug under different spellings, so I combined them

hollygene · 2021-05-20T13:54:32Z

Want to find which genes were found by both Scoary and AMRFinder, so using the AMRFinder output from the Bioprojects (#4)

But 57 samples were missing from that original output,
so I created a list of their accession numbers, pulled them from NCBI,
and ran AMRFinder on them

Now concatenating that output with the original AMRFinder output (the 554 samples we did have) to get a final list of AMRFinder output for comparison with Scoary outputs
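
A minimal sketch of the concatenation step, assuming both AMRFinder outputs are tab-delimited files with an identical header row. The function name `concat_tsv` and the file names in the usage comment are hypothetical.

```shell
# Concatenate tables that share a header, keeping the header only once.
concat_tsv() {
  head -n 1 "$1"        # header from the first file
  for f in "$@"; do
    tail -n +2 "$f"     # data rows from every file
  done
}

# Usage: concat_tsv amrfinder_original.tsv amrfinder_missing57.tsv > amrfinder_all.tsv
```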

hollygene · 2021-05-20T14:31:33Z

One sample was left off because it didn't have an accession number
This seems to be because it had been through several sequencing rounds

The sample:

6	3	PA-Ryan	2018-07-09	5135313		HGAP v. 4.0		Complete Genome		562	NA	NA	NA	NA	562					4857938		3.8.4	PRJNA324573	COMBINED	2020-06-11.1	PDG000000004.2233	capU=COMPLETE,espX1=COMPLETE,fdeC=COMPLETE,iss=COMPLETE,sslE=HMM,ybtP=COMPLETE,ybtQ=COMPLETE	Escherichia coli	NA	Canis lupus familiaris	aac(3)-IId=COMPLETE,aac(6')-Ib-cr5=COMPLETE,aadA22=COMPLETE,aadA2=COMPLETE,aadA5=COMPLETE,blaCTX-M-15=COMPLETE,blaEC=COMPLETE,blaNDM-5=COMPLETE,blaOXA-1=COMPLETE,blaTEM-1=COMPLETE,ble=COMPLETE,catB3=PARTIAL,dfrA12=COMPLETE,dfrA17=COMPLETE,floR=COMPLETE,gyrA_D87N=POINT,gyrA_S83L=POINT,mph(A)=COMPLETE,parC_S80I=POINT,parE_S458A=POINT,sul1=COMPLETE,tet(A)=COMPLETE	ECOL-18-VL-LA-PA-Ryan-0026	NA	PDT000545912.1	2019-07-16T12:52:12Z	USA:PA	trach wash	environmental/other	PDS000046501.12	2	11	SAMN11230749	GCA_007012305.1	aac(3)-IId=COMPLETE,aac(6')-Ib-cr5=COMPLETE,aadA22=COMPLETE,aadA2=COMPLETE,aadA5=COMPLETE,acrF=COMPLETE,blaCTX-M-15=COMPLETE,blaEC=COMPLETE,blaNDM-5=COMPLETE,blaOXA-1=COMPLETE,blaTEM-1=COMPLETE,ble=COMPLETE,catB3=PARTIAL,dfrA12=COMPLETE,dfrA17=COMPLETE,floR=COMPLETE,gyrA_D87N=POINT,gyrA_S83L=POINT,mdtM=COMPLETE,mph(A)=COMPLETE,parC_S80I=POINT,parE_S458A=POINT,sul1=COMPLETE,tet(A)=COMPLETE	GCA_007012305.1_ASM701230v1_genomic_prokka

The associated links to Bioprojects and accessions:
https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=11230749

The one I used:
https://www.ncbi.nlm.nih.gov/sra/SRX5557610[accn]

I chose this one because it was a MiSeq run, not PacBio

hollygene · 2021-05-20T14:50:56Z

I manually added a column to the .csv file with the Assembly (GCA_007012305.1) so that I could match it to the other file I already had in R

hollygene · 2021-05-25T18:11:45Z

Need to convert between IDs that panaroo assigns to more useful Gene IDs that can be used in a reference database
Probably want more than one type of identifier - maybe gene symbol and refseq acc

hollygene · 2021-06-10T21:00:15Z

For getting gene identifiers:

use Kristina's script to get a fasta file of all protein sequences from panaroo output https://github.com/hollygene/CornellPostdoc/blob/e242d67cdf70e598e254d8fe1c9a6e88eb8d8d34/panaroo_protein_fasta_out_kristina.R
take this fasta file and throw it into blastp with the following parameters:
blastp -outfmt "6 qseqid sseqid sacc sgi qaccver pident ssciname length mismatch gapopen qstart qend sstart send evalue bitscore" -query /workdir/hcm59/Ecoli/SNPs/dog_verified_host/ecoli_all_proteins_out.fasta -db nr -out ./ecoli_all_prot_sci_Name.out -num_threads 36 -max_target_seqs 5
make sure the output is sorted by best matches for each protein
filter out the first option for each protein: awk -F"\t" '!_[$1]++' ecoli_acc_proteins_blastp_accVer.out > ecoli_acc_proteins_blastp_accVer_unique.out
use this output as a gene key in R

hollygene created this issue from a note in E_coli_AMR (In progress) Apr 8, 2021

hollygene moved this from In progress to Done in E_coli_AMR Jul 25, 2023
