Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Microbial GWAS - want to correlate resistance phenotypes with genotypes/SNPs/other mutations #9

Open
hollygene opened this issue Apr 8, 2021 · 22 comments
Projects

Comments

@hollygene
Copy link
Owner

No description provided.

@hollygene hollygene created this issue from a note in E_coli_AMR (In progress) Apr 8, 2021
@hollygene
Copy link
Owner Author

hollygene commented Apr 21, 2021

With help from Kristina:
Roary Panaroo was ran on the 601 sequences we have phenotypic data for
Scoary was then run on the Roary Panaroo output + a tree generated by IQTree

Scoary outputs a separate .csv file for each antibiotic (phenotypic class)

@hollygene
Copy link
Owner Author

Screen Shot 2021-04-21 at 4 39 32 PM
Example of what this .csv file looks like

@hollygene
Copy link
Owner Author

hollygene commented Apr 21, 2021

So what do we want to know, and how can we answer these questions?

  1. Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa?
  2. How many genes are significantly associated with antibiotic resistance, per antibiotic class tested?
  3. How many genes are associated with more than one antibiotic/what are they?
  4. What are the functional annotations of the significantly associated genes?
  5. *this will likely need to be combined with treeWAS results - Allele frequency assoc with each resistance gene?

@hollygene
Copy link
Owner Author

hollygene commented Apr 21, 2021

For question 1:

Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa?
Need: scoary output + AMRFinder results + maybe other databases?

- using our AMRFinder results as a "reference," grep the column Non_unique_gene_name and get non-matches

Get list of gene IDs, and search AMRFinder/other databases for these IDs

@hollygene
Copy link
Owner Author

Need to get gene IDs of accessory genes:

  1. find all protein sequences for acc genes, select one representative for each gene
  2. write multi fasta file for the protein sequences
  3. use blast-p to assign gene IDs to each protein seq
  4. use blast2go or PANTHER for functional annotation

@hollygene
Copy link
Owner Author

odds ratio: whether it is correlated with 1 (resistant) or 0 (susceptible)
greater than 1: significantly associated with resistance
less than 1: significantly associated with susceptibility

@hollygene
Copy link
Owner Author

hollygene commented May 3, 2021

  1. get protein sequences for all accessory genes
  2. write multi fasta file for the protein sequences

Used script from Kristina:

Input:

"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_data.csv"
"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_presence_absence_roary.csv"

Output: /Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/dogEcoli_acc_proteins_out.fasta

@hollygene
Copy link
Owner Author

  1. use blast-p to assign gene IDs to each protein sequence

#blastp on server

Input: dogEcoli_acc_proteins_out.fasta (from R script above)
Output: dog_verified_host_prots_tab.out (in tab-delimited format)

@hollygene
Copy link
Owner Author

blastp specific command:

CornellPostdoc/blastp.sh

Lines 11 to 12 in 41c55b7

blastp -outfmt "6 qseqid sseqid sallseqid qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
-query ./dogEcoli_acc_proteins_out.fasta -db swissprot -out ./dog_verified_host_prots_tab_oneSeq.out -num_threads 36 -max_target_seqs 1

CornellPostdoc/blastp.sh

Lines 14 to 29 in 41c55b7

# Outputs:
# qseqid means Query Seq-id
# sseqid means Subject Seq-id
# sallseqid means All subject Seq-id(s), separated by a ';'
# qaccver means Query accesion.version
# saccver means Subject accession.version
# pident means Percentage of identical matches
# length means Alignment length
# mismatch means Number of mismatches
# gapopen means Number of gap openings
# qstart means Start of alignment in query
# qend means End of alignment in query
# sstart means Start of alignment in subject
# send means End of alignment in subject
# evalue means Expect value
# bitscore means Bit score

Then filter in bash

@hollygene
Copy link
Owner Author

hollygene commented May 4, 2021

Bash filtering:
Wanted to get the best hit for each unique gene
sort by field one and field two (numeric, reverse) so that min for each key will be top of the group,
pick the first for each key by the second sort.

sort -k1,1 -k2,2n file | sort -u -k1,1 dog_verified_host_prots_tab_more.out > test.txt

@hollygene
Copy link
Owner Author

From this file, I took the first two columns (qseqid and sseqid) and pasted them into Excel. I used text to columns to separate sseqid by _, then deleted everything except gene ID and species. I then renamed the columns "PanGene" "GeneID" and "Org"

@hollygene
Copy link
Owner Author

I actually redid this a different way because I wasn't confident that the first way I did it was correct

so I did this:

sort -k1,1 -k15,15nr -k14,14n dog_verified_host_prots_tab_more.out > test1.txt
sort -u -k1,1 test1.txt > test.txt

The first sort orders the blast output by query name then by the 12th column in descending order (bit score - I think), then by 11th column ascending (evalue I think).
The second sort picks the first line from each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.

@hollygene
Copy link
Owner Author

To analyze Scoary output, I'm using R

I first loaded in all of the Scoary output .csv files into one list in R

dataFiles <- sapply(Sys.glob("./*.csv"), read.csv, simplify = FALSE, USE.NAMES = TRUE)

I then filtered by Empirical p value (indicating the gene is significantly associated with something) cutoff <0.05

sigGenes <- sapply(dataFiles,FUN = function(x) subset(x, Empirical_p<0.05 ),simplify = FALSE,USE.NAMES = TRUE)

Then I filtered based on Odds ratio > 1 (indicating the gene is associated with resistance)

sigResistAssoc <- sapply(sigGenes,FUN = function(x) subset(x, Odds_ratio>1 ),simplify = FALSE,USE.NAMES = TRUE)

@hollygene
Copy link
Owner Author

I created a function to take an antibiotic as input and spit out a fasta file of all of the nucleotide sequences of the genes that are significantly associated with resistance

get_nt_fasta <- function(antibiotic) {
x <- inner_join(gene_pres_abs, antibiotic,by=c("Gene" = "Gene"))
y <- get_prot_seq("/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_data.csv",x,T)
write_fasta_out(g, paste("/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/",antibiotic,".fasta"))
}

@hollygene
Copy link
Owner Author

We found that several of the antibiotics tested had no significantly associated genes from Scoary and the reason behind this was the sample size was too low.

We quantified the sample sizes for each antibiotic:

<style> </style>
Antibiotic Phenotypic Datapoints # Sig Assoc Genes from Scoary
Oxacillin.INT 1 -
Polymyxin.B.INT 1 -
Amoxicillin.INT 9 0
Penicillin.G.INT 9 0
Oxacillin...2..NaCl.INT 10 0
Penicillin.INT 11 0
Neomycin.INT 16 0
Piperacillin.INT 16 117
Tobramycin.INT 17 34
Nitrofurantoin.INT 25 0
Clindamycin.INT 27 0
Erythromycin.INT 34 0
Ceftiofur.INT 37 278
Cephalexin.INT 37 2
Ticarcillin.Clavulanic.Acid.INT 48 220
Ticarcillin.INT 51 160
Cephalothin.INT 63 76
Cefoxitin.INT 96 216
Pradofloxacin.INT 150 427
Cefovecin.INT 272 680
Cefalexin.INT 277 619
Ceftazidime.INT 323 424
Piperacillin.Tazobactam.INT 346 157
Orbifloxacin.INT 387 649
Doxycycline.INT 416 895
Marbofloxacin.INT 445 781
Cefpodoxime.INT 448 780
Chloramphenicol.INT 452 434
Cefazolin.INT 461 816
Imipenem.INT 474 235
Amikacin 497 312
Enrofloxacin.INT 503 610
Ampicillin.INT 509 894
Tetracycline.INT 527 1117
Amoxicillin.Clavulanic.Acid.INT 531 750
Gentamicin.INT 560 573
Trimethoprim.Sulfamethoxazole.INT 586 704

@hollygene
Copy link
Owner Author

From this, we decided that our cutoff would be 100 samples per antibiotic minimum, that leaves us with 19 antibiotics

@hollygene
Copy link
Owner Author

Cefalexin and Cephalexin are both included in the antibiotics, but actually are the same thing just different spellings, so I combined them

@hollygene
Copy link
Owner Author

Want to find what common genes were found between Scoary and AMRFinder, so using the AMRFinder output from the Bioprojects
#4

But, several samples were missing from the original output of that (57 samples)
So created a list of those accession numbers and pulled them from NCBI
Ran AMRFinder on those

Now concatenating that output with the original AMRFinder output (the 554 samples we did have) to get a final list of AMRFinder output for comparison with Scoary outputs

@hollygene
Copy link
Owner Author

One sample was left off because it didn't have an accession number
This seems to be because it had been through several sequencing rounds

The sample:

6 3 PA-Ryan 2018-07-09 5135313   HGAP v. 4.0   Complete Genome   562 NA NA NA NA 562         4857938   3.8.4 PRJNA324573 COMBINED 2020-06-11.1 PDG000000004.2233 capU=COMPLETE,espX1=COMPLETE,fdeC=COMPLETE,iss=COMPLETE,sslE=HMM,ybtP=COMPLETE,ybtQ=COMPLETE Escherichia coli NA Canis lupus familiaris aac(3)-IId=COMPLETE,aac(6')-Ib-cr5=COMPLETE,aadA22=COMPLETE,aadA2=COMPLETE,aadA5=COMPLETE,blaCTX-M-15=COMPLETE,blaEC=COMPLETE,blaNDM-5=COMPLETE,blaOXA-1=COMPLETE,blaTEM-1=COMPLETE,ble=COMPLETE,catB3=PARTIAL,dfrA12=COMPLETE,dfrA17=COMPLETE,floR=COMPLETE,gyrA_D87N=POINT,gyrA_S83L=POINT,mph(A)=COMPLETE,parC_S80I=POINT,parE_S458A=POINT,sul1=COMPLETE,tet(A)=COMPLETE ECOL-18-VL-LA-PA-Ryan-0026 NA PDT000545912.1 2019-07-16T12:52:12Z USA:PA trach wash environmental/other PDS000046501.12 2 11 SAMN11230749 GCA_007012305.1 aac(3)-IId=COMPLETE,aac(6')-Ib-cr5=COMPLETE,aadA22=COMPLETE,aadA2=COMPLETE,aadA5=COMPLETE,acrF=COMPLETE,blaCTX-M-15=COMPLETE,blaEC=COMPLETE,blaNDM-5=COMPLETE,blaOXA-1=COMPLETE,blaTEM-1=COMPLETE,ble=COMPLETE,catB3=PARTIAL,dfrA12=COMPLETE,dfrA17=COMPLETE,floR=COMPLETE,gyrA_D87N=POINT,gyrA_S83L=POINT,mdtM=COMPLETE,mph(A)=COMPLETE,parC_S80I=POINT,parE_S458A=POINT,sul1=COMPLETE,tet(A)=COMPLETE GCA_007012305.1_ASM701230v1_genomic_prokka

The associated links to Bioprojects and accessions:
https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=11230749

The one I used:
https://www.ncbi.nlm.nih.gov/sra/SRX5557610[accn]

I chose this one because it was a MiSeq run, not PacBio

@hollygene
Copy link
Owner Author

I manually added a column to the .csv file with the Assembly (GCA_007012305.1) so that I could match it to the other file I already had in R

@hollygene
Copy link
Owner Author

Need to convert between IDs that panaroo assigns to more useful Gene IDs that can be used in a reference database
Probably want more than one type of identifier - maybe gene symbol and refseq acc

@hollygene
Copy link
Owner Author

For getting gene identifiers:

  1. use Kristina's script to get a fasta file of all protein sequences from panaroo output https://github.com/hollygene/CornellPostdoc/blob/e242d67cdf70e598e254d8fe1c9a6e88eb8d8d34/panaroo_protein_fasta_out_kristina.R
  2. take this fasta file and throw it into blastp with the following parameters:
    blastp -outfmt "6 qseqid sseqid sacc sgi qaccver pident ssciname length mismatch gapopen qstart qend sstart send evalue bitscore" \ -query /workdir/hcm59/Ecoli/SNPs/dog_verified_host/ecoli_all_proteins_out.fasta -db nr -out ./ecoli_all_prot_sci_Name.out -num_threads 36 -max_target_seqs 5
  3. make sure the output is sorted by best matches for each protein
  4. filter out the first option for each protein: awk -F"\t" '!_[$1]++' ecoli_acc_proteins_blastp_accVer.out > ecoli_acc_proteins_blastp_accVer_unique.out
  5. use this output as a gene key in R

@hollygene hollygene moved this from In progress to Done in E_coli_AMR Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
E_coli_AMR
  
Done
Development

No branches or pull requests

1 participant