Microbial GWAS - want to correlate resistance phenotypes with genotypes/SNPs/other mutations #9
With help from Kristina: Scoary outputs a separate .csv file for each antibiotic (phenotypic class) |
So what do we want to know, and how can we answer these questions?
|
For question 1: Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa?
Get list of gene IDs, and search AMRFinder/other databases for these IDs |
Need to get gene IDs of accessory genes:
|
Odds ratio: indicates whether the gene is associated with phenotype 1 (resistant; odds ratio > 1) or phenotype 0 (susceptible; odds ratio < 1) |
Used script from Kristina:
Input:
Output: |
Line 1 in 9dfb027
Input: |
blastp specific command: Lines 11 to 12 in 41c55b7
Lines 14 to 29 in 41c55b7
Then filter in bash |
Bash filtering: Line 33 in 41c55b7
|
From this file, I took the first two columns (qseqid and sseqid) and pasted them into Excel. I used Text to Columns to split sseqid on "_", then deleted everything except the gene ID and species. I then renamed the columns "PanGene", "GeneID", and "Org" |
I later redid this a different way because I wasn't confident the first approach was correct, so I did this:
The first sort orders the BLAST output by query name, then by the 12th column in descending order, then by the 11th column in ascending order. In the default tabular output (-outfmt 6), column 12 is the bit score and column 11 is the e-value, so the best hit for each query comes first. |
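The sort-then-keep-first logic described above can be sketched in Python. This is only an illustration, not the command used in the issue: it assumes the default 12-column tabular BLAST layout, and the function name `best_hits` and the example rows are made up.

```python
import csv
import io


def best_hits(blast_tsv):
    """Return {qseqid: row}, keeping the top-ranked hit for each query.

    Assumes the default 12-column tabular layout, where column 11 is the
    e-value and column 12 is the bit score (an assumption about the input).
    """
    rows = [r for r in csv.reader(io.StringIO(blast_tsv), delimiter="\t") if r]
    # Order by query name, then bit score descending, then e-value ascending.
    rows.sort(key=lambda r: (r[0], -float(r[11]), float(r[10])))
    best = {}
    for row in rows:
        # After sorting, the first row seen for a query is its best hit.
        best.setdefault(row[0], row)
    return best


# Hypothetical example rows (not taken from the real BLAST output).
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ["group_1", "weakHit_OrgB", "80.0", "200", "40", "0", "1", "200", "1", "200", "1e-40", "150"],
        ["group_1", "strongHit_OrgA", "99.0", "286", "2", "0", "1", "286", "1", "286", "1e-150", "520"],
        ["group_2", "onlyHit_OrgC", "95.0", "300", "15", "0", "1", "300", "1", "300", "1e-120", "480"],
    ]
)

hits = best_hits(example)
pan_to_subject = {q: row[1] for q, row in hits.items()}
```

Here `pan_to_subject` maps each pan-genome query ID to the subject ID of its best hit, which is the same two-column table (qseqid, sseqid) that was built by hand in Excel.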
To analyze the Scoary output, I'm using R. I first loaded all of the Scoary output .csv files into one list in R: Line 22 in c1e8c44
I then filtered by empirical p-value with a cutoff of < 0.05 (indicating the gene is significantly associated with the phenotype): Line 54 in c1e8c44
Then I filtered on odds ratio > 1 (indicating the gene is associated with resistance): Line 55 in c1e8c44
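As an illustration of these two filters, here is a minimal Python sketch (the analysis in this issue was done in R, and that code is not reproduced here). The column names `Gene`, `Odds_ratio`, and `Empirical_p` are assumptions about the Scoary .csv header, and the example rows are invented.

```python
import csv
import io


def resistance_genes(scoary_csv, p_cutoff=0.05):
    """Return genes with empirical p < p_cutoff and odds ratio > 1.

    Column names are assumed; adjust them to match the actual Scoary output.
    """
    keep = []
    for row in csv.DictReader(io.StringIO(scoary_csv)):
        significant = float(row["Empirical_p"]) < p_cutoff
        # float() also parses "inf", which can occur when one cell of the
        # 2x2 table is zero.
        resistant = float(row["Odds_ratio"]) > 1
        if significant and resistant:
            keep.append(row["Gene"])
    return keep


# Hypothetical example rows.
example = """Gene,Odds_ratio,Empirical_p
group_1,12.5,0.001
group_2,0.2,0.002
group_3,3.0,0.40
group_4,inf,0.01
"""

selected = resistance_genes(example)
```

In this toy input, `group_2` is significant but associated with susceptibility, and `group_3` points towards resistance but is not significant, so both are dropped.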
|
I created a function that takes an antibiotic as input and writes out a FASTA file of the nucleotide sequences of all genes significantly associated with resistance to it: Lines 148 to 152 in c1e8c44
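A rough Python equivalent of such a function might look like the following. It is a sketch under stated assumptions, not the R function referenced above: the name `fasta_for_antibiotic`, the two lookup dictionaries, and the example data are all hypothetical.

```python
def fasta_for_antibiotic(antibiotic, hits_by_antibiotic, sequences):
    """Build FASTA text for the genes associated with one antibiotic.

    hits_by_antibiotic: {antibiotic name: [gene IDs]} (assumed structure)
    sequences: {gene ID: nucleotide sequence} (assumed structure)
    """
    records = []
    for gene in hits_by_antibiotic.get(antibiotic, []):
        seq = sequences.get(gene)
        if seq is None:
            continue  # skip genes with no sequence available
        records.append(f">{gene}\n{seq}\n")
    return "".join(records)


# Hypothetical example data.
hits_by_antibiotic = {"ampicillin": ["group_1", "group_7", "group_9"]}
sequences = {"group_1": "ATGAAA", "group_7": "ATGCCC"}

fasta_text = fasta_for_antibiotic("ampicillin", hits_by_antibiotic, sequences)
```

Returning the text rather than writing the file directly keeps the function easy to check; it can be saved with something like `open("ampicillin.fasta", "w").write(fasta_text)`.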
|
We found that Scoary returned no significantly associated genes for several of the antibiotics tested, because the sample sizes were too low. We quantified the sample size for each antibiotic:
|
From this, we decided on a minimum cutoff of 100 samples per antibiotic, which leaves us with 19 antibiotics |
Cefalexin and Cephalexin are both included in the antibiotics, but they are the same drug spelled two ways, so I combined them |
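The sample-size cutoff and the spelling merge can be sketched together in Python. This assumes the phenotype data can be read as (sample ID, antibiotic) pairs; the function name, the choice to keep the "cephalexin" spelling, and the example records are illustrative assumptions.

```python
from collections import defaultdict

# Map alternate spellings to one name (which spelling to keep is an assumption).
SPELLING = {"cefalexin": "cephalexin"}


def antibiotics_with_enough_samples(records, min_samples=100):
    """Count distinct samples per antibiotic and apply a minimum cutoff.

    records: iterable of (sample_id, antibiotic) pairs (assumed structure).
    """
    samples = defaultdict(set)
    for sample_id, antibiotic in records:
        name = antibiotic.strip().lower()
        samples[SPELLING.get(name, name)].add(sample_id)
    return {ab: len(ids) for ab, ids in samples.items() if len(ids) >= min_samples}


# Hypothetical records: neither spelling alone reaches 100 samples.
records = (
    [(f"s{i}", "Cefalexin") for i in range(60)]
    + [(f"s{i}", "Cephalexin") for i in range(60, 120)]
    + [(f"s{i}", "Ampicillin") for i in range(40)]
)

kept = antibiotics_with_enough_samples(records)
```

Merging the spellings before applying the cutoff matters: in this toy example neither spelling passes on its own, but the combined antibiotic does.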
I want to find which genes are common to the Scoary and AMRFinder results, so I'm using the AMRFinder output from the Bioprojects. However, 57 samples were missing from that original output. I'm now concatenating the output for those samples with the original AMRFinder output (the 554 samples we did have) to get a final AMRFinder table for comparison with the Scoary outputs |
One sample was left out because it didn't have an accession number. The sample:
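A minimal Python sketch of the concatenate-then-compare step, assuming each AMRFinder output has been read into a list of row dictionaries. The column name `Gene symbol`, the function name, and the example rows are assumptions. The Scoary genes are also assumed to have been converted to comparable gene symbols already.

```python
def common_genes(scoary_genes, *amrfinder_tables, column="Gene symbol"):
    """Return the sorted gene symbols found by both Scoary and AMRFinder.

    amrfinder_tables: one or more lists of row dicts, treated as if they
    were concatenated. The column name is an assumption about the header.
    """
    amr = {row[column] for table in amrfinder_tables for row in table}
    return sorted(set(scoary_genes) & amr)


# Hypothetical example: an original table plus a table for the missing samples.
original = [
    {"Name": "sampleA", "Gene symbol": "blaTEM-1"},
    {"Name": "sampleB", "Gene symbol": "tet(A)"},
]
missing = [{"Name": "sampleC", "Gene symbol": "sul2"}]

shared = common_genes(["sul2", "blaTEM-1", "novelGene"], original, missing)
```

Genes that Scoary reports but AMRFinder does not (`novelGene` here) are the candidates for novel associations, so the set difference is as informative as the intersection.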
The associated links to Bioprojects and accessions: The one I used: I chose this one because it was a MiSeq run, not PacBio |
I manually added a column to the .csv file with the Assembly (GCA_007012305.1) so that I could match it to the other file I already had in R |
Need to convert the IDs that Panaroo assigns into more useful gene IDs that can be looked up in a reference database |
For getting gene identifiers:
|