Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run treeWAS #3

Open
hollygene opened this issue Mar 15, 2021 · 13 comments
Open

Run treeWAS #3

hollygene opened this issue Mar 15, 2021 · 13 comments

Comments

@hollygene
Copy link
Owner

No description provided.

@hollygene
Copy link
Owner Author

Maybe need to run Gblocks or GEM mappability on this dataset as there is that one odd sequence out in the tree

@hollygene
Copy link
Owner Author

Do we want to filter by alignment quality or mappability, or both?

Difference is:
Alignment quality is the quality of matching between your sample and the reference
Mapping quality is the confidence that the read is actually mapped to the right site

Ex: could have a read that aligns perfectly in several locations, this gives it a high alignment score, but you don't have confidence where it actually belongs so you have a low mappability score

@hollygene
Copy link
Owner Author

Resources:
GeM for mappability:
https://www.biorxiv.org/content/10.1101/611160v1.full

Gblocks for alignment:
http://molevol.cmima.csic.es/castresana/Gblocks.html

@hollygene
Copy link
Owner Author

hollygene commented Mar 17, 2021

Problem running treeWAS on R on server: running into memory issues

  • used 36 threads
    Need to:
  • Remove columns (SNPs) with >50% NAs
  • Remove unnecessary objects from the environment
  • Try different number of threads
  • Look at treeWAS Github issues to see if anyone else had similar issues

@hollygene
Copy link
Owner Author

Still had issues running treeWAS:
I’m pretty sure the reason I can’t get it to work is because we have 500k SNPs. I even cut down the number of SNPs by removing some duplicates, but the matrix is still ~600 x 500000 which makes for 300 million data points… A task R just can’t really handle.
I’m thinking if we run Gblocks on the alignment, this might eliminate some places where there were a lot of SNPs (poor alignment = more SNPs?). Then on the newer, smaller alignment, we can use snp-sites again, and hopefully this will give us a more reasonable number of SNPs.
Also, treeWAS eliminates any samples that don’t have phenotypic data, so it will be best to get rid of those samples in the alignment as well (for this analysis).

@hollygene
Copy link
Owner Author

hollygene commented Mar 19, 2021

Ran Gblocks on core gene alignment

export PATH=/programs/Gblocks_0.91b:$PATH
Gblocks /workdir/hcm59/Ecoli/SNPs/core_gene_alignment.aln -t=d

Running snp-sites on new Gblocks alignment now

@hollygene
Copy link
Owner Author

output files from Gblocks:
Ecoli_subset_snpsGblocks.snp_sites.aln

This file is subsettted to only the samples we have phenotypic data for as well as the one long branch sequence removed

@hollygene
Copy link
Owner Author

treeWAS output:
print(out,sort.by.p=TRUE)
treeWAS.combined: pooled set of significant loci identified by any association score
treeWAS: a list with the sets of significant loci determined by each association score individually

@hollygene
Copy link
Owner Author

Some samples in phenotypic dataset that aren't dogs:
strain ECOL-17-VL-LA-KS-0009 Bos taurus
strain ECOL-17-VL-LA-TX-0003 Canis rufus
strain ECOL-17-VL-SD-NC-0028 Felis catus
strain ECOL-17-VL-LA-KS-0046 Felis catus
strain ECOL-17-VL-LA-KS-0031 Sus scrofa domesticus
strain ECOL-18-VL-SD-NC-0026 Equus ferus caballus
strain ECOL-18-VL-SD-NC-0001 Equus ferus caballus
strain ECOL-18-VL-SD-NC-0003 Felis catus
strain ECOL-18-VL-LA-KS-0004 Sus scrofa domesticus
strain ECOL-19-VL-LA-KS-0035 Felis catus

@hollygene
Copy link
Owner Author

(Removed the above samples)

@hollygene
Copy link
Owner Author

hollygene commented Apr 1, 2021

file with samples we want:
https://github.com/hollygene/CornellPostdoc/blob/98881dff0653805c747fde0a7e3648f26356de6c/phenDogs_samples_wanted.txt

identified by assemblies

if that doesn't work I can use a different ID (strain, I think)

@hollygene
Copy link
Owner Author

Run R on AHDC server:

screen
export OMP_NUM_THREADS=8
R

@hollygene
Copy link
Owner Author

hollygene commented Apr 8, 2021

Could be beneficial to use ClonalFrameML with treeWAS to correct for recombination
treeWAS has information on how to incorporate the two: https://github.com/caitiecollins/treeWAS/wiki/5.-ClonalFrameML-Integration

  • unnecessary - we already know it is really recombinant!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant