-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSNP-filtering.Rmd
210 lines (139 loc) · 7.74 KB
/
SNP-filtering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
title: "Mahoenui Giant Wētā Supplementary Material"
author: "Nat Forsdick"
date: "2024-11-25"
output:
pdf_document: default
html_document: default
---
This supplementary file to the manuscript '*Population genomic analysis of Mahoenui giant wētā (*Deinacrida mahoenui*) reveals minimal reduction in genomic diversity following translocation*' (Forsdick et al. 2025) presents the output of SNP filtering and downstream analyses, predominantly using the SNPfiltR package.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE, fig.width = 7, fig.height = 6, ppi=200)
```
Get R and package details.
```{r}
#install.packages("SNPfiltR")
library(SNPfiltR)
library(vcfR)
R.Version()
(.packages())
mwlrcols <- c("#88cfa8", "cornflowerblue", "#898a8d")
```
Get vcf `populations-all2.snps.vcf`, and file encoding populations `Weta_GBS_Batch2_POP_blankrem.txt`.
```{r}
vcf <- read.vcfR("/path/to/Weta-GBS/2024-11-01-Results/populations-all2.snps.vcf")
pop_code <- read.delim("/path/to/Weta-GBS/2024-11-01-Results/Weta_GBS_Batch2_POP_blankrem.txt", sep="\t", header=F)
colnames(pop_code) <- c("id", "pop")
```
Visualise variant characteristics.
```{r}
hard_filter(vcfR=vcf)
```
Initial filtering on depth >=5, Q>=30.
Then check for allele balance.
Then consider maximum depth.
```{r}
vcf2<-hard_filter(vcfR=vcf, depth = 5, gq = 30)
#execute allele balance filter
vcf3<-filter_allele_balance(vcf2)
#visualize and pick appropriate max depth cutoff
max_depth(vcf3)
#filter vcf by the max depth cutoff you chose
vcf4<-max_depth(vcf3, maxdepth = 40)
# 0.66% of SNPs were above a mean depth of 50
# 0.7% of SNPs were above a mean depth of 48
# 0.75% of SNPs were above a mean depth of 46
# 7.41% of SNPs were above a mean depth of 28 and were removed from the vcf
# 5.5% of SNPs were above a mean depth of 30 and were removed from the vcf
# 1.05% of SNPs were above a mean depth of 40 and were removed from the vcf
vcf4
```
Now let's look at per sample missingness.
```{r}
#run function to visualize samples
missing_by_sample(vcfR=vcf4, popmap = pop_code)
# really useful outputs here
```
We see a couple of SR individuals that seem to be outliers with relatively high missingness and relatively low depth - likely the same 2 indivs.Now let's exclude samples above the missingness threshold of 0.8.
Then let's exclude monomorphic sites using `min.mac = 1`.
```{r}
#run function to drop samples above the threshold we want from the vcf
vcf5<-missing_by_sample(vcfR=vcf4, cutoff = .80)
#subset popmap to only include retained individuals
pop_code2<-pop_code[pop_code$id %in% colnames(vcf5@gt),]
vcf5
# check min mac
vcf.mac1 <- min_mac(vcf5, min.mac = 1)
vcf.mac1
```
Now let's check missingness by site across different thresholds (0.6, 0.7, 0.75, 0.8).
```{r}
#visualize missing data by SNP and the effect of various cutoffs on the missingness of each sample
missing_by_snp(vcf.mac1)
#verify that missing data is not driving clustering patterns among the retained samples
# threshold = SNP completeness cutoff
miss.mac1 <- assess_missing_data_pca(vcfR=vcf.mac1, popmap = pop_code2, thresholds = c(.6,.7,.75,.8), clustering = FALSE)
```
Notes from the tutorial "A good rule of thumb is that individual samples shouldn’t be above 50% missing data after applying a per-SNP missing data cutoff. So if we are retaining specific low data samples out of necessity or project design, we may have to set a more stringent per-SNP missing data cutoff, at the expense of the total number of SNPs retained for downstream analyses."
ok, so I think we want a cutoff of 0.6 - gives more than 10k SNPs, so still ok to filter a little further.
```{r}
#
#choose a value that retains an acceptable amount of missing data in each sample, and maximizes SNPs retained while minimizing overall missing data, and filter vcf
vcf.mac1.snpmiss60<-missing_by_snp(vcf.mac1, cutoff = .6)
vcf.mac1.snpmiss60
```
Now let's consider different minimum minor allele counts - we ideally want to exclude singletons (and private doubletons) that may impact structure analysis. Let's consider `min.mac = {2,3}`.
```{r}
#investigate clustering patterns with and without a minor allele cutoff
#use min.mac() to investigate the effect of multiple cutoffs
vcf.snpmiss60.mac2 <-min_mac(vcfR = vcf.mac1.snpmiss60, min.mac = 2)
vcf.snpmiss60.mac2
vcf.snpmiss60.mac3 <-min_mac(vcfR = vcf.mac1.snpmiss60, min.mac = 3)
vcf.snpmiss60.mac3
#assess clustering without MAC cutoff
miss<-assess_missing_data_tsne(vcf.mac1.snpmiss60, pop_code2, clustering = FALSE)
#assess clustering with MAC cutoff
miss<-assess_missing_data_tsne(vcf.snpmiss60.mac2, pop_code2, clustering = FALSE)
#assess clustering with MAC cutoff
miss<-assess_missing_data_tsne(vcf.snpmiss60.mac3, pop_code2, clustering = FALSE)
```
Now let's look at the filtered set with min.MAC = a) 1 and b) 2 and consider per sample 1) depth and 2) genotype quality.
```{r}
#plot depth per snp and per sample
dp <- extract.gt(vcf.snpmiss60.mac2, element = "DP", as.numeric=TRUE)
heatmap.bp(dp, rlabels = FALSE)
#plot genotype quality per snp and per sample
gq <- extract.gt(vcf.snpmiss60.mac2, element = "GQ", as.numeric=TRUE)
heatmap.bp(gq, rlabels = FALSE)
#plot depth per snp and per sample
dp <- extract.gt(vcf.snpmiss60.mac3, element = "DP", as.numeric=TRUE)
heatmap.bp(dp, rlabels = FALSE)
#plot genotype quality per snp and per sample
gq <- extract.gt(vcf.snpmiss60.mac3, element = "GQ", as.numeric=TRUE)
heatmap.bp(gq, rlabels = FALSE)
```
Then we can thin the variants to only a single SNP per 500 bp, and write out the filtered VCF `weta.snpmiss60.mac3.thin.vcf`.
```{r }
#write out vcf with all SNPs
vcfR::write.vcf(vcf.snpmiss60.mac2, "/path/to/Weta-GBS/2024-11-01-Results/weta.snpmiss60.mac2.vcf.gz")
#linkage filter vcf to thin SNPs to one per 500bp
vcfR.thin<-distance_thin(vcf.snpmiss60.mac2, min.distance = 500)
#write out vcf with thinned SNPs
vcfR::write.vcf(vcfR.thin, "/path/to/Weta-GBS/2024-11-01-Results/weta.snpmiss60.mac2.thin.vcf.gz")
#linkage filter vcf to thin SNPs to one per 500bp
vcfR.thin3<-distance_thin(vcf.snpmiss60.mac3, min.distance = 500)
#write out vcf with thinned SNPs
vcfR::write.vcf(vcfR.thin3, "/path/to/Weta-GBS/2024-11-01-Results/weta.snpmiss60.mac3.thin.vcf.gz")
```
```{r}
## adding outlier analysis - results of BayeScan for SNPs under selection
source("/path/to/Weta-GBS/2024-11-01-Results/bayescan-out/plot_R.r")
plot_bayescan("/path/to/Weta-GBS/2024-11-01-Results/bayescan-out/out-weta.snpmiss60.mac3.thin.bayes_fst.txt", FDR=0.05)
#tiff("/path/to/Weta-GBS/2024-11-01-Results/bayescan-out/bayescan-outliers.tiff", compression = "lzw", units = "in", width=5, height = 5, res = 400)
#plot_bayescan("/path/to/Weta-GBS/2024-11-01-Results/bayescan-out/out-weta.snpmiss60.mac3.thin.bayes_fst.txt", FDR=0.05)
#dev.off()
```
We used BayeScan to screen the VCF for putative SNPs under selection. The figure above shows the results of this analysis, plotted using the `plot_bayescan` function, to identify outlier SNPs.
Below we reproduce the plots of fastSTRUCTURE results where individuals are ordered by 1) sex, and 2) colour. In these plots, each vertical line represents one individual, coloured according to cluster assignment, when _K_ = 3. We do not reproduce the plots for _K_ = 2, as no population structure was identified. There does not appear to be any association between weak population structure and either of these characteristics.
![Sex]("/path/to/Weta-GBS/2024-11-01-Results/FastStructure/figures/CLUMPP_merged_k3_sex.png"){width="120%"}
![Colour]("/path/to/Weta-GBS/2024-11-01-Results/FastStructure/figures/CLUMPP_merged_k3_sex.png"){width="120%"}