-
Notifications
You must be signed in to change notification settings - Fork 0
/
DGE_Analysis.Rmd
432 lines (323 loc) · 17.2 KB
/
DGE_Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
---
title: "Identifying Differentially Expressed Genes in response to Exercise training in Mice"
author: "21BT10024 - Rohan R. Barsagade"
date: "May 2, 2024"
output:
html_document:
theme: cerulean
self_contained: false
keep_md: yes
css: styles.css
---
```{r setup, echo=FALSE}
if (dir.exists("result/plots")) {
unlink("result/plots", recursive = TRUE)
}
knitr::opts_chunk$set(fig.path = "result/plots/")
```
# Introduction
In this study, we downloaded a publicly available gene expression dataset [GSE227516](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE227516){.uri} from Gene Expression Omnibus (GEO) which investigated the effects of exercise on pancreatic islet function in mice. Our aim is to analyze this dataset using DESeq2 to identify genes that show altered expression in response to exercise training.
# Install and Load Packages
```{r warning=F, message=F, class.source="code", class.output="code"}
# Install Required Packages (if not already installed)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("DESeq2", quietly = TRUE)) {
BiocManager::install("DESeq2")
}
if (!requireNamespace("RUVSeq", quietly = TRUE)) {
BiocManager::install("RUVSeq")
}
if (!requireNamespace("pheatmap", quietly = TRUE)) {
BiocManager::install("pheatmap")
}
if (!requireNamespace("RColorBrewer", quietly = TRUE)) {
BiocManager::install("RColorBrewer")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (!requireNamespace("ggrepel", quietly = TRUE)) {
install.packages("ggrepel")
}
# Load Packages
library(DESeq2)
library(RUVSeq)
library(pheatmap)
library(RColorBrewer)
library(ggplot2)
library(ggrepel)
```
# Data Import
We begin by importing the gene expression counts and sample information. The gene expression counts provide quantitative data on the expression levels of various genes across different samples, while the sample information helps us understand the experimental conditions associated with each sample.
```{r warning=FALSE, message=FALSE, class.source="code", class.output="code"}
# Import Gene Counts
COUNTS <- read.csv(file="data/GSE227516_counts.csv", header=TRUE, row.names=1)
head(COUNTS)
```
```{r class.source="code", class.output="code"}
# Import Sample Information
META <- read.csv(file="data/sample_information.csv", header=TRUE)
head(META)
```
# Data Preprocessing
Before diving into the analysis, we perform some basic data exploration and cleaning steps. This involves inspecting the structure of the imported data, and ensuring that the data are formatted correctly for further analysis.
```{r class.source="code", class.output="code"}
# Reorder columns
COUNTS <- COUNTS[, c("P1", paste0("P", 2:9), "P10")]
head(COUNTS)
```
```{r class.source="code", class.output="code"}
# Rounding off and convert to matrix
COUNTS <- round(COUNTS)
COUNTS <- as.matrix(COUNTS)
```
Dimensions of `COUNTS` matrix are : **`r nrow(COUNTS)`** Rows x **`r ncol(COUNTS)`** Columns
```{r class.source="code", class.output="code"}
# List unique sample conditions
unique(META$condition)
```
# Differential Gene Expression Analysis
## Create DESeq2 Dataset
We create a DESeq2 dataset from the gene expression counts and sample information. This step involves organizing the data into a format suitable for DESeq2 analysis, specifying the experimental design.
```{r warning=FALSE, message=FALSE, class.source="code", class.output="code"}
# Creating DESeq2 Dataset
dds <- DESeqDataSetFromMatrix(countData = COUNTS,
colData=META,
design=~condition)
```
```{r class.source="code", class.output="code"}
dim(dds)
```
## Filtering Low-Count Genes
To improve the quality of our analysis and focus on genes with sufficient expression levels, we filter out genes with low counts. This step helps reduce noise and computational burden in downstream analyses.
```{r class.source="code", class.output="code"}
# Filtering low count genes
threshold <- 10
dds <- dds[ rowMeans(counts(dds)) >= threshold,]
```
The dimensions of the filtered dataset is: **`r nrow(dds)`** Rows x **`r ncol(dds)`** Columns
## DESeq2 Analysis
We perform differential expression analysis using DESeq2, which includes estimating size factors and dispersion, and fitting the negative binomial model.
```{r warning=FALSE, message=FALSE, class.source="code", class.output="code"}
# DESeq2 Analysis
prdds <- DESeq(dds)
prdds
```
## Normalization and Transformation
Normalization and transformation are crucial steps to account for systematic biases and heterogeneity in sequencing data. We apply size factor estimation, regularized log transformation (rlog), and variance stabilizing transformation (VST) to normalize and transform the expression data. We will perform these steps and visually compare the results to understand their impact on the data.
```{r class.source="code", class.output="code"}
# Normalization
norm_counts <- counts(prdds, normalized = TRUE)
norm_counts <- as.data.frame(norm_counts)
head(norm_counts)
```
```{r class.source="code", class.output="code"}
# Transformation
mks <- estimateSizeFactors(dds)
rld <- rlogTransformation(prdds, blind = FALSE)
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
```
### Comparison of Transformations
Let's compare the distributions of the raw counts, rlog transformed data, and VST transformed data using scatter plots and histograms.
```{r scatter_plots, dev='png', fig.show='hide', class.source="code", class.output="code"}
# Scatter Plots Comparison
par(mfrow=c(1, 3))
lims <- c(-2, 20)
plot(log2(counts(mks, normalized=TRUE)[,1:2] + 1),pch=16, cex=0.3, main="log2(x + 1)", xlim=lims, ylim=lims)
plot(assay(rld)[,1:2], pch=16, cex=0.3, main="R log", xlim=lims, ylim=lims)
plot(assay(vsd)[,1:2], pch=16, cex=0.3, main="VST", xlim=lims, ylim=lims)
```
![](`r knitr::fig_chunk('scatter_plots', 'png')`)
```{r hist_plots, dev='png', fig.show='hide', class.source="code", class.output="code"}
# Histograms Comparison
par(mfrow=c(1, 3))
hist(counts(mks))
hist(assay(rld))
hist(assay(vsd))
```
![](`r knitr::fig_chunk('hist_plots', 'png')`)
By visually comparing these plots, we can see how the transformations affect the distribution of the data. The RLog Transform is leading to more symmetrical distributions in both the plots.
## Exploratory Data Analysis
Before conducting differential expression analysis, we perform exploratory data analysis (EDA) to assess the quality of the data and explore patterns within the dataset. This includes visualizing sample-to-sample distances, examining dispersion estimates, and conducting principal component analysis (PCA) to identify potential batch effects or sample outliers.
### Heatmap of sample-to-sample distances
```{r s2s_heatmap_plot, dev='png', fig.show='hide', class.source="code", class.output="code"}
# Sample-to-sample distances
sample_dist <- dist(t(assay(rld)))
sample_dist_matrix <- as.matrix(sample_dist)
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sample_dist_matrix,
clustering_distance_rows=sample_dist,
clustering_distance_cols=sample_dist,
col=colors,)
```
![](`r knitr::fig_chunk('s2s_heatmap_plot', 'png')`)
In this heatmap, each row and column represents a sample, and the color intensity in the cell at the intersection of two rows and columns represents the distance between those two samples. Samples that are more similar have a darker color, while samples that are more dissimilar have a lighter color.
### PCA Plot
```{r pca_plot, dev='png', fig.show='hide', message=FALSE, warning=FALSE, class.source="code", class.output="code"}
# PCA Plot
pca_data <- plotPCA(rld, intgroup = c("condition"), returnData = TRUE)
ggplot(pca_data, aes(x = PC1, y = PC2)) +
geom_point(size = 2, aes(color = condition)) +
geom_text_repel(aes(label = rownames(pca_data)), nudge_x = 0, nudge_y = 0) +
xlab(paste0("PC1: ", round(attr(pca_data, "percentVar")[1], 2) * 100, "% variance")) +
ylab(paste0("PC2: ", round(attr(pca_data, "percentVar")[2], 2) * 100, "% variance"))
```
![](`r knitr::fig_chunk('pca_plot', 'png')`)
Samples from the sedentary condition (P1, P2, P3, P4, and P5) are more spread out on the left side of the plot. This suggests that there is more variation in gene expression among these samples compared to the exercise samples. P3, P4, and P5 appear closer together, while P1 and P2 are more separate. This could indicate that there are subgroups within the sedentary condition with distinct gene expression patterns. Samples from the exercise condition (P6, P7, P8, P9, and P10) cluster together on the right side of the plot. This indicates that these samples have similar gene expression profiles.
We can conclude that ***exercise induces a strong and consistent change in gene expression***. This could be due to the activation of specific biological pathways in response to exercise.
### Dispersion Plot
In the Dispersion Plot, the x-axis represents the mean of normalized counts, which is a measure of how much a gene is expressed on average across samples. The y-axis represents dispersion, which is a measure of how much the counts vary across samples. The red curve in the plot is a fitted model that tries to explain the relationship between the mean of normalized counts and dispersion.
```{r dispersion_plot, dev='png', fig.show='hide', class.source="code", class.output="code"}
# Dispersion Plot
plotDispEsts(prdds, main = "Dispersion plot",
genecol="gray20", fitcol="red",
finalcol="dodgerblue3"
)
```
![](`r knitr::fig_chunk('dispersion_plot', 'png')`)
The dispersion plot shows the expected behavior. At low gene counts, the dispersion is high ***(around 1e+00)*** and tends to decrease at higher gene counts, ultimately becoming stable ***(around 1e-02)***.
## Differential Expression Testing
Using DESeq2, we conduct differential expression testing to identify genes that exhibit significant changes in expression levels between experimental conditions. We set a significance threshold (alpha) and perform statistical tests to determine differential expression, taking into account factors such as fold change and adjusted p-values.
```{r class.source="code", class.output="code"}
# DESeq2 Result
res05 <- results(prdds, alpha = 0.05)
res05 <- na.omit(res05)
```
```{r class.source="code", class.output="code"}
# Order by adjusted p-value
res05ordered <- res05[order(res05$padj),]
head(as.data.frame(res05ordered))
```
## Visualization
We visualize the results of our analysis using various plots, including MA plots, volcano plots, and heatmaps. These visualizations help us identify differentially expressed genes (DEGs) and gain insights into the biological significance of the findings.
### MA Plot
The MA plot shows the relationship between the mean of normalized counts (average expression level) and the log2 fold change (log2FC) for each gene. A positive log2FC indicates that the gene is upregulated in the second condition, while a negative log2FC indicates that the gene is downregulated.
The grey dots represent genes that are not statistically significant (adjusted p-value > 0.05), while the blue dots represent genes that are statistically significant (adjusted p-value <= 0.05).
```{r ma_plot, dev='png', fig.show='hide', class.source="code", class.output="code"}
# MA Plot
DESeq2::plotMA(
res05,
main="Sedentary vs Exercise, alpha=0.05",
ylim=c(-5,10),
cex=0.5,
colNonSig=adjustcolor("gray20", alpha.f=0.5),
colSig=adjustcolor("dodgerblue3", alpha.f=0.5)
)
abline(h = 1, col = '#ff0000' , lwd = 1)
abline(h = -1, col= '#ff0000', lwd = 1)
```
![](`r knitr::fig_chunk('ma_plot', 'png')`)
Based on the plot, it appears that there are **more genes that are up-regulated and few genes are down-regulated in the exercise condition compared to the sedentary condition**. This is because there are a lot of blue dots in the positive log2 fold change region of the plot, and very few blue dots in the negative log2 fold change region.
### Volcano Plot
The x-axis of a volcano plot shows the log2 fold change of a gene, which is the magnitude of the change in expression between the two conditions. The y-axis shows the -log10 of the adjusted p-value, which is a measure of the statistical significance of the change in expression. Genes with higher fold changes and lower p-values are considered to be more differentially expressed.
```{r volcano_plot, dev='png', fig.show='hide', class.source="code", class.output="code"}
# Volcano Plot
res05$gene_status <- ifelse(
res05$padj < 0.05,
ifelse(
res05$log2FoldChange > 1,
"Up-Regulated",
ifelse(
res05$log2FoldChange < -1,
"Down-Regulated",
"Non-significant"
)
),
"Non-significant"
)
ggplot(
res05,
aes(x = log2FoldChange, y = -log10(padj), color = factor(gene_status))
) +
geom_point(size = 1, alpha = 0.7) +
scale_color_manual(values = brewer.pal(3, "Set1")) +
theme_minimal() +
ggtitle("Volcano Plot of Differentially Expressed Genes") +
xlab("log2 Fold Change") +
ylab("-log10(Adjusted p-value)") +
theme(legend.title = element_blank()) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed")
```
![](`r knitr::fig_chunk('volcano_plot', 'png')`)
The red dots represent genes that are statistically significantly downregulated, the green dots represent genes that are statistically significantly upregulated, and the blue dots represent genes that are not statistically significant.
# Results
## Identification of Significant Differentially Expressed (DE) Genes
We can now identify significant differentially expressed genes based on statistical analysis. Genes with an adjusted p-value (padj) less than 0.05 and an absolute log2 fold change greater than 1 are considered significant DE genes.
### Significant Differentially Expressed Genes
```{r class.source="code", class.output="code"}
sig_genes <- as.data.frame(res05[res05$padj < 0.05 & abs(res05$log2FoldChange) > 1, ])
head(sig_genes)
```
We have identified a subset of genes that exhibit significant changes in expression levels under the experimental conditions. Let's proceed by separating these genes into upregulated and downregulated categories.
### Significant Up Regulated Genes
Genes displaying a positive log2FoldChange are classified as upregulated, indicating an increase in expression levels following exercise training
```{r class.source="code", class.output="code"}
up_genes <- subset(sig_genes, log2FoldChange > 0)
head(up_genes)
```
### Significant Down Regulated Genes
Genes exhibiting a negative log2FoldChange are categorized as downregulated, suggesting a decrease in expression levels post-exercise.
```{r class.source="code", class.output="code"}
down_genes <- subset(sig_genes,log2FoldChange < 0)
head(down_genes)
```
## Heatmap of top differentially expressed genes
:::: {style="display: flex; gap: 36px"}
::: {}
### Top 10 Up Regulated Genes
```{r upreg_heatmap, dev='png', fig.show='hide', class.source="code", class.output="code"}
top_up <- head(up_genes[order(up_genes$log2FoldChange, decreasing = TRUE), ], 10)
top_up_exp <- assay(rld)[rownames(top_up), ]
pheatmap(top_up_exp,
cluster_rows = FALSE,
cluster_cols = FALSE,
scale = "row",
show_colnames = TRUE,
col=brewer.pal(name="RdBu", n=11),
main = "Top Up Regulated Genes Heatmap")
```
![](`r knitr::fig_chunk('upreg_heatmap', 'png')`)
:::
::: {}
### Top 10 Down Regulated Genes
```{r downreg_heatmap, dev='png', fig.show='hide', class.source="code", class.output="code"}
top_down <- head(down_genes[order(down_genes$log2FoldChange, decreasing = TRUE), ], 10)
top_down_exp <- assay(rld)[rownames(top_down), ]
pheatmap(top_down_exp,
cluster_rows = FALSE,
cluster_cols = FALSE,
scale = "row",
show_colnames = TRUE,
col=brewer.pal(name="RdBu", n=11),
main = "Top Down Regulated Genes Heatmap")
```
![](`r knitr::fig_chunk('downreg_heatmap', 'png')`)
:::
::::
# Conclusion
```{r class.source="code", class.output="code"}
n_total <- nrow(COUNTS)
n_de_genes <- nrow(sig_genes)
n_up_genes <- nrow(up_genes)
n_down_genes <- nrow(down_genes)
```
Our analysis of the [GSE227516 Dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE227516){.uri} provided insights into the molecular changes occurring in pancreatic islet function following exercise training in mice.
We analyzed a total of **`r n_total` genes**, out of which **`r n_de_genes` genes** exhibited significant differential expression post-exercise. This means their expression levels were significantly different between the compared groups. Among the DE genes, **`r n_up_genes` genes** were upregulated, indicating increased expression levels,while **`r n_down_genes` genes** were downregulated, suggesting decreased expression levels.
We can now save these significant genes along with their regulation status in a separate file.
```{r class.source="code", class.output="code"}
# Combine significant genes with their categories
sig_genes$gene_status <- ifelse(sig_genes$log2FoldChange > 0, "Up-Regulated", "Down-Regulated")
sig_genes$gene_id <- rownames(sig_genes)
# Write the data frame to a file
if (!file.exists("result")) {
dir.create("result")
}
write.table(sig_genes, file = "result/significant_DE_genes.csv", sep = ",", row.names = FALSE)
```
# Session Summary
```{r class.source="code", class.output="code"}
sessionInfo()
```