-
Notifications
You must be signed in to change notification settings - Fork 0
/
tximeta-gene-range.Rmd
260 lines (217 loc) · 7.53 KB
/
tximeta-gene-range.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
# Gene ranges in tximeta
Objective: explore how genes are located in the genome during
summarization with tximeta.
The tximeta package [@Love2020] imports gene expression quantification
into Bioconductor. The package is built around a few main ideas:
1. transcript-level quantification improves accuracy even for
gene-level inference [@Soneson2015]
2. summarization to gene level is a common use case
3. reference sequence hashes (think of a "barcode" that uniquely
identifies the cDNA sequences) can be used to identify the
provenance and release information of annotation
tximeta uses reference sequence hashes embedded in metadata files
provided with Salmon [@Patro2017] quantification output to
automatically attach genomic range information, transcript-to-gene
mapping, and genome build information to SummarizedExperiments. This
is done automatically, on behalf of the user, without needing to
specify the provenance or release numbers used during quantification.
tximeta can also be used in combination with de novo transcriptomes or
less common transcriptomes (see @Love2020 or package vignette).
When data is summarized to the gene level, tximeta by default uses the
`genes()` function to locate the genes in the genome. This provides a
genomic range which starts with the leftmost position of any isoform
and ends with the rightmost position of any isoform. This however may
not be a good representation of the locus of transcription.
In this chapter, we explore an alternative way to assign ranges in
tximeta's summarization step, which is based on isoform abundance. We
use plyranges and plotgardener packages to explore the differences
between these two options for range assignment.
Data description: the *macrophage* package provides Salmon
quantification files for a set of 24 RNA-seq samples from
@Alasoo2018. Data for six female human donors is available in the
package, with gene expression at baseline, after IFN-gamma
stimulation, after exposure to *Salmonella*, and after a combination
of the two treatments. We won't focus on the design of the experiment
itself, but we will use transcript-level quantification data to
explore the differences between the two range assignment options.
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
The following code chunk creates a `coldata` table describing the
samples.
```{r message=FALSE}
library(macrophage)
library(dplyr)
library(tximeta)
dir <- system.file("extdata", package="macrophage")
coldata <- read.csv(file.path(dir, "coldata.csv")) %>%
mutate(files = file.path(dir, "quants", names, "quant.sf.gz"),
condition = factor(condition_name),
condition = relevel(condition, "naive")) %>%
select(files, names=sample_id, condition, line_id) %>%
slice(1:4) # subset to a few samples
coldata %>%
dplyr::select(-files)
```
Importing the transcript-level data as a SummarizedExperiment:
```{r message=FALSE}
se <- tximeta(coldata, dropInfReps=TRUE)
```
We can then see the ranges of the transcripts and the genome build
information:
```{r message=FALSE}
library(SummarizedExperiment)
rowRanges(se)
seqinfo(se)
```
The default approach to assign ranges during summarization
(`rowRanges` will represent the total extent of all isoforms of the
gene):
```{r}
gse_default <- summarizeToGene(se)
```
And an alternative method to assign ranges by isoform abundance
(`rowRanges` will represent the extent of the most abundant isoform):
```{r}
gse <- summarizeToGene(se, assignRanges="abundant")
```
Note that these two steps provide identical counts, abundance,
etc. They only differ by their `rowRanges`.
```{r}
all.equal(assay(gse_default), assay(gse))
```
We can now compare the 5' location of the ranges using the two
approaches.
```{r message=FALSE}
library(plyranges)
# plyranges to locate the 5' start using default approach
default_5p <- gse_default %>%
rowRanges() %>%
anchor_5p() %>%
mutate(width=1) %>%
start()
```
Pull out information from the SummarizedExperiment: the average
abundance over the samples, and the absolute distance between the 5'
start of the new and the old approach.
```{r}
gene_dat <- gse %>%
rowRanges() %>%
anchor_5p() %>%
mutate(width=1) %>%
mutate(
ave_tpm = rowMeans(assay(gse, "abundance")),
dist = abs(start - default_5p))
# how many genes have same 5' start?
table(gene_dat$dist == 0)
```
While most genes have the same 5' start, a good number have changed
their start position.
We can summarize the extent of changes by gene expression:
```{r tss-distance-by-tpm}
library(ggplot2)
gene_dat %>%
as_tibble() %>%
filter(dist > 0) %>%
mutate(
log10_dist = cut(
log10(dist),
breaks=c(0,3:6,Inf),
include.lowest=TRUE),
tpm = cut(
ave_tpm,
breaks=c(0,1,10,100,Inf),
include.lowest=TRUE)
) %>%
group_by(log10_dist, tpm) %>%
tally() %>%
ggplot(aes(log10_dist, n)) +
geom_col() +
ggtitle("change in 5' position of gene") +
ylab("number of genes") +
facet_wrap(~tpm, labeller=label_both)
```
Considerations:
* The isoform-abundance based representation of the gene is likely a
much more accurate description of the locus of transcription than
the total gene extent.
* The isoform-abundance 5' start can be very far away from the total
gene extent 5' start.
* Gene-level summary is an incomplete representation of transcription,
and perhaps it may be better to perform analysis on the transcript
level, e.g. DTE (see `swish` in fishpond package or
`catchSalmon/catchKallisto` in edgeR) or DTU (see `DEXSeq`,
`satuRn`, `diffSplice` in limma package, @Segers2023, or the
rnaseqDTU workflow).
We conclude this chapter by diagramming the difference between the two
approaches for a single gene, using plotgardener [@plotgardener].
We begin by building a custom genome assembly for plotgardener
functions. Note the use of `tximeta::retrieveDb` to pull the exact
TxDb used in quantification of the data for representation of the
ranges in the plots.
```{r message=FALSE}
library(plotgardener)
library(org.Hs.eg.db)
# retrieves the correct TxDb/EnsDb for use in plots, etc.
txdb <- retrieveDb(gse)
new_assembly <- assembly(
Genome = "hg38",
TxDb = txdb,
OrgDb = org.Hs.eg.db,
gene.id.column = "GENEID",
display.column = "GENEID",
BSgenome = NULL
)
```
We pick a random gene with a large distance and high expression:
```{r}
set.seed(5)
gene <- gene_dat %>%
filter(strand == "-", dist > 1e5, ave_tpm > 100) %>%
slice(sample(n(),1))
```
The following sets up the parameters for the following plotgardener
command.
```{r}
par <- pgParams(
chrom = seqnames(gene) %>% as.character(),
chromstart = round((start(gene) - 5e5) / 1e5) * 1e5,
chromend = round((end(gene) + 5e5) / 1e5) * 1e5,
assembly = new_assembly,
just = c("left", "bottom")
)
```
Examine the isoform proportions for the top 5 isoforms of this gene:
```{r}
props <- gene$iso_prop[[1]] %>%
sort(decreasing=TRUE) %>%
head(5)
props
```
Highlight the gene, and the top 5 isoforms in the following plot.
```{r}
gene_hilite <- data.frame(
gene=gene$gene_id,
color="magenta"
)
txp_hilite <- data.frame(
transcript=names(props),
color=rep(c("dodgerblue","darkgoldenrod"),c(1,length(props)-1))
)
```
Building the plotgardener plot:
```{r tximeta-tss-plot, message=FALSE}
pageCreate(width = 5, height = 4, showGuides = FALSE)
plotGenes(
params = par, x = 0.5, y = 3.5, width = 4, height = 1,
geneHighlights = gene_hilite
)
plotTranscripts(
params = par, x = 0.5, y = 2.5, width = 4, height = 2.5,
transcriptHighlights = txp_hilite, fill="grey90"
)
plotGenomeLabel(
params = par, x = 0.5, y = 3.5, length = 4,
just = c("left", "top")
)
```