-
Notifications
You must be signed in to change notification settings - Fork 0
/
bioc-intro.Rmd
533 lines (386 loc) · 20 KB
/
bioc-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
---
title: "Bioconductor Introduction"
author: "Alistair Martin <br> (based on Mark Dunning's 2017 slides)"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_document:
toc: yes
toc_float: true
df_print: paged
---
# About R
If you haven't learned the basics of R prior to attending this course, you should check out our [R crash course](https://bioinformatics-core-shared-training.github.io/r-crash-course/) for an overview of R's syntax. It's also a great refresher if you feel it has been a while since you last worked with R.
# About the practicals for this workshop
- The traditional way to enter R commands is via opening a Terminal or, or using the console in RStudio (bottom-left panel when RStudio opens for first time).
- For this course we will instead be using a relatively new feature called *R Notebooks*.
- An R notebook mixes plain text written in markdown with "chunks" of R code.
*Markdown* is a very simple way of writing a template to produce a pdf, HTML or word document. For example, the compiled version of this document is available online and is more convenient to browse [here](https://bioinformatics-core-shared-training.github.io/cruk-autumn-school-2017/Day1/bioc-intro.nb.html).
- "Chunks" of R code can be added using the *insert* option from the tool bar, or the CTRL + ALT + I shortcut
- Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
- Or you can execute the whole chunk by pressing CTRL + SHIFT + ENTER
- Or you can press the green triangle on the right-hand side to run everything in the chunk
- The code might have different options which dictate how the output is displayed in the compiled document (e.g. HTML)
+ e.g. you might see `EVAL = FALSE` or `echo = FALSE`
+ you don't have to worry about this if stepping through the markdown interactively
```{r}
print("Hello World")
```
When viewing the R notebooks directly, not the compiled documents, sections may have additional characters so as to format them nicely when compiled. This is markdown, a simple language that formats the text whle, crucially, being still readable in the raw state.
For example:
*This will be displayed in italic*
**This will be displayed in bold**
- this
- is
- a
- list
+ this is a *sub-list*
You can also add hyperlinks, images and tables.
Lastly, you can even embed chunks of code written in other programming languages.
```{python}
a='Wow python'
print(a.split())
```
More help is available through RStudio **Help -> Markdown Quick Reference** or you can view a cheat sheet [here](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).
To create markdown files for your own analysis; File -> New File -> R Markdown...
# About the Bioconductor project
![](http://bioconductor.org/images/logo_bioconductor.gif)
Established in 2001, Bioconductor provided a convenient method to distribute tools for the analysis and comprehension of high-throughput genomic data in R. Initially focused on microarrays, Bioconductor now has packages (read: software) to process data obtained from most modern data sources.
- R is rarely used for the primary processing of modern data
+ R is far slower than many other programming languages due to it being an interpreted language ([Interpreted vs Compiled](https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zappldev/zappldev_85.htm))
+ R is extensively-used for visualisation, interpretation and inference once data has been parsed into a more manageable form, e.g., a csv.
On the [Bioconductor website](www.bioconductor.org), you will find
- Installation instructions
- [Course Materials](http://bioconductor.org/help/course-materials/)
- [Support forum](https://support.bioconductor.org/)
+ a means of communicating with developers and power-users
+ Stack Overflow tends to get faster/better responses [](https://bioinformatics.stackexchange.com/)
- [Example datasets](http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData)
- [Annotation Resources](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData)
- Conferences
For this session, we will introduce the Bioconductor project as a means of analysing high-throughput data
# Installing a package
All Bioconductor software packages are listed under
- bioconductor.org -> Install -> Packages -> Analysis *software* packages
+ Many thousands of packages have been added over the years, so I would suggest just googling "bioconductor [package_name]"
+ e.g. [edgeR landing page](http://bioconductor.org/packages/release/bioc/html/edgeR.html)
- installation instructions are given, which involves running the `install` command from the BiocManager package
+ this will install and update any additional dependencies
+ Running `BiocManager::install` forgoes needing to load the package manager into the work enviroment
- you only need to run this procedure once for each version of R
```{r eval=FALSE}
## You don't need to run this, edgeR should already be installed for the course
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
```
Once installed, a Bioconductor package can be loaded in the usual way with the `library` function. All packages are required to have a *vignette* which gives detailed instructions on how to use the package and the workflow of commands. Some packages such as `edgeR` have very comprehensive *user guides* with lots of use-cases.
```{r message=FALSE,eval=FALSE}
library(edgeR)
vignette("edgeR")
edgeRUsersGuide()
```
Package documentation can also be accessed via the *Help* tab in RStudio, which can also be invoked in the console using "?"
```{r message=FALSE,eval=FALSE}
?edgeR
```
# Structures for data analysis
Complex data structures are used in Bioconductor to represent high-throughput data, but we often have simple functions that we can use to access the data. We will use some example data available via Bioconductor to demonstrate how high-throughput data can be represented, and also to review some basic concepts in data manipulation in R.
- This data is from a *microarray* experiment. We will be concentrating on more modern technologies in this class, but most of the R techniques required will be eaxctly the same or at least very similar.
- [experimental data](http://bioconductor.org/packages/release/BiocViews.html#___ExperimentData) packages are available through Bioconductor, and can be installed in the way we just described
+ the package should already be installed on your computer, so you won't need to run this.
```{r eval=FALSE}
## No need to run this - for reference only!
BiocManager::install("breastCancerVDX")
```
To make the dataset accessible in R, we first need to load the package. If we navigate to the documentation for `breastCancerVDX` in RStudio, we find that it provides an object called `vdx` which we load into R's memory using the `data` function.
```{r message=FALSE}
library(breastCancerVDX)
data(vdx)
```
The object `vdx` is a representation of breast cancer dataset that has been converted for use with standard Bioconductor tools. The package authors don't envisage that we will want to view the entire dataset at once, so have provided a number of ways to interact with the data
- typing the name of the object provides a summary, e.g.,
+ how many genes in the dataset
+ how many samples
```{r, message=FALSE}
vdx
```
# Accessing expression values
The expression values can be obtained by the `exprs` function:
- remember, `<-` is used for assignment to create a new variable
- the data are stored in a `matrix` in R
+ it is a good idea to check the dimensions using `dim`, `ncol`, `nrow` etc.
```{r}
eValues <- exprs(vdx) # also found at vdx@assayData$exprs
class(eValues)
dim(eValues)
ncol(eValues)
nrow(eValues)
```
- the row names are the manufacturer-assigned ID for a particular probe
- the column names are the identifiers for each patient in the study
- each entry is a *normalised* log$_2$ intensity value for a particular gene in a given sample
+ we won't talk about normalisation here, but basically the data has been transformed so that samples and/or genes can be compared
- subsetting a matrix is done using the `[row, column]` notation
+ the function `c` is used to make a one-dimensional *vector*
+ the shortcut `:` can used to stand for a sequence of consecutive numbers
```{r}
eValues[c(1,2,3),c(1,2,3,4)]
eValues[1:3,1:4]
```
- subsetting can be chained together
- we can also omit certain rows or columns from the output by prefixing the indices with a `-`
```{r}
eValues[1:3,1:4][,2:3]
eValues[1:3,1:4][,-(2:3)]
```
# Simple visualisations
The most basic plotting function in R is `plot`
- using the `plot` function with a vector will plot the values of that vector against the index
+ what do you think is displayed in the plot below?
```{r}
plot(eValues[1,])
```
- one possible use is to compare the values in a vector with respect to a given factor
- but we don't know the clinical variables in our dataset yet (to come later)
- a boxplot can also accept a matrix or data frame as an argument
+ what do you think the following plot shows?
```{r fig.width=12}
boxplot(eValues,outline=FALSE)
```
# Accessing patient data
The *metadata*, or phenotypic data, for the samples is retrieved using the `pData` function.
```{r}
metadata <- pData(vdx) #vdx@phenoData@data
head(metadata)
```
- head prints only the first few rows/values.
+ you can specificy the exact number of rows to output, e.g., `head(metadata,10)`
- individual columns can be accessed using the `$` notation.
- columns are returned as a *vector*, which can be fed into other standard plotting and analysis functions.
- *auto-complete* is available in RStudio; once you type the `$`, it should give you a list of possible options.
```{r}
head(metadata$samplename,15)
```
******
******
******
## Exercise
- what type of R object is used to store the metadata?
- why is this different to the object used for the expression values?
- use the square bracket notation `[]` to print
+ the first 10 rows of the metadata object, first five columns
+ last 10 rows of the metadata object, columns 7 to 9
- visualise the distribution of the patient ages using a histogram
- calculate the average age of patients in the study with the `mean` function
+ what do you notice about the result?
+ can you change the arguments to mean to get a more meaningful result
******
******
******
```{r}
#FILL IN SOLUTIONS HERE
```
So far we have been able to print out a subset of our data by specifying a set of numeric indices (e.g. first 10 rows etc). Lets say we're interested in high-grade tumours, in which case we might not know in advance which rows these correspond to
- `==` `>`, `<`, `!=` can used to make a *logical comparison*
- result is a `TRUE` or `FALSE` indicating whether each entry satisfies the test condition, or not.
+ however, if a particular entry in the vector is `NA` the resulting logical vector will have `NA` at the same positions
- Multiple comparisons can be combined with the logic operators *and* (&) and *or* (|)
+ There's also *not* (!)
```{r}
table(metadata$grade == 3)
table(metadata$grade == 3,useNA="ifany")
```
Such a logical vector can then be used for subsetting
- `which` can be used to make sure there are no `NA` values in the logical vectors
+ it gives the numeric indices that correspond to `TRUE`
- here, we don't specify any column indices inside the `[]`
+ R will print all the columns
+ however, don't forget the ,
+ if you do, R will still try and do something. It will almost certainly be not what you expected
```{r}
metadata[which(metadata$grade == 3),]
```
Can use same expression to subset the columns of the expression matrix
- why can we do this? Because the *columns* of the expression matrix are in the same order as the *rows* of the metadata
+ don't believe me? see below...
- this isn't a coincidence. the data have been carefully curated to ensure that this is the case
- data stored in online repositories are often organised in this way
```{r eval=FALSE}
head(colnames(eValues))
head(rownames(metadata))
table(colnames(eValues) == rownames(metadata))
```
- we can subset the expression data according to our clinical data
```{r}
eValues[,which(metadata$grade==3)][1:10,1:10]
```
- in fact, we can subset the entire `vdx` object by sample subsets if we wish
```{r}
vdx[,which(metadata$grade==3)]
```
Previously, we used a boxplot to visualise the expression levels of all genes in a given sample to look for trends across the dataset. Another use for a boxplot is to visualise the expression level of a particular gene with respect to the sample metadata
- we can extract the column of interest with a `$` and use the *formula* syntax
+ `table` in this case will tell us how many observations of each category are present
- R will be clever and convert the factor into a `factor` type if required
```{r}
fac <- metadata$er
table(fac)
boxplot(eValues[1,] ~ fac,
xlab="ER Status",
ylab="Expression Level",
col=c("steelblue","goldenrod"))
```
Performing a test to assess significance follows similar syntax
- `t.test` is the generic function to perform a t-test, and can be adapted to different circumstances
+ e.g., if our data are paired, or not
+ see the help for `t.test` for more details
- for now, we will gloss over testing assumptions on the data such as requiring a *normal* (Gaussian) distribution and multiple testing correction
```{r}
t.test(eValues[1,]~fac)
```
# Accessing feature (gene) information
We could be lucky, and the first row in the expression matrix could be our favourite gene of interest! However, this is unlikely to be the case and we will have to figure out which row we want to plot
- we can use another aspect of the `vdx` object; the *feature data*
- there is a handy `fData` function to access these data
- again, this gives us a *data frame*
- this is a pre-built table supplied with the dataset
+ later in the course we will look at using online services and databases for annotation and converting between identifiers
```{r}
features <- fData(vdx) #try to figure out the location in memory of this data.frame via @ and $
class(features)
head(features[,1:5])
colnames(features)
```
As we know, gene symbols (more-common gene names) can be accessed using the `$` syntax; returning a vector
```{r}
head(features$Gene.symbol)
```
We could inspect the data frame manually (e.g. using `View(features)`) and identify the row number corresponding to our gene of interest. However, as aspiring R programmers, there is a better way
- `==` to test for exact matching
- `match` will return the *index* of the first match
- `grep` can be used for partial matches
- each of the above will give an *vector* that can be used to subset the expression values
+ Why do we get different results from the various commands?
```{r}
which(features$Gene.symbol == "BRCA1")
match("BRCA1",features$Gene.symbol)
grep("BRCA1",features$Gene.symbol)
grep("BRCA",features$Gene.symbol)
```
******
******
******
## Exercise
- Verify that the rows of the feature matrix and the expression values are in the same order
- Find the row corresponding to your favourite gene
+ if you don't have one, try `ESR1`
+ if you find multiple matches, pick the first one that occurs
- Does the expression level of this gene appear to be associated with the ER status?
******
******
******
```{r}
#FILL IN SOLUTIONS HERE
```
# Testing all genes for significance
Later in the course we will see how to execute a *differential expression* analysis for RNA-seq data, and discuss some of the issues surrounding this. For now we will perform a simple two-sample t-test for all genes in our study, and derive a results table
- firstly, we load the `genefilter` package which has a very convenient function for performing many t tests in parallel
- `rowttests` will run a t-test for each row in a given matrix and produce an output table
+ `statistic`; test statistic
+ `dm`; difference in means
+ `p.value`; the p-value
- `rowttests` expects a *factor* as the second argument, so we have to use `as.factor`
+ as usual, we can get help by doing `?rowttests`
```{r}
library(genefilter)
tstats <- rowttests(eValues, as.factor(metadata$er))
head(tstats)
hist(tstats$statistic)
```
The rows are ordered in the same way as the input matrix
- to change this to increasing significance we can use the `order` function
- when given a vector, `order` will return a vector of the same length giving the permutation that rearranges that vector into ascending or descending order
- The `sort` function shortcuts using the output of `order` to rearrange the original vector by returning the sorted vector
```{r}
x <- c(9, 3, 4, 2, 1, 6,5, 10, 8, 7)
x
order(x)
x[order(x)]
sort(x)
```
- so if we want to order by p-value we first use order on the p-value vector
- this can then be used to re-arrange the rows of the table
```{r}
head(tstats[order(tstats$p.value,decreasing = FALSE),])
```
However, the table we get isn't immediately useful unless we can recognise the manufacturer probe IDs
- to provide extra annotation to the table, we can *column bind* (`cbind`) the columns of test statistic with those from the feature matrix
- be careful though, we can only do this in cases where the rows are in direct correspondence
```{r}
table(rownames(tstats) == rownames(features))
tstats.annotated <- cbind(features[,c("Gene.symbol","EntrezGene.ID","Chromosome.location")], tstats)
head(tstats.annotated)
```
Now when we order by p-value, the extra columns that we just added allow us to interpret the results more easily
```{r}
tstats.ordered <- tstats.annotated[order(tstats$p.value,decreasing = FALSE),]
head(tstats.ordered)
```
We can also query this table to look for our favourite genes of interest
- `%in%` is a simplified way to perform matches to multiple items in a vector
```{r}
tstats.ordered[grep("ESR1",tstats.ordered$Gene.symbol),]
tstats.ordered[tstats.ordered$Gene.symbol %in% c("ESR1","GATA3","FOXA1"),]
```
******
******
******
## Exercise
- From the annotated table above, select all genes with p-values less than 0.05
- Write this data frame as a `csv` file (hint: use `write.csv`)
+ Check the outputted file. Is there anything weird about it?
- Use the `p.adjust` to produce a vector of p-values that are adjusted. Add this as an extra column to your table of results and write as a file
******
******
******
```{r}
#FILL IN SOLUTIONS HERE
```
# Using your own data
The object `vdx`, which contains the data we've been using, is an ExpressionSet object (check using the `class` function). ExpressionSet's are used as input in a variety of bioinformatic packages and pipelines. As we found above, an ExpressionSet usually contains the following data:
- expression data
- meta data (optional)
- feature data (optional)
Let's generate some data and create our own ExpressionSet from scratch
- `rnorm` generates normally distrubruted values
- `matrix` takes a vector and arranges it into the specified number of rows and columns
- `rep` is used to repeat a vector or its elements
- `sample` *samples* the input vector, with or without replacement
- `AnnotatedDataFrame` is a special container that conveniently stores and manipulates data
- Remeber that you can use help (`?`) to understand unknown functions
```{r}
eData <- matrix(rnorm(100000),nrow = 1000,ncol = 100)
mData <- AnnotatedDataFrame(data.frame(Sex=rep(c("M","F"),each=50),biomarker.X=rnorm(100)))
fData <- AnnotatedDataFrame(data.frame(Gene=1:1000,Chr=sample(1:23,1000,replace = T)))
exampleSet <- ExpressionSet(
assayData=eData,
phenoData=mData,
featureData=fData
)
```
******
******
******
## Exercise
- Are any genes differentially expressed in our toy dataset with respect to gender?
+ Hopefully these dissapear if you correct for multiple testing (`p.adjust`)
- Now perform the following:
+ run `eData[sample(1:1000,1),1:50] <- rnorm(50) + 1`
+ remake the ExpressionSet
- Using what you know from above, can you identify which gene was randomly increased in men?
******
******
******
```{r}
#FILL IN SOLUTIONS HERE
```