-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathTSRchitectUsersGuide.Rmd
526 lines (391 loc) · 26.5 KB
/
TSRchitectUsersGuide.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
---
title: "TSRchitect User's Guide"
author: "R. Taylor Raborn"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{TSRchitect User's Guide}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# TSRchitect User's Guide
## R. Taylor Raborn and Volker P. Brendel
## Department of Biology, Indiana University
## First edition 25 January 2017
## Updated 10 July 2018
TSRchitect is an R package for analyzing diverse types of high-throughput transcription start site (TSS) profiling datasets.
In recent years, large-scale TSS profiling data has characterized the landscape of transcription initiation at high resolution, identifying promoter architecture in a number of eukaryotic model systems, including human, mouse, fruit fly and worm.
TSRchitect can handle TSS profiling experiments that contain either single-end or paired-end sequence reads.
Examples of TSS profiling data types that TSRchitect is capable of handling are:
- CAGE (Cap Analysis of Gene Expression) [Single-end]
- PEAT (Paired-end Analysis of Transcription) [Paired-end]
- RAMPAGE (RNA Annotation and Mapping of Promoters for Analysis of Gene Expression) [Paired-end]
- CapSeq [Single-end]
- TSS-seq [Single-end]
- STRIPE-seq (Survey of Transcription Initiation at Promoters and Enhancers) [Single-end or Paired-end]
TSRchitect provides the capability to efficiently identify putative promoters---which we call transcription start regions (TSRs)---from TSS profiling experiments.
TSRchitect can accommodate multiple datasets, including biological replicates and multiple tissues/conditions, that were generated in a variety of model organisms and genome assemblies, requiring only aligned TSS profiling information (in BAM format) as the initial input. To aid the downstream analysis of identified promoters, TSRchitect calculates a variety of TSR properties that have been shown to be associated with promoter architecture, including TSR activity, width, Shape Index (SI), modified Shape Index (mSI) and torque.
Finally, TSRchitect's output is compatible with popular differential expression software such as [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), assisting in downstream analysis to identify TSRs that are differentially active in one sample versus another. In addition to this vignette, the TSRchitect Reference Manual is available as part of the package's online documentation. Please note that this document will guide you through a few examples of analysis; if you're intrested in a detailed discussion of individual functions, please see the TSRchitect Reference Manual, which is found in `man/TSRchitect.pdf` and online [here](https://bioconductor.org/packages/release/bioc/manuals/TSRchitect/man/TSRchitect.pdf).
## Getting started
In your current directory, create subdirectories as follows on the command line.
```{bash eval=FALSE}
mkdir downloads
cd downloads
mkdir HsRAMPAGEbam
```
Now that this is complete, we proceed with the first of the three examples contained in this vignette.
## Example 1: Identifying promoters from RAMPAGE data derived from two human cell lines.
RAMPAGE is a TSS profiling method that identifies promoters at large-scale using a cap-based library construction method that is adapted for paired-end sequencing. Developed recently by Batut and Gingeras (2013), it has become a popular method for promoter identification and is currently part of the data compredium in the latest edition of the ENCODE project.
In this example we will process RAMPAGE data derived from two immortalized human cell lines with TSRchitect. The experiments selected for this vignette are part of the ENCODE project and are publically available online at the [ENCODE Experiment matrix](https://www.encodeproject.org/matrix/?type=Experiment). The two samples come from HT1080 cells, which is a well-characterized fibrosarcoma cell line, and NCI-H460 cells, which are derived from a large cell lung carcinoma in a male patient.
To begin, we must first download the RAMPAGE datasets (which were aligned to GRCh38 and are in BAM format) to our local system. To accomplish this we will utilize the "ENCODExplorer" package, which is part of the Bioconductor suite. More information on ENCODExplorer package can be found at the following link: https://www.bioconductor.org/packages/release/bioc/html/ENCODExplorer.html.
Now we can proceed (in the R console) with downloading the data:
```{r eval=FALSE}
#Downloading the files:
library(ENCODExplorer)
data(encode_df, package="ENCODExplorer")
datasets <- fuzzySearch(searchTerm=c("ENCFF214GWH", "ENCFF265SGZ", "ENCFF242UWH", "ENCFF348EKW"), database=encode_df, filterVector=c("file_accession"), multipleTerm=TRUE)
downloadEncode(datasets, df=encode_df, format="bam")
```
Once the above steps are complete, we will move the files into the subdirectory `downloads/` to keep a record of data provenance.
To use TSRchitect in this example, we coveniently use symbolic links to give more intuitive names to the datasets (as well as ordering them in proper order; see below):
```{bash eval=FALSE}
#The following are to be executed in your shell terminal:
mv *.bam downloads
cd downloads
cd HsRAMPAGEbam
ln -s ../downloads/ENCFF214GWH.bam H460-rep1.bam
ln -s ../downloads/ENCFF265SGZ.bam H460-rep2.bam
ln -s ../downloads/ENCFF242UWH.bam HT1080-rep1.bam
ln -s ../downloads/ENCFF348EKW.bam HT1080-rep2.bam
cd ..
```
Please download and uncompress the human gene annoation file found at the following location:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz.
Please move the uncompressed file into the downloads/ directory, e.g. with
```{bash eval=FALSE}
cd downloads
wget --content-disposition ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz
gunzip gencode.v19.annotation.gff3.gz
cd ..
```
Now that we have our input and annotation files prepared, we can load TSRchitect into our R workspace:
```{r eval=FALSE}
#loading TSRchitect
library(TSRchitect)
```
Next we'll initialize our dedicated S4 object---called the tssObject---on which TSRchitect's
functions are applied, using `loadTSSobj`. We also need to supply other information about the
experiment as arguments. This will attach GenomicAlignments objects (representing the four bam files in this example) to your tssObject.
Note that we must specify `isPairedBAM=FALSE` because this is single-end CAGE data.
```{r eval=FALSE}
Hs_RAMPAGE <- loadTSSobj(experimentTitle ="Human RAMPAGE", inputDir="HsRAMPAGEbam/", /
isPairedBAM=TRUE, sampleNames=c("H460-rep1", /
"H460-rep2", "HT1080-rep1", "HT1080-rep2"), replicateIDs=c(1,1,2,2))
```
Next we need to provide the sample names and specify which samples are biological replicates.
In this case we are working with 4 total datasets and 2 samples in duplicate. Please note that, because the alignments on our `bamData` slot are organized in ascending alphabetical order (as are the file names on the `fileNames` slot), we must provide our identifiers in `sampleNames` and `replicateIDs` to directly correspond to this.
To check this on the tssObject [S4 object](http://adv-r.had.co.nz/S4.html) you have created, simply check the list of .bam files as follows using one of TSRchitect's accessor methods.
```{r eval=FALSE}
#obtaining a list of bam files loaded to the tssObject S4 object
getFileNames(experimentName=Hs_RAMPAGE)
```
Now that the alignment files have been imported and attached to our tssObject S4 object, we continue by computing the TSSs from the alignment (in this case .bam) files.
```{r eval=FALSE}
Hs_RAMPAGE <- inputToTSS(experimentName=Hs_RAMPAGE)
```
Next we will calculate the abundance of each tag in our TSS datasets. TSRchitect provides the option to run this function in parallel, so for this example we chose to run this on 4 cores. Please adjust this as your compute resource permits.
```{r eval=FALSE}
Hs_RAMPAGE <- processTSS(experimentName=Hs_RAMPAGE, n.cores=4, tssSet="all", writeTable=TRUE)
```
Since we specified 'writeTable=TRUE', files (entitled "TSSset-1.txt" to "TSSset-4.txt") containing TSS abundance will be written into your working directory.
Now that we have calculated the abundance for each TSS in the previous step we can calculate TSRs (promoters) on each of the 4 separate datasets using the function `determineTSR`.
We select a `tagCount` threshold of 25 tags in order for a TSS to be considered.
The option `clustDist` is critical for the identification of TSRs and refers to the minimum distance between distinct TSRs (in other words, adjacent TSSs seperated by more than `clustDist` nucleotides will be in different TSRs).
As with the previous step, we select 'writeTable=TRUE', and therefore we will find the output files ("TSRset-1" to "TSRset-4") written to the working directory. As for `processTSS`, we can run this function in parallel. We have specified 4 cores.
```{r eval=FALSE}
Hs_RAMPAGE <- determineTSR(experimentName=Hs_RAMPAGE, n.cores=4, /
tsrSetType="replicates", tssSet="all", tagCountThreshold=25, clustDist=20, /
writeTable=TRUE)
```
To calculate TSRs from each sample (as opposed to each replicate) we need to combine our replicate data. This will be done using the identifiers we specified on our tssObject S4 object using `loadTSSobj`.
```{r eval=FALSE}
Hs_RAMPAGE <- mergeSampleData(experimentName=Hs_RAMPAGE)
```
Having combined the TSS abundance of replicate data into samples, we next proceed with identifying TSRs for the two samples individually.
We specify this with 'tsrSetType="merged"'.
```{r eval=FALSE}
#Generating the TSRs for the merged datasets:
Hs_RAMPAGE <- determineTSR(experimentName=Hs_RAMPAGE, n.cores=4, tsrSetType="merged" /
tssSet="all", tagCountThreshold=40, clustDist=20, writeTable=TRUE)
```
Now calculating the number of tags from each experiment within the combined set of TSRs:
```{r eval=FALSE}
Hs_RAMPAGE <- addTagCountsToTSR(experimentName=Hs_RAMPAGE, tsrSetType="merged", /
tsrSet=3, tagCountThreshold=40, writeTable=TRUE)
```
### Associating identified TSRs with gene annotations
Now that identifying TSRs are complete, an obvious and biologically useful step is to determine which TSRs are adjacent to annotated genes, and to retrieve the appropriate gene IDs. Before doing this, it is imperative to select an annotation file that was generated for the assembly to which the reads were aligned. For this example we will retrieve the appropriate annotation files from the Bioconductor package `AnnotationHub`.
Please install `AnnotationHub` if you haven't already done so.
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("AnnotationHub")
```
In our case we need to download the [Gencode](https://www.gencodegenes.org/) annotation.
We do this in the following manner:
```{r eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
query(hub, c("gencode", "gff", "human"))
```
This reveals nine gff3 annotations from Gencode that we can choose from.
We will select the full annotation, which has the identifier "AH49555".
```
AnnotationHub with 9 records
# snapshotDate(): 2017-01-05
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH49554"]]'
title
AH49554 | gencode.v23.2wayconspseudos.gff3.gz
AH49555 | gencode.v23.annotation.gff3.gz
AH49556 | gencode.v23.basic.annotation.gff3.gz
AH49557 | gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz
AH49558 | gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
AH49559 | gencode.v23.long_noncoding_RNAs.gff3.gz
AH49560 | gencode.v23.polyAs.gff3.gz
AH49561 | gencode.v23.primary_assembly.annotation.gff3.gz
AH49562 | gencode.v23.tRNAs.gff3.gz
```
Using TSRchitect, We can use the function `importAnnotationHub` to import our desired annotated record and attach it our `tssObject`.
We accomplish this as follows:
```{r eval=FALSE}
Hs_RAMPAGE <- importAnnotationHub(experimentName=Hs_RAMPAGE, provider="gencode", /
annotType="gff3", species="human", annotID="AH49555")
```
Next, we associate the gene annotation to the TSRs within our two merged samples.
We selected the feature 'transcript' from the Gencode annotation.
```{r eval=FALSE}
Hs_RAMPAGE <- addAnnotationToTSR(experimentName=Hs_RAMPAGE, tsrSetType="merged", /
tsrSet=1, upstreamDist=1000, downstreamDist=200, feature="transcript", /
featureColumnID="ID", writeTable=TRUE)
Hs_RAMPAGE <- addAnnotationToTSR(experimentName=Hs_RAMPAGE, tsrSetType="merged", /
tsrSet=2, upstreamDist=1000, downstreamDist=200, feature="transcript", /
featureColumnID="ID", writeTable=TRUE)
```
Finally, we will repeat the two commands above, instead associating the gene annotation to the "combined" set of TSRs, which is found in the 3rd position on the `tsrDataMerged` slot.
```{r eval=FALSE}
Hs_RAMPAGE <- addAnnotationToTSR(experimentName=Hs_RAMPAGE, tsrSetType="merged", /
tsrSet=3, upstreamDist=1000, downstreamDist=200, feature="transcript", /
featureColumnID="ID", writeTable=TRUE)
```
Let's briefly look at the sets of TSRchitect-identified TSRs.
Using the accessor methods we applied in earlier examples, let's take a quick glance at our set of identified TSRs.
```{r eval=FALSE}
HT1080.tsrs <- getTSRdata(Hs_RAMPAGE, slotType="merged", slot=1)
dim(HT1080.tsrs)
```
```{r eval=FALSE}
H460.tsrs <- getTSRdata(Hs_RAMPAGE, slotType="merged", slot=2)
dim(H460.tsrs)
```
```{r eval=FALSE}
combined.tsrs <- getTSRdata(Hs_RAMPAGE, slotType="merged", slot=3)
dim(combined.tsrs)
```
Let's look at some of the tsrs we identified on our 'combined' set.
```{r eval=FALSE}
head(combined.tsrs)
```
We see that there are 22750 TSRs identified in the combined set, and 15904 and 18040 TSRs in the H460 and HT1080 samples, respectively.
We also notice that there are 5 additional columns in the combined set.
This is due to us previouly having added tag counts to the combined set of TSRs using `addTagCountsToTSR', something we did not do in this vignette for the two individual samples.
You now have a complete set of TSS and TSR data attached to your `tssObject` S4 object, in addition the tables that were already written to your working directory.
To better understand our data, let's explore some of the characteristics of the TSRs we have identified. TSRchitect calculates the Shape Index (SI) of each TSR; the SI provides a quantitative measure of TSR (and thus promoter) shape by representing the entropy of the distribution of TSSs associated with it. An SI of 2 (which is the maximum value possible) will have only a single unique TSS coordinate, whereas a TSR with a negative SI value will have an diversity of mapped 5' ends at distinct TSS positions. We can use SI values to classify TSRs into 'peaked' and 'broad' classes, having high and low SI values, respectively. As of TSRchitect version 1.8.0, we have included the Modified Shape Index (mSI) as an additional metric defining TSR shape. The mSI is a metric of TSR shape scaled by TSS tag abundance that can return possible values from 0 to 1 inclusive, where 1 is the most peaked (i.e. a TSR with a single unique TSS position) and 0 is the most broad TSR.
We can visualize the shape distribution of our identified TSRs quite easily using Hadley Wickham's `ggplot2` graphics package, as follows. For these plots we filter will out TSRs at lower abundances (100 tags/TSR) to show only the SI values from reasonably well-sampled TSRs.
```{r eval=FALSE}
require(ggplot2)
HT1080.tsrs.filtered <- HT1080.tsrs[HT1080.tsrs$nTAGs > 100,]
t <- ggplot(HT1080.tsrs.filtered, aes(tsrSI))
t + geom_histogram(binwidth=0.1, fill="blue2") /
+ ylab("Number of Tags per TSR") + xlab("Shape Index (SI)")
ggsave(file="HT1080_SI.png")
```
```{r eval=FALSE}
H460.tsrs.filtered <- H460.tsrs[H460.tsrs$nTAGs > 100,]
p <- ggplot(H460.tsrs.filtered, aes(tsrSI))
p + geom_histogram(binwidth=0.1, fill="darkgreen") /
+ ylab("Number of Tags per TSR") + xlab("Shape Index (SI)")
ggsave(file="H460_SI.png")
```
From both histograms we can see that there appears to be a bimodal distribution of TSRs within our samples; the class of 'completely peaked' TSRs with an SI value of 2, and another distribution of more broad TSRs with SI values centered just below 0. This is consistent with two major shape classes of TSRs known in metazoans, particularily human and mouse.
Similarly, we could use the same code framework above to plot the distributions of TSR width (tsrWdth), number of unique TSSs per TSR (nTSSs), the mSI and torque (balance of the TSR according to tag counts; tsrTrq); to do this we would select the appropriate column in the data frame. For example: equivalent plots can generated for mSI simply by replacing the term "tsrSI" with "tsrMSI" in the two code snippets above.
This concludes Example 1. Should we wish to save our tssObject and return to our work later, we simply type the following, which will write an R binary to your working directory.
```{r eval=FALSE}
save(Hs_RAMPAGE, file="Hs_RAMPAGE.RData")
```
__Important note__: before you continue with another example, please move the output files generated in your working directory to a separate, dedicated folder.
Otherwise some or all of the files you generate in subsequent examples will be overwritten.
## Example 2: Identifying promoters in the model plant A. thaliana using a PEAT dataset.
For our second example will process TSS profiling data from Arabidopsis root tissue.
These data come from the Megraw Lab at Oregon State University as reported in Morton et al., 2014. [Link to paper:](http://www.plantcell.org/content/26/7/2746)
As with the previous example, first we must download the raw data. In this case we have only a single alignment file to retrieve, which is found here: https://oregonstate.app.box.com/s/3lb3spmqbiuofhbubovc1z8bfthmxh6f
Once download of the file `peat.sorted.bam` is complete, please move it to subdirectory "PEATbam/".
The annotation file is available from the TAIR10 database: ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff. Please move it into the downloads/ folder you have created, e.g.
```{r eval=FALSE}
mkdir downloads
cd downloads
wget --content-disposition ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
cd ..
```
Because there is only a single experiment, setting the sample IDs is simple in this example:
```{r eval=FALSE}
At_PEAT <- loadTSSobj(experimentTitle ="Arabidopsis PEAT dataset", inputDir="PEATbam", /
isPairedBAM=TRUE, sampleNames=c("experiment1"), replicateIDs=c(1))
```
Now we convert the alignment data (in this case, in .bam format) into TSS coordinates:
```{r eval=FALSE}
At_PEAT <- inputToTSS(At_PEAT)
```
As in the previous example, now we can calculate the tag abundance at each location using `processTSS` and the identify TSRs within the sample using `determineTSR`. Note that we do not need to use `mergeSampleData` because there is only a single sample. As there is only a single sample we set `n.cores=1`.
```{r eval=FALSE}
At_PEAT <- processTSS(experimentName=At_PEAT, n.cores=1, tssSet="all", writeTable=TRUE)
At_PEAT <- determineTSR(experimentName=At_PEAT, n.cores=1, tsrSetType="replicates", /
tssSet="all", tagCountThreshold=25, clustDist=20, writeTable=TRUE)
```
### Associating identified TSRs with gene annotations
We continue by associating our newly-identified TSRs with genes from the TAIR10 annotation. Note that we use different parameters for upstreamDist and downstreamDist than we did in Example 1. This is due to the high degree of compactness in the A. thaliana genome.
```{r eval=FALSE}
At_PEAT <- importAnnotationExternal(experimentName=At_PEAT, fileType="gff3", /
annotFile="downloads/TAIR10_GFF3_genes.gff")
At_PEAT <- addAnnotationToTSR(experimentName=At_PEAT, tsrSetType="replicates", tsrSet=1, /
upstreamDist=500, downstreamDist=200, feature="gene", featureColumnID="ID", writeTable=TRUE)
```
Now we have a complete set of TSRs on our tssObj object. Let's take a look at them using one of our accessor methods.
```{r eval=FALSE}
At.tsrs <- getTSRdata(At_PEAT, slotType="replicates", slot=1)
dim(At.tsrs)
head(At.tsrs)
```
We can optionally save the tssObject as we have previously.
```{r eval=FALSE}
save(At_PEAT, file="At_PEAT_vignette.RData")
```
## Example 3: Analysis of CAGE datasets from the ENCODE project
As we stated in the introduction of this vignette, TSRchitect is capable of handling diverse forms of TSS profiling data. In the first two examples, we analyze two distinct paired-end datasets: RAMPAGE and PEAT, respectively. In this example we will process data from CAGE, which is the most widely-used TSS profiling method to date.
We will analyze CAGE data generated in two well-characterized immortalized cell lines, MCF-7 and A549. MCF-7 cells are derived from a breast cancer tumor, and A549 originates from an adenocarinoma isolated from lung tissue. Both datasets are part of the ENCODE project, and therefore we can make use of the ENCODExplorer package that we originally introduced in Example 1.
```{r eval=FALSE}
#Downloading the files:
library(ENCODExplorer)
data(encode_df, package="ENCODExplorer")
cage_data <- fuzzySearch(searchTerm=c("ENCFF552BXH","ENCFF288VTZ","ENCFF265RSX", /
"ENCFF944PCJ"), database=encode_df, /
filterVector=c("file_accession"), multipleTerm=TRUE)
downloadEncode(cage_data, df=encode_df, format="bam")
```
Now that the files have been downloaded, we will create symbolic links with the appropriate sample names. Please run the following commands from a linux command line:
```{bash eval=FALSE}
mkdir downloads # ignore if the directory already exists
mv *.bam downloads
mkdir HsCAGEbam
cd HsCAGEbam
ln -s ../downloads/ENCFF265RSX.bam A549-rep1.bam
ln -s ../downloads/ENCFF944PCJ.bam A549-rep2.bam
ln -s ../downloads/ENCFF552BXH.bam MCF7-rep1.bam
ln -s ../downloads/ENCFF288VTZ.bam MCF7-rep2.bam
cd ..
```
As in Example 1, we also need to provide human gene annotation.
Please download and uncompress the annoation file found at the following location [Note: you may ignore this if you have already downloaded the file from Example 1]:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz.
Please move the uncompressed file into the downloads/ directory if it does not already exist there, e.g. with
```{bash eval=FALSE}
cd downloads
wget --content-disposition ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz
gunzip gencode.v19.annotation.gff3.gz
cd ..
```
Now we can set up the tssObject S4 object. Note that we must specify `isPairedBAM=FALSE` because this is single-end CAGE data.
```{r eval=FALSE}
# initializing the tssObject, setting the sample IDs and importing the CAGE data
CAGEhuman <- loadTSSobj(experimentTitle ="Human CAGE", inputDir="HsCAGEbam", /
isPairedBAM=FALSE, sampleNames=c("A549-rep1","A549-rep2", /
"MCF7-rep1","MCF7-rep2"), replicateIDs=c(1,1,2,2) )
```
As before, it is vital to provide arguments to `sampleNames` and `replicateIDs` in the order of the files on the `fileNames` slot to they exactly correspond to the alignment files (in this case .bam) that we imported.
As in the prior two examples, we then extract TSS information from the attached alignment data (which was placed on slot `@bamData`):
```{r eval=FALSE}
#Converting the alignment data into TSS information and attaching it to the tssObject:
CAGEhuman <- inputToTSS(experimentName=CAGEhuman)
```
Next we must calculate the CAGE tag abundance at each TSS position, followed by identification of TSRs within our 4 replicate datasets. As in the first example, we choose to run `processTSS` and `determineTSR` in parallel on 4 cores. Please adjust this parameter as needed.
```{r eval=FALSE}
#Constructing the tag count per TSS data matrix:
CAGEhuman <- processTSS(experimentName=CAGEhuman, n.cores=4, tssSet="all", /
writeTable=TRUE)
```
```{r eval=FALSE}
#Finding TSRs for the replicate datasets:
CAGEhuman <- determineTSR(experimentName=CAGEhuman, n.cores=4, /
tsrSetType="replicates", tssSet="all", tagCountThreshold=25, /
clustDist=20, writeTable=TRUE)
```
Now we merge data from replicates into their two corresponding samples.
```{r eval=FALSE}
#Merging TSS data from the replicates:
CAGEhuman <- mergeSampleData(experimentName=CAGEhuman)
```
Once this is complete, we can complete TSR identification on the merged samples.
```{r eval=FALSE}
#Finding TSRs for the merged samples and adding tag counts:
CAGEhuman <- determineTSR(experimentName=CAGEhuman, n.cores=4, tsrSetType="merged", /
tssSet="all", tagCountThreshold=40, clustDist=20, writeTable=TRUE)
CAGEhuman <- addTagCountsToTSR(experimentName=CAGEhuman, tsrSetType="merged", /
tsrSet=3, tagCountThreshold=40, writeTable=TRUE)
```
Now we need to import the annotation file and attach it to our tssObj S4 object. To do this, we will to use the same record (Gencode v. 23) that we referred to in Example 1.
```{r eval=FALSE}
CAGEhuman <- importAnnotationHub(experimentName=CAGEhuman, provider="gencode", /
annotType="gff3", species="human", annotID="AH49555")
```
Next we associate the gene annotation to the TSRs within our two merged samples i) MCF7 cells and ii) A549 cells.
As we did in Example 1, we select the feature 'transcript' from the Gencode annotation.
```{r eval=FALSE}
CAGEhuman <- addAnnotationToTSR(experimentName=CAGEhuman, tsrSetType="merged", tsrSet=1,/
upstreamDist=1000, downstreamDist=200, feature="transcript", featureColumnID="ID", /
writeTable=TRUE) #A549 cells
CAGEhuman <- addAnnotationToTSR(experimentName=CAGEhuman, tsrSetType="merged", tsrSet=2, /
upstreamDist=1000, downstreamDist=200, feature="transcript", featureColumnID="ID", /
writeTable=TRUE) #MCF7 cells
```
Associating the selected annotation features with the TSRs on the 'combined' slot:
```{r eval=FALSE}
CAGEhuman <- addAnnotationToTSR(experimentName=CAGEhuman, tsrSetType="merged", tsrSet=3,/
upstreamDist=1000, downstreamDist=200, feature="transcript", featureColumnID="ID",/
writeTable=TRUE)
```
Using the accessor methods we applied in earlier examples, let's take a quick glance at our set of identified TSRs.
```{r eval=FALSE}
getTSRdata(CAGEhuman, slotType="merged", slot=1) -> MCF7.tsrs
dim(MCF7.tsrs)
```
```{r eval=FALSE}
getTSRdata(CAGEhuman, slotType="merged", slot=2) -> A549.tsrs
dim(A549.tsrs)
```
```{r eval=FALSE}
getTSRdata(CAGEhuman, slotType="merged", slot=3) -> CAGEhuman.tsrs
dim(CAGEhuman.tsrs)
```
Let's look at some of the tsrs we identified on our 'combined' set.
```{r eval=FALSE}
head(CAGEhuman.tsrs)
```
We can optionally save the tssObject for future use:
```{r eval=FALSE}
save(CAGEhuman, file="CAGEhuman-vignette.RData")
```