-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME.Rmd
561 lines (417 loc) · 21 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
---
title: "crisprScore: on-target and off-target scoring for CRISPR gRNAs"
output:
github_document:
toc: true
bibliography: vignettes/references.bib
---
```{r, echo=FALSE, results="hide"}
options("knitr.graphics.auto_pdf"=TRUE)
```
Authors: Jean-Philippe Fortin, Aaron Lun, Luke Hoberecht, Pirunthan Perampalan
Date: July 25, 2024
# Overview
The `crisprScore` package provides R wrappers of several on-target and off-target scoring
methods for CRISPR guide RNAs (gRNAs). The following nucleases are supported:
SpCas9, AsCas12a, enAsCas12a, and RfxCas13d (CasRx). The available on-target
cutting efficiency scoring methods are RuleSet1, RuleSet3, Azimuth, DeepHF,
DeepSpCas9, DeepCpf1, enPAM+GB, CRISPRscan and CRISPRater. Both the CFD and MIT
scoring methods are available for off-target specificity prediction. The
package also provides a Lindel-derived score to predict the probability
of a gRNA to produce indels inducing a frameshift for the Cas9 nuclease.
Note that DeepHF, DeepCpf1 and enPAM+GB are not available on Windows machines.
Our work is described in our recent papera recent bioRxiv preprint:
["A comprehensive Bioconductor ecosystem for the design of CRISPR guide RNAs across nucleases and technologies"](https://www.nature.com/articles/s41467-022-34320-7)
Our main gRNA design package [crisprDesign](https://github.com/crisprVerse/crisprDesign) utilizes the `crisprScore` package to add on- and off-target scores to user-designed gRNAs; check out our [Cas9 gRNA tutorial page](https://github.com/crisprVerse/Tutorials/tree/master/Design_CRISPRko_Cas9) to learn how to use `crisprScore` via `crisprDesign`.
# Installation and getting started
## Software requirements
### OS Requirements
This package is supported for macOS, Linux and Windows machines.
Some functionalities are not supported for Windows machines.
Packages require R version 4.4 or higher.
## Installation
`crisprScore` can be installed from Bioconductor using the following commands in a fresh R session:
```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version="devel")
BiocManager::install("crisprScore")
```
To install the release version, use this instead:
```{r, eval=FALSE}
BiocManager::install(version="release")
```
Under the hood, `crisprScore` uses the `basilisk` package to install Python
environments needed to run some of the prediction algorithms. When calling
one of the scoring methods for the first time after package
installation, the underlying python module and conda environments will be
automatically downloaded and installed without the need for user intervention.
This may take several minutes, but this is a one-time installation.
We made recent changes to install Python packages only from the public
Bioconda and conda-forge channels to be compliant with the latest
Anaconda licensing policies. Setting the environment variable `BASILISK_USE_MINIFORGE=1` will ensure that crisprScore will use miniforge
to install the Python packages. See the documentation of the `basilisk` package
[(link here)](https://www.bioconductor.org/packages/release/bioc/html/basilisk.html) for more information re. different settings for the installation of the
underlying Python packages
Note that RStudio users will need to add the following line to their `.Rprofile`
file in order for `crisprScore` to work properly:
```{r, eval=FALSE}
options(reticulate.useImportHook=FALSE)
```
# Getting started
We load `crisprScore` in the usual way:
```{r, warnings=FALSE, message=FALSE}
library(crisprScore)
```
The `scoringMethodsInfo` data.frame contains a succinct summary of scoring
methods available in `crisprScore`:
```{r}
data(scoringMethodsInfo)
print(scoringMethodsInfo)
```
Each scoring algorithm requires a different contextual nucleotide sequence.
The `left` and `right` columns indicates how many nucleotides upstream
and downstream of the first nucleotide of the PAM sequence are needed for
input, and the `len` column indicates the total number of nucleotides needed
for input. The `crisprDesign` [(GitHub link)](https://github.com/Jfortin1/crisprDesign)
package provides user-friendly functionalities to extract and score those
sequences automatically via the `addOnTargetScores` function.
# On-targeting efficiency scores
Predicting on-target cutting efficiency is an extensive area of research, and
we try to provide in `crisprScore` the latest state-of-the-art algorithms as
they become available.
## Cas9 methods
Different algorithms require different input nucleotide
sequences to predict cutting efficiency as illustrated in the figure below.
```{r, echo=FALSE,fig.cap="Sequence inputs for Cas9 scoring methods"}
knitr::include_graphics("vignettes/figures/sequences_cas9.svg")
```
### Rule Set 1
The Rule Set 1 algorithm is one of the first on-target efficiency methods
developed for the Cas9 nuclease [@ruleset1]. It generates a probability
(therefore a score between 0 and 1) that a given sgRNA will cut at its
intended target. 4 nucleotides upstream and 3 nucleotides downstream of
the PAM sequence are needed for scoring:
```{r, eval=TRUE}
flank5 <- "ACCT" #4bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
flank3 <- "TTG" #3bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getRuleSet1Scores(input)
```
The Azimuth score described below is an improvement over Rule Set 1
from the same lab.
### Azimuth
The Azimuth algorithm is an improved version of the popular Rule Set 2 score for
the Cas9 nuclease [@azimuth]. It generates a probability (therefore a score
between 0 and 1) that a given sgRNA will cut at its intended target.
4 nucleotides upstream and 3 nucleotides downstream of the PAM
sequence are needed for scoring:
```{r, eval=FALSE}
flank5 <- "ACCT" #4bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
flank3 <- "TTG" #3bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getAzimuthScores(input)
```
### Rule Set 3
The Rule Set 3 is an improvement over Rule Set 1 and Rule Set 2/Azimuth
developed for the SpCas9 nuclease, taking into account the type of
tracrRNAs [@ruleset3]. Two types of tracrRNAs are currently offered:
```
GTTTTAGAGCTA-----GAAA-----TAGCAAGTTAAAAT... --> Hsu2013 tracrRNA
GTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAAT... --> Chen2013 tracrRNA
```
Similar to Rule Set 1 and Azimuth, the input sequence requires 4 nucleotides
upstream of the protospacer sequence, the protospacer sequence itself
(20nt spacersequence and PAM sequence), and 3 nucleotides downstream of
the PAM sequence:
```{r, eval=FALSE}
flank5 <- "ACCT" #4bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
flank3 <- "TTG" #3bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getRuleSet3Scores(input, tracrRNA="Hsu2013")
```
A more involved version of the algorithm takes into account gene context of
the target protospacer sequence (Rule Set 3 Target) and will be soon
implemented in `crisprScore`.
### DeepHF
The DeepHF algorithm is an on-target cutting efficiency prediction algorithm for
several variants of the Cas9 nuclease [@deepcas9] using a recurrent neural
network (RNN) framework. Similar to the Azimuth score, it generates a
probability of cutting at the intended on-target. The algorithm only needs
the protospacer and PAM sequences as inputs:
```{r, eval=FALSE}
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
input <- paste0(spacer, pam)
results <- getDeepHFScores(input)
```
Users can specify for which Cas9 they wish to score sgRNAs by using the argument
`enzyme`: "WT" for Wildtype Cas9 (WT-SpCas9), "HF" for high-fidelity Cas9
(SpCas9-HF), or "ESP" for enhancedCas9 (eSpCas9). For wildtype Cas9, users can
also specify the promoter used for expressing sgRNAs using the argument
`promoter` ("U6" by default). See `?getDeepHFScores` for more details.
### DeepSpCas9
The DeepSpCas9 algorithm is an on-target cutting efficiency prediction
algorithm for the SpCas9 nuclease [@deepspcas9]. Similar to the Azimuth score,
it generates a probability of cutting at the intended on-target.
4 nucleotides upstream of the protospacer sequence, and 3 nucleotides
downstream of the PAM sequence are needed in top of the protospacer
sequence for scoring:
```{r, eval=FALSE}
flank5 <- "ACCT" #4bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
flank3 <- "TTG" #3bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getDeepSpCas9Scores(input)
```
```{r, eval=FALSE}
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
input <- paste0(spacer, pam)
results <- getDeepHFScores(input)
```
Users can specify for which Cas9 they wish to score sgRNAs by using the argument
`enzyme`: "WT" for Wildtype Cas9 (WT-SpCas9), "HF" for high-fidelity Cas9
(SpCas9-HF), or "ESP" for enhancedCas9 (eSpCas9). For wildtype Cas9, users can
also specify the promoter used for expressing sgRNAs using the argument
`promoter` ("U6" by default). See `?getDeepHFScores` for more details.
### CRISPRscan
The CRISPRscan algorithm, also known as the Moreno-Mateos score, is an
on-target efficiency method for the SpCas9 nuclease developed for sgRNAs
expressed from a T7 promoter, and trained on zebrafish data [@crisprscan].
It generates a probability (therefore a score between 0 and 1) that a given
sgRNA will cut at its intended target.
6 nucleotides upstream of the protospacer sequence
and 6 nucleotides downstream of the PAM sequence are needed for scoring:
```{r, eval=TRUE}
flank5 <- "ACCTAA" #6bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam <- "AGG" #3bp
flank3 <- "TTGAAT" #6bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getCRISPRscanScores(input)
```
### CRISPRater
The CRISPRater algorithm is an on-target efficiency method for the SpCas9 nuclease [@crisprater].
It generates a probability (therefore a score between 0 and 1) that a given
sgRNA will cut at its intended target.
Only the 20bp spacer sequence is required.
```{r, eval=TRUE}
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
results <- getCRISPRaterScores(spacer)
```
### CRISPRai
The CRISPRai algorithm was developed by the Weissman lab to score SpCas9
gRNAs for CRISPRa and CRISPRi applications [@crisprai], for the human genome.
The function `getCrispraiScores` requires several inputs.
First, it requires a data.frame specifying the genomic coordinates of
the transcription starting sites (TSSs). An example of such a data.frame
is provided in the crisprScore package:
```{r, eval=TRUE}
head(tssExampleCrispri)
```
It also requires a data.frame specifying the genomic coordinates of the
gRNA sequences to score. An example of such a data.frame
is provided in the crisprScore package:
```{r, eval=TRUE}
head(sgrnaExampleCrispri)
```
All columns present in `tssExampleCrispri` and `sgrnaExampleCrispri` are
mandatory for `getCrispraiScores` to work.
Two additional arguments are required: `fastaFile`, to specify the path of
the fasta file of the human reference genome, and `chromatinFiles`, which is
a list of length 3 specifying the path of files containing the chromatin
accessibility data needed for the algorithm in hg38 coordinates.
The chromatin files can be downloaded from Zenodo
[here](https://zenodo.org/record/6716721#.YrzCfS-cY4d).
The fasta file for the human genome (hg38) can be downloaded directly from here:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
One can obtain the CRISPRai scores using the following command:
```{r, eval=FALSE}
results <- getCrispraiScores(tss_df=tssExampleCrispri,
sgrna_df=sgrnaExampleCrispri,
modality="CRISPRi",
fastaFile="your/path/hg38.fa",
chromatinFiles=list(mnase="path/to/mnaseFile.bw",
dnase="path/to/dnaseFile.bw",
faire="oath/to/faireFile.bw"))
```
The function works identically for CRISPRa applications, with modality replaced
by `CRISPRa`.
## Cas12a methods
Different algorithms require different input nucleotide
sequences to predict cutting efficiency as illustrated in the figure below.
```{r, echo=FALSE, fig.cap="Sequence inputs for Cas12a scoring methods"}
knitr::include_graphics("vignettes/figures/sequences_cas12a.svg")
```
### DeepCpf1 score
The DeepCpf1 algorithm is an on-target cutting efficiency prediction algorithm
for the Cas12a nuclease [@deepcpf1] using a convolutional neural network (CNN)
framework. It generates a score between 0 and 1 to quantify the likelihood of
Cas12a to cut for a given sgRNA. 3 nucleotides upstream and 4 nucleotides
downstream of the PAM sequence are needed for scoring:
```{r, eval=FALSE}
flank5 <- "ACC" #3bp
pam <- "TTTT" #4bp
spacer <- "AATCGATGCTGATGCTAGATATT" #23bp
flank3 <- "AAGT" #4bp
input <- paste0(flank5, pam, spacer, flank3)
results <- getDeepCpf1Scores(input)
```
### enPAM+GB score
The enPAM+GB algorithm is an on-target cutting efficiency prediction algorithm
for the enhanced Cas12a (enCas12a) nuclease [@enpamgb] using a gradient-booster
(GB) model. The enCas12a nuclease as an extended set of active PAM sequences in
comparison to the wildtype Cas12 nuclease [@encas12a], and the enPAM+GB
algorithm takes PAM activity into account in the calculation of the final score.
It generates a probability (therefore a score between 0 and 1) of a given sgRNA
to cut at the intended target. 3 nucleotides upstream of the PAM sequence and 4 nucleotides downstream of the protospacer sequence are needed for scoring:
```{r, eval=FALSE}
flank5 <- "ACC" #3bp
pam <- "TTTT" #4bp
spacer <- "AATCGATGCTGATGCTAGATATT" #23bp
flank3 <- "AAGT" #4bp
input <- paste0(flank5, pam, spacer, flank3)
results <- getEnPAMGBScores(input)
```
## Cas13d methods
### CasRxRF
The CasRxRF method was developed to characterize on-target efficiency of the RNA-targeting nuclease RfxCas13d, abbreviated as CasRx [@casrxrf].
It requires as an input the mRNA sequence targeted by the gRNAs, and returns as an output on-target efficiency scores for all gRNAs targeting the mRNA sequence.
As an example, we predict on-target efficiency for gRNAs targeting the mRNA sequence stored in the file `test.fa`:
```{r, eval=FALSE}
fasta <- file.path(system.file(package="crisprScore"),
"casrxrf/test.fa")
mrnaSequence <- Biostrings::readDNAStringSet(filepath=fasta
format="fasta",
use.names=TRUE)
results <- getCasRxRFScores(mrnaSequence)
```
Note that the function has a default argument `directRepeat` set to `aacccctaccaactggtcggggtttgaaac`, specifying the direct repeat used in the
CasRx construct (see [@casrxrf].) The function also has an argument `binaries`
that specifies the file path of the binaries for three
programs necessary by the CasRxRF algorithm:
- `RNAfold`: available as part of the ViennaRNA package
- `RNAplfold`: available as part of the ViennaRNA package
- `RNAhybrid`: available as part of the RNAhybrid package
Those programs can be installed from their respective websites: [VienneRNA](https://www.tbi.univie.ac.at/RNA/) and [RNAhybrid](https://bibiserv.cebitec.uni-bielefeld.de/rnahybrid/).
If the argument is `NULL`, the binaries are assumed to be available on
the PATH.
# Off-target specificity scores
For CRISPR knockout systems, off-targeting effects can occur when the CRISPR
nuclease tolerates some levels of imperfect complementarity between gRNA spacer
sequences and protospacer sequences of the targeted genome. Generally, a greater
number of mismatches between spacer and protospacer sequences decreases the
likelihood of cleavage by a nuclease, but the nature of the nucleotide
substitution can module the likelihood as well. Several off-target specificity
scores were developed to predict the likelihood of a nuclease to cut at an
unintended off-target site given a position-specific set of nucleotide
mismatches.
We provide in `crisprScore` two popular off-target specificity scoring
methods for CRISPR/Cas9 knockout systems: the MIT score [@mit] and the
cutting frequency determination (CFD) score [@azimuth].
## MIT score
The MIT score was an early off-target specificity prediction algorithm developed
for the CRISPR/Cas9 system [@mit]. It predicts the likelihood that the Cas9
nuclease will cut at an off-target site using position-specific mismatch
tolerance weights. It also takes into consideration the total number of
mismatches, as well as the average distance between mismatches.
However, it does not take into account the nature of the nucleotide
substitutions. The exact formula used to estimate the cutting likelihood is
$$\text{MIT} = \biggl(\prod_{p \in
M}{w_p}\biggr)\times\frac{1}{\frac{19-d}{19}\times4+1}\times\frac{1}{m^2}$$
where $M$ is the set of positions for which there is a mismatch between the
sgRNA spacer sequence and the off-target sequence, $w_p$ is an
experimentally-derived mismatch tolerance weight at position $p$, $d$ is the
average distance between mismatches, and $m$ is the total number
of mismatches. As the number of mismatches increases, the cutting
likelihood decreases. In addition, off-targets with more adjacent mismatches
will have a lower cutting likelihood.
The `getMITScores` function takes as argument a character vector of 20bp
sequences specifying the spacer sequences of sgRNAs (`spacers` argument), as
well as a vector of 20bp sequences representing the protospacer sequences of the putative off-targets in the targeted
genome (`protospacers` argument). PAM sequences (`pams`) must also be provided. If only one spacer sequence is provided,
it will reused for all provided protospacers.
The following code will generate MIT scores for 3 off-targets with respect to
the sgRNA `ATCGATGCTGATGCTAGATA`:
```{r}
spacer <- "ATCGATGCTGATGCTAGATA"
protospacers <- c("ACCGATGCTGATGCTAGATA",
"ATCGATGCTGATGCTAGATT",
"ATCGATGCTGATGCTAGATA")
pams <- c("AGG", "AGG", "AGA")
getMITScores(spacers=spacer,
protospacers=protospacers,
pams=pams)
```
## CFD score
The CFD off-target specificity prediction algorithm was initially developed for
the CRISPR/Cas9 system, and was shown to be superior to the MIT score
[@azimuth]. Unlike the MIT score, position-specific mismatch weights vary
according to the nature of the nucleotide substitution (e.g. an A->G mismatch at
position 15 has a different weight than an A->T mismatch at position 15).
Similar to the `getMITScores` function, the `getCFDScores` function takes as
argument a character vector of 20bp sequences specifying the spacer sequences of
sgRNAs (`spacers` argument), as well as a vector of 20bp sequences representing
the protospacer sequences of the putative
off-targets in the targeted genome (`protospacers` argument).
`pams` must also be provided.
If only one spacer
sequence is provided, it will be used for all provided protospacers.
The following code will generate CFD scores for 3 off-targets with respect to
the sgRNA `ATCGATGCTGATGCTAGATA`:
```{r}
spacer <- "ATCGATGCTGATGCTAGATA"
protospacers <- c("ACCGATGCTGATGCTAGATA",
"ATCGATGCTGATGCTAGATT",
"ATCGATGCTGATGCTAGATA")
pams <- c("AGG", "AGG", "AGA")
getCFDScores(spacers=spacer,
protospacers=protospacers,
pams=pams)
```
# Indel prediction score
## Lindel score (Cas9)
Non-homologous end-joining (NHEJ) plays an important role in double-strand break
(DSB) repair of DNA. Error patterns of NHEJ can be strongly biased by sequence
context, and several studies have shown that microhomology can be used to
predict indels resulting from CRISPR/Cas9-mediated cleavage. Among other useful
metrics, the frequency of frameshift-causing indels can be estimated for a given
sgRNA.
Lindel [@lindel] is a logistic regression model that was trained to use local
sequence context to predict the distribution of mutational outcomes.
In `crisprScore`, the function `getLindelScores` return the proportion of
"frameshifting" indels estimated by Lindel. By chance, assuming a random
distribution of indel lengths, frameshifting proportions should be roughly
around 0.66. A Lindel score higher than 0.66 indicates a higher than by chance
probability that a sgRNA induces a frameshift mutation.
The Lindel algorithm requires nucleotide context around the protospacer
sequence; the following full sequence is needed:
[13bp upstream flanking sequence][23bp protospacer sequence]
[29bp downstream flanking sequence], for a total of 65bp.
The function `getLindelScores` takes as inputs such 65bp sequences:
```{r, eval=FALSE}
flank5 <- "ACCTTTTAATCGA" #13bp
spacer <- "TGCTGATGCTAGATATTAAG" #20bp
pam <- "TGG" #3bp
flank3 <- "CTTTTAATCGATGCTGATGCTAGATATTA" #29bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getLindelScores(input)
```
# License
The project as a whole is covered by the MIT license. The code for all
underlying Python packages, with their original licenses, can be found in
`inst/python`. We made sure that all licenses are compatible with the MIT
license and to indicate changes that we have made to the original code.
# Reproducibility
```{r}
sessionInfo()
```
# References