monocle3-example.Rmd

In order to run monocle3 a few steps must first be followed. Monocle3 requires R version 3.5 or higher, bioconductor 3.5 or higher and monocle3 0.1.0 to run the latest features.
1) Installing bioconductor:

```{r message=FALSE, warning=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()
```

A few bioconductor dependencies are not automatically installed so you should install the,.
2) Installing bioconductor dependencies:

```{r message=FALSE, warning=FALSE}
BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats',
                   'limma', 'S4Vectors', 'SingleCellExperiment',
                   'SummarizedExperiment', 'batchelor'))
```

In order to install monocle3 a few preparations must be made. First devtools must be installed into your R environment via the R console.
3) Installing devtools:

```{r message=FALSE, warning=FALSE}
install.packages('devtools')
```

If your rlang version is out of data this can cause a failure when trying to install monocle3 and garnett later on.
4) Updating rlang

```{r message=FALSE, warning=FALSE}
install.packages('rlang')
library(rlang)
```

Next your must install libudunits2-dev in your terminal. 
5) Installing libudunits2-dev:
    sudo apt-get update    
    sudo apt-get install libudunits2-dev
Finally libgdal-dev must also be installed on your terminal.
6) Installing libgdal-dev:
    sudo apt-get update    
    sudo apt-get install --fix-missing libgdal-dev
Now we can finally installmonocle3.

7)Installing monocle3:
```{r message=FALSE, warning=FALSE}
devtools::install_github('cole-trapnell-lab/monocle3')
```

8) Installing garnett:
```{r message=FALSE, warning=FALSE}
BiocManager::install('org.Mm.eg.db')
BiocManager::install('org.Hs.eg.db')
devtools::install_github("cole-trapnell-lab/garnett", ref = 'monocle3')
```

Run the following code chunk to check everything has been installed correctly, and to load all relevant packages for this tutorial.
```{r message=FALSE, warning=FALSE}
library(monocle3)
library(shiny)
library(ggplot2)
library(dplyr)
library(garnett)
library(Matrix)
library(irlba)
```

#Clustering and classifying cells

The first thing we must do is read cellranger data into R. In order for the to correctly be done you must first ensure the data is in the right folder system. The folder system used depends on whether you have 10X version 2 or 10X version 3 data. Incase your aren't sure which version of 10X data you have: v2 barcode and gene files end .tsv and matrix file ends .mtx while v3 barcode and gene files end .tsv.gz and matrix file ends .mtx.gz. 
7) a) If 10X version 2 the folder system is:
      ~/data/monocle-data/cellranger-data/outs/filtered_gene_bc_matrices/genome(e.g 'hg19' or 'mm10' etc depending on genome used)/ 
    b) If 10X version 3 the folder system is:
      ~/data/monocle-data/cellranger-data/outs/filtered_feature_bc_matrix/genome(e.g 'hg19' or 'mm10' etc depending on genome used)/
Next we preprocess the data. Firstly we state how may principle components (PC) we want monocle to use, this can be changed by changing the value num_dimis equal to. Then we plot the data set to see if increasing the PC number would capture significantly more variance. The elbow plot produced will then inform you whether increasing PC number will capture a significant amount of more variance. When visualising the data you can use t-SNE or UMAP, UMAP is used by default as the people at monocle believe it is both faster and more suited to clustering. When doing gene expression analysis it is important to remove batch effect. You should always check for batch effect when performing dimensionality reduction, batch effect is reduced by the preprocess_cds function. When order_cells() runs it will try to load a shiny web page but will fail. That is expected. Go to your Jupyter terminal and run ngrok http PORT (where PORT is the 4 digit number at the end of the http message ran by shiny). This will produce multiple lines of code, copy the http message on the penultimate line (should start with the word 'fowarding'). Select from 'http' to before the arrow '->'. Enter this web address into a new tab and the then select the node on the graph you wish to use as a basis. You can add umap.fast_sgd=TRUE to the reduce dimensions command to speed it up but will produce slightly different output each time it runs.

```{r}
# Load RDS data
expression_matrix <- readRDS("../data/monocle-data/cao_l2_expression.rds")
cell_metadata <- readRDS("../data/monocle-data/cao_l2_colData.rds")
gene_annotation <- readRDS("../data/monocle-data/cao_l2_rowData.rds")

# Make the cell dataset object from the RDS data
cds <- new_cell_data_set(expression_matrix,
                                   cell_metadata = cell_metadata,
                                   gene_metadata = gene_annotation)
```

```{r}
# Reading in cellranger data
cdr <- load_cellranger_data("/home/jovyan/data/monocle-data/cellranger-data")
```

```{r}
# Principle component analysis
cds = preprocess_cds(cds, num_dim = 100)
```

```{r}
# Checking if 100 principle components captures enough variance
plot_pc_variance_explained(cds)
```

```{r}
# Reducing dimensionality for visualisation of the cells
# can add umap.fast_sgd=TRUE to the reduce dimensions command to speed it up 
# but will produce slightly different output each time its run
cds = reduce_dimension(cds)
```

```{r}
# Clustering cells using community detection
cds = cluster_cells(cds, resolution=c(10^seq(-6,-1)))
```

```{r}
# Plotting the data colouring by partition
plot_cells(cds, color_cells_by="partition", group_cells_by="partition")
```

The data frame marker_test_res contains metrics which tell us how differently expressed each gene is in each partition. You can rank the cells by one or more specifity metrics and take the top genes for each cluster, the way in which is do this is pseudo_R2. The default number of marker genes in this alogrithm is 3 but if you want to change this edit the 3 in top_n(3, pseudo_R2) to however many marker genes you want. Then we can plot the expression and fraction of cells that express each marker in each group usng plot_genes_by_group(). 

```{r}
pheatmap::pheatmap(log(table(clusters(cds), colData(cds)$cao_cell_type)+1),
                   clustering_method="ward.D2",
                   fontsize=6)
```

```{r  results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
marker_test_res = top_markers(cds, group_cells_by="partition", reference_cells=1000, cores=8)
```

```{r}
# Find marker genes expressed by each cluster
top_specific_markers = marker_test_res %>%
    filter(fraction_expressing >= 0.10) %>%
    group_by(cell_group) %>%
    top_n(3, pseudo_R2)

top_specific_marker_ids = unique(top_specific_markers %>% pull(gene_id))
```

```{r message=FALSE, warning=FALSE}
# Plotting marker genes
plot_genes_by_group(cds,
                    top_specific_marker_ids,
                    group_cells_by="partition",
                    ordering_type="cluster_row_col",
                    max.size=3)
```

Monocle3 has developed some software which automates the the annotating of cells called Garnett. Garnett classifies cells based on marker genes. Firstly we must fnd the top marker that each annotated cell type expresses using top_markers(). Next we filter these markers based on the conditions required (in this example it is JS specificity > 0.5 and be significant in logistic test marker_test_q_value < 0.01). We then generate a marker file called ./marker_file.txt which contains each cell type and top 5 expressed markers.

```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
# Annotate your cells according to type
assigned_type_marker_test_res = top_markers(cds,
                                            group_cells_by="cluster",
                                            reference_cells=1000,
                                            cores=8)
```

```{r}
# Require that markers have at least JS specificty score > 0.5 and be significant
# in the logistic test for identifying their cell type:
garnett_markers = assigned_type_marker_test_res %>%
  filter(marker_test_q_value < 0.01 & specificity >= 0.5) %>%
  group_by(cell_group) %>%
  top_n(5, marker_score)
# Exclude genes that are good markers for more than one cell type:
garnett_markers = garnett_markers %>% group_by(gene_short_name) %>%
  filter(n() == 1)
generate_garnett_marker_file(garnett_markers, file="./marker_file.txt")
```

```{r message=FALSE, warning=FALSE}
plot_cells(cds,
           group_cells_by="cluster",
           color_cells_by="cluster")
```

#Constructing Single Cell Trajectories

Loading a different data set in for example purposes. If you are usin  same data set as earlier  the first 2 steps can be missed (assuming data has already been downloaded and preprocessed)

```{r}
# Loading more RDS data
expression_matrix2 = readRDS("../data/monocle-data/packer_embryo_expression.rds")
cell_metadata2 = readRDS("../data/monocle-data/packer_embryo_colData.rds")
gene_annotation2 = readRDS("../data/monocle-data/packer_embryo_rowData.rds")
```

```{r}
cdt <- new_cell_data_set(expression_matrix2,
                         cell_metadata = cell_metadata2,
                         gene_metadata = gene_annotation2)
```

```{r message=FALSE, warning=FALSE} 
cdt <- preprocess_cds(cdt, num_dim = 100, residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading + bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading")

cdt <- reduce_dimension(cdt)
```

```{r}
cdt <- cluster_cells(cdt, reduction_method = 'UMAP')
```

A principal  graph is fitted within each partition in the next component.

```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
cdt <- learn_graph(cdt)
```

The next part is creating a list of genes to try and identify in the trajectory of the sample. If you have different genes you want to check then switch out the names in the ciliated genes list and add or remove genes as required.

```{r}
ciliated_genes = c("che-1",
                   "hlh-17",
                   "nhr-6",
                   "dmd-6",
                   "ceh-36",
                   "ham-1")
```

The following is plotting the trajectory multiple times, one for each gene in ciliated genes. And identifying where on the trajectory of cells this gene is expressed.

```{r}
# Visualise how individual genes vary along the trajectory
plot_cells(cdt,
           genes=ciliated_genes,
           label_cell_groups=FALSE,
           show_trajectory_graph=FALSE)
```

Now you have to select which cells you want as you starting point. A pop up will appear with options "trust browser" or cancel. Click "try again". Then the graph should appear and you can select your starting cells.

```{r}
# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
  cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)
  
  closest_vertex <-
  cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  root_pr_nodes <-
  igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
  (which.max(table(closest_vertex[cell_ids,]))))]
  
  root_pr_nodes
}
```


```{r}
cdt = order_cells(cdt, root_pr_nodes=get_earliest_principal_node(cdt))
```

```{r}
# Visualising pseudotime on the graph when manually selecting the root principal points
plot_cells(cdt,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)
```

```{r}
# A helper function to identify the root principal points:
get_earliest_principal_node <- function(cdt, time_bin="130-170"){
  cell_ids <- which(colData(cdt)[, "embryo.time.bin"] == time_bin)

  closest_vertex <-
    cdt@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cdt), ])
  root_pr_nodes <-
    igraph::V(principal_graph(cdt)[["UMAP"]])$name[as.numeric(names
      (which.max(table(closest_vertex[cell_ids,]))))]

  root_pr_nodes
}
cdt = order_cells(cdt, root_pr_nodes=get_earliest_principal_node(cdt))
```

```{r}
# Visualising pseudotime on the graph when the helper function decideds the root principal points
plot_cells(cdt,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)
```

```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
# Processing data for a 3D plot
cds_3d = reduce_dimension(cdt, max_components = 3)
cds_3d = cluster_cells(cds_3d)
cds_3d = learn_graph(cds_3d)
cds_3d = order_cells(cds_3d, root_pr_nodes=get_earliest_principal_node(cdt))
```

```{r message=FALSE, warning=FALSE}
# 3D visualisation of data
plot_cells_3d(cds_3d, color_cells_by="pseudotime")
```

#Differential expression analysis

```{r}
ciliated_genes = c("che-1",
                   "hlh-17",
                   "nhr-6",
                   "dmd-6",
                   "ceh-36",
                   "ham-1")
cds_subset = cdt[rowData(cdt)$gene_short_name %in% ciliated_genes,]

gene_fits = fit_models(cds_subset, model_formula_str = "~embryo.time")

fit_coefs = coefficient_table(gene_fits)

emb_time_terms = fit_coefs %>% filter(term == "embryo.time")

emb_time_terms %>% filter (q_value < 0.05) %>%
  select(gene_short_name, term, q_value, estimate)
```

```{r message=FALSE, warning=FALSE}
plot_genes_violin(cds_subset, group_cells_by="embryo.time.bin", ncol=2) +
    theme(axis.text.x=element_text(angle=45, hjust=1))
```

```{r}
gene_fits = fit_models(cds_subset, model_formula_str = "~embryo.time + batch")
fit_coefs = coefficient_table(gene_fits)
fit_coefs %>% filter(term != "(Intercept)") %>%
  select(gene_short_name, term, q_value, estimate)
```

```{r message=FALSE, warning=FALSE}
evaluate_fits(gene_fits)
```

```{r}
time_batch_models = fit_models(cds_subset,
                               model_formula_str = "~embryo.time + batch",
                               expression_family="negbinomial")
time_models = fit_models(cds_subset,
                        model_formula_str = "~embryo.time",
                        expression_family="negbinomial")
compare_models(time_batch_models, time_models) %>% select(gene_short_name, q_value)
```
  
```{r}
# reload the data
expression_matrix3 <- readRDS("../data/monocle-data/cao_l2_expression.rds")
cell_metadata3 <- readRDS("../data/monocle-data/cao_l2_colData.rds")
gene_annotation3 <- readRDS("../data/monocle-data/cao_l2_rowData.rds")
```

```{r}
# Make the CDS object
cdn <- new_cell_data_set(expression_matrix3,
                         cell_metadata = cell_metadata3,
                         gene_metadata = gene_annotation3)
```

```{r results = FALSE, echo=FALSE, message=FALSE, warning=FALSE}
cdn <- preprocess_cds(cdn)
cdn <- reduce_dimension(cdn, reduction_method = 'UMAP')
cdn <- cluster_cells(cdn)
cdn<- learn_graph(cdn)
```

```{r}
neurons_cds <- cdn[,grepl("neurons", colData(cdn)$assigned_cell_type, ignore.case=TRUE)]
plot_cells(neurons_cds)
```

```{r message=FALSE, warning=FALSE}
plot_cells(cdt,
           color_cells_by = "cell.type",
           label_groups_by_cluster=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE)
```

```{r}
plot_cells(cdt, genes=c("hlh-4", "gcy-8", "dac-1", "oig-8"),
           show_trajectory_graph=FALSE,
           label_cell_groups=FALSE,
           label_leaves=FALSE)
```

```{r}
plot_cells(cdt, show_trajectory_graph=FALSE)
```