CELESTA (CELl typE identification with SpaTiAl information) is an algorithm aiming to perform automate cell type identification for multiplexed in situ imaging data. CELESTA makes use of both protein expressions and cell spatial neighborhood information from segmented imaging data for the cell type identification.
The pre-saved imaging data is taken from reg009 of the published CODEX data Schurch et al. Cell,2020 for illustration purpose.
CreateCelestaObject()
Creates an object running CELESTA. It requires a title to create the project, segmented imaging data file and prior knowledge file for cell-type signature matrix (user-defined).FilterCells()
This step intends to fill out questionable cells due to imaging artifacts, segmentation error etc.PlotExpProb()
This function plots the calculated expression probabilities for each marker included in the user-defined prior cell-type signature matrix. It can be used to visualize and help with setting the thresholds for whether a marker is expressed or not.AssignCells()
This is the main function to assign cell types with an iterative EM algorithm.PlotCellsAnyCombination()
This function can be used to plot the cells with identified cell types with the XY coordinates from segmentation.
You can install the development version of CELESTA
# install.packages("devtools")
devtools::install_github("plevritis/CELESTA")
CELESTA requires dependency on the following R packages: - Rmixmod: for performing Gaussian Mixture Modeling - spdep: for obtaining spatial neighborhood information - zeallot: for R code styling. Provides a %<-% operator to perform multiple, unpacking, and destructuring assignment in R. - ggplot2 reshape2: for plotting
library(CELESTA)
library(Rmixmod)
library(spdep)
library(ggplot2)
library(reshape2)
library(zeallot)
### The pre-saved imaging data is taken from reg009 of the published CODEX data Schurch et al. Cell,2020
### Create CELESTA object. It requires a title for the project.
### It also required the segmented input file and user-defined cell-type signature matrix.
### Please refer to the Inputs session below.
CelestaObj <- CreateCelestaObject(project_title = "project_title",prior_marker_info,imaging_data)
### Filter out questionable cells.
### A cell with every marker having expression probability higher than 0.9 are filtered out.
### And A cell with every marker having expression probability lower than 0.4 are filtered out.
### User can define the thresholds based on inspecting their data.
### **This step is optional.** We suggest starting without running this step to see whether there are many doublets/triplets.
CelestaObj <- FilterCells(CelestaObj,high_marker_threshold=0.9, low_marker_threshold=0.4)
### Assign cell types.
### max_iteration is used to define the maximum iterations allowed in the EM algorithm per round.
### cell_change_threshold is a user-defined ending condition for the EM algorithm.
### For example, 0.01 means that when fewer than 1% of the total number of cells do not change identity, the algorithm will stop.
CelestaObj <- AssignCells(CelestaObj,max_iteration=10,cell_change_threshold=0.01,
high_expression_threshold_anchor=high_marker_threshold_anchor,
low_expression_threshold_anchor=low_marker_threshold_anchor,
high_expression_threshold_index=high_marker_threshold_iteration,
low_expression_threshold_index=low_marker_threshold_iteration)
### After the AssignCells() function, the CELESTA assigned cell types will be stored in the CelestaObj
### in the field called final_cell_type_assignment with each row corresponding to a cell.
### The final_cell_type_assignment has assignment for each round stored in each column, the final
### cell types and the corresponding cell type number corresponding to the cell type specified in
### the cell-type signature matrix (please see Input section below).
### Plot cells with CELESTA assigned cell types.
### The cell_number_to_use corresponds to the defined numbers in the prior cell-type signature matrix.
### For example, 1 corresponds to endothelial cell, 2 corresponds to tumor cell.
### The program will plot the corresponding cell types given in the "cell_number_to_use" parameter.
### To plot the "unknown" cells that are left unassigned by CELESTA, include 0 in the list.
### The default color for unknown cells is gray.
### The size of the cells plotted can be modified by changing the parameter test_size.
PlotCellsAnyCombination(cell_type_assignment_to_plot=CelestaObj@final_cell_type_assignment[,(CelestaObj@total_rounds+1)],
coords = CelestaObj@coords,
prior_info = prior_marker_info,
cell_number_to_use=c(1,2,3),
cell_type_colors=c("yellow","red","blue"),
test_size=1)
### To include unknown cells
PlotCellsAnyCombination(cell_type_assignment_to_plot=CelestaObj@final_cell_type_assignment[,(CelestaObj@total_rounds+1)],
coords = CelestaObj@coords,
prior_info = prior_marker_info,
cell_number_to_use=c(0,1,2,3),cell_type_colors=c("yellow","red","blue"))
### plot expression probability
PlotExpProb(coords=CelestaObj@coords,
marker_exp_prob=CelestaObj@marker_exp_prob,
prior_marker_info = prior_marker_info,
save_plot = TRUE)
CELESTA requires two inputs:
1. Segmented imaging data
:
a
dataframe with rows as the cells, and needs to have (1) two columns
named X and Y to define the XY coordinates of the cells and (2) other
columns having the protein marker expressions for each cell
Below is an example of the segmented imaging file header
2. User-defined cell-type signature matrix
:
(1) The first column
has to contain the cell types to be inferred
(2) The second column
has the lineage information for each cell type. The lineage information
has three numbers connected by “_” (underscore). The first number
indicates round.Cell types with the same lineage level are inferred at
the same round. Increasing number indicates increase cell-type
resolution. For example, immune cells -> CD3+ T cells –> CD4+ T
cells. The third number is a number assigned to the cell type, i.e, cell
type number. The middle number tells the previous lineage cell type
number for the current cell type. For example, the middle number for
CD3+ T cells is 5, because it is a subtype of immune cells which have
cell type number assigned to 5.
(3) Starting from column three,
each column is a protein marker. If the protein marker is known to be
expressed for that cell type, then it is denoted by “1”. If the protein
marker is known to not express for a cell type, then it is denoted by
“0”. If the protein marker is irrelevant or uncertain to express for a
cell type, then it is denoted by “NA”.
(4) More examples of the
user-defined cell-type signature matrix is provided under
folder:data.
Below is an example of cell-type signature matrix based on imaging panel used in Schurch et al. Cell, 2020.
CELESTA outputs: 1. After running AssignCells()
function, CELESTA will
output a .csv file with the cell type assignment to each cell for each
round and the final combined cell types.
2. In addition, users can
access the results in the CELESTA object under the slot
“final_cell_type_assignment”. The anchor cells defined for each round
can be found under the slot “anchor_cell_type_assignment”.
CELESTA can also plot the assigned cells by using the
PlotCellsAnyCombination()
function. An example output image is shown
below: Users can compare the output
with the original images. An example is shown below:
Please note:
CODEX images preprocess with Akoya Biosciences software stitched the
image tiles in a flipped way. So in some cases, for the comparisons, the
image needs to be flipped.
In the AssignCells()
function, it requires four vectors to define the
high and low thresholds for each cell type. The length of the vector
equals to the total number of cell types defined in the cell-type
signature matrix.Examples of the thresholds are provided under the
folder:data.
We would suggest start with the default thresholds and
modify them by comparing the results with the original staining
demonstrated below.
The two vectors are required for defining the
“high_expression_threshold”, one for anchor cells and one for index
cells(non-anchor cells). The thresholds defined how much the marker
expression probability is in order to be considered as expressed. An
example for defining high_expression_threshold is shown below:
You can also specify the threholds using:
CelestaObj <- AssignCells(CelestaObj,max_iteration=10,cell_change_threshold=0.01,
high_expression_threshold_anchor=c(0.7,0.7,0.7,0.7,0.7,0.8,0.9,0.9),
low_expression_threshold_anchor=c(1,1,1,1,1,1,1,1),
high_expression_threshold_index=c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5),
low_expression_threshold_index=c(1,1,1,1,1,1,1,1))
The length of the vectors for the thresholds correspond to the number of cell types. The order of the thresholds correpond to the same order in the defined cell-type signature matrix.
To find the proper threshold, the PlotExpProb()
function can be
applied. Because the segmented data may have some compensation in the
values which are the inputs to CELESTA, the expression probabilities are
calculated based on the segmented data. It’s useful to compare the
expression probabilities with the CODEX staining for each marker.
For example, for endothelial cells, if we plot the expression
probabilities of CD31 (left) and compare with the CD31 staining,
approximately 0.9 and 0.8 would be the right threshold for defining how
much the cell should express CD31. Please note:
It is suggested that
for anchor cells, use a slightly higher threshold than index cells.
Another example, for
tumor cells, if we plot the expression probabilities of Cytokerain
(left) and compare with the Cytokeratin staining, approximately 0.9 and
0.8 would be the right threshold for defining how much the cell should
express Cytokeratin. Please note:
It is suggested that for anchor
cells, use a slightly higher threshold than index cells.
The two vectors are required for defining the “low_marker_threshold”,
one for anchor cells and one for index cells. The thresholds defined how
much the marker expression probability is in order to be considered as
not expressed. Normally 1 is assigned to this value unless there are a
lot of doublets or co-staining in the data. The Low expression
threshold default values in general are robust, and thus we recommend
testing the High expression threshold values.
An example for defining low_marker_threshold is shown below:
If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please use community.rstudio.com.