Code to analyze effects of gene overexpression on the transcriptome of human stem cells. The study by Parekh, Wu et al, Cell Systems, 2018, can be found here. Raw data files are at Gene Expression Omnibus with the accession: GSE107185 and processed data is available here
Both clustering and regression analysis can take quite a while (8+ hours for large datasets), and so are best run from the command line. You will need the perturbLM, swne, and cellMapper packages to run this analysis.
[counts_matrix]
is the gene expression counts matrix generated by the 10X Chromium pipeline[genotype_dictionary].csv
is the genotype to cells dictionary generated by https://github.com/yanwu2014/genotyping-matrices[output_file].RData
is the output RData file that will be generated by the analysis[batch_regression]
is whether or to run batch correction, either "T" for true, or "F" for false.
Other important parameters can be found in the scripts themselves. The parameters are set to the values used in the publication
Clustering analysis:
Rscript hESC_run_clustering.R [counts_matrix].tsv [genotype_dictionary].csv [output_file].RData [batch_regression]
For clustering analysis with more powerful batch correction, run:
Rscript hESC_manifold_alignment.R [counts_matrix].tsv [genotype_dictionary].csv [output_file].RData [batch_regression]
Regression analysis:
Rscript hESC_run_regression.R [counts_matrix].tsv [genotype_dictionary].csv [output_file].RData [batch_regression]
The higher level analyses (gene modules, etc.) and the dropout analyses are relatively quick and are best if used with RStudio: https://www.rstudio.com/.
Run the script hESC_summarize_results.R
with the file.handle
parameter set to the name of the sample to be analyzed. This section will generate a summary table for each sample, as well as cluster enrichment heatmaps, tSNE plots, and differential expression barplots. The hESC_run_regression.R
and hESC_run_clustering.R
must have been run on the sample first.
All of the dropout analyses in hESC_dropout_analysis.R
can be run after generating the summary tables for each sample. This script will generate heatmaps showing fitness change across all three media conditions as calculated from genomic DNA reads, and from scRNA-seq cell counts. It will also generate scatterplots showing the correlation between genomic DNA fitness and scRNA-seq cell counts fitness, and scatterplots showing correlation between genomic DNA fitness and the number of significantly differentially expressed genes for each transcription factor.
The code in hESC_coregulation_network.R
generates and clusters the gene-gene co-perturbation network. It can be run after generating the summary tables for each of the media condition samples.
Run the script hESC_ortholog_comparison.R
. Compares the differentially expressed genes from our screen to the screen performed by Nishiyama et al (https://www.ncbi.nlm.nih.gov/pubmed/19796622) by mapping mouse genes to human orthologs and running a GSEA enrichment.
Run the script hESC_EMT_analysis.R
. Epithelial mesenchymal transition (EMT) analysis. Runs PCA on known EMT genes and creates a PCA plot showing the separation of KLF4 and SNAI2 cells. Also generates a barplot showing the up/downregulation of canonical EMT markers for KLF4 and SNAI2.
Run hESC_dataset_correlation.R
to correlate different screens and demonstrate reproducibility. There are 2 replicate screens for each media condition.
Run hESC_klf_myc_analysis.R
to analyze the KLF family and MYC mutant screens.