Skip to content

Code to analyze effects of gene overexpression on the transcriptome of human stem cells

License

Notifications You must be signed in to change notification settings

yanwu2014/SEUSS-Analysis

Repository files navigation

SEUSS-Analysis

Code to analyze effects of gene overexpression on the transcriptome of human stem cells. The study by Parekh, Wu et al, Cell Systems, 2018, can be found here. Raw data files are at Gene Expression Omnibus with the accession: GSE107185 and processed data is available here

Clustering and regression analysis

Both clustering and regression analysis can take quite a while (8+ hours for large datasets), and so are best run from the command line. You will need the perturbLM, swne, and cellMapper packages to run this analysis.

Command line parameters

  • [counts_matrix] is the gene expression counts matrix generated by the 10X Chromium pipeline
  • [genotype_dictionary].csv is the genotype to cells dictionary generated by https://github.com/yanwu2014/genotyping-matrices
  • [output_file].RData is the output RData file that will be generated by the analysis
  • [batch_regression] is whether or to run batch correction, either "T" for true, or "F" for false.

Other important parameters can be found in the scripts themselves. The parameters are set to the values used in the publication

Usage:

Clustering analysis:

Rscript hESC_run_clustering.R [counts_matrix].tsv [genotype_dictionary].csv [output_file].RData [batch_regression]

For clustering analysis with more powerful batch correction, run:

Rscript hESC_manifold_alignment.R [counts_matrix].tsv [genotype_dictionary].csv [output_file].RData [batch_regression]

Regression analysis:

Rscript hESC_run_regression.R [counts_matrix].tsv [genotype_dictionary].csv [output_file].RData [batch_regression]

Offline analysis

The higher level analyses (gene modules, etc.) and the dropout analyses are relatively quick and are best if used with RStudio: https://www.rstudio.com/.

Summarizing differential expression and clustering.

Run the script hESC_summarize_results.R with the file.handle parameter set to the name of the sample to be analyzed. This section will generate a summary table for each sample, as well as cluster enrichment heatmaps, tSNE plots, and differential expression barplots. The hESC_run_regression.R and hESC_run_clustering.R must have been run on the sample first.

Dropout Analyses

All of the dropout analyses in hESC_dropout_analysis.R can be run after generating the summary tables for each sample. This script will generate heatmaps showing fitness change across all three media conditions as calculated from genomic DNA reads, and from scRNA-seq cell counts. It will also generate scatterplots showing the correlation between genomic DNA fitness and scRNA-seq cell counts fitness, and scatterplots showing correlation between genomic DNA fitness and the number of significantly differentially expressed genes for each transcription factor.

Co-perturbation network and gene module analysis

The code in hESC_coregulation_network.R generates and clusters the gene-gene co-perturbation network. It can be run after generating the summary tables for each of the media condition samples.

Orthologous mouse TF enrichment

Run the script hESC_ortholog_comparison.R. Compares the differentially expressed genes from our screen to the screen performed by Nishiyama et al (https://www.ncbi.nlm.nih.gov/pubmed/19796622) by mapping mouse genes to human orthologs and running a GSEA enrichment.

Epithelial-mesenchymal transition (EMT) analysis

Run the script hESC_EMT_analysis.R. Epithelial mesenchymal transition (EMT) analysis. Runs PCA on known EMT genes and creates a PCA plot showing the separation of KLF4 and SNAI2 cells. Also generates a barplot showing the up/downregulation of canonical EMT markers for KLF4 and SNAI2.

Correlate screens to establish reproducibility

Run hESC_dataset_correlation.R to correlate different screens and demonstrate reproducibility. There are 2 replicate screens for each media condition.

MYC mutant & KLF family analysis

Run hESC_klf_myc_analysis.R to analyze the KLF family and MYC mutant screens.

About

Code to analyze effects of gene overexpression on the transcriptome of human stem cells

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages