Sampling time-dependent artifacts in single-cell genomics studies

This repository contains all the scripts, notebooks and reports to reproduce the scRNA-seq analysis of our paper "Sampling time-dependent artifacts in single-cell genomics studies", published in Genome Biology in 2020. Here, we describe how to access the data, the most important packages and versions used, and how to navigate the directories and files in this repository.

Data

All the raw data (fastqs) and expression matrices are available at the Gene Expression Omnibus (GEO) under GSE132065. The data in this project can be broadly divided into 6 subprojects:

  • Smart-seq2: includes a total of 4 96-well plates, with ids P2568, P2664, P2671 and P2672.
  • 10X scRNA-seq data for Peripheral Blood Mononuclear Cells (PBMC): divided into two batches, which we named "JULIA_03" (cDNA libraries: AH9225 and AH9226, hashtag oligonucleotide (HTO) libraries: AH9223 and AH9224) and "JULIA_04" (cDNA libraries: AI0101 and AI0102, HTO libraries: AI0099 and AI0100).
  • 10X scRNA-seq data for Chronic Lymphocytic Leukemia (CLL) cells: a total of 5 libraries, which are named after a combination of the donor id ("1220", "1472", "1892") and the temperature ("4°C" or room temperature (RT)): 1220_RT, 1472_RT, 1892_RT, 1472_4C and 1892_4C.
  • 10X scRNA-seq data for T-cell activation experiment (see methods): "Tcell_activation_day0_rep1", "Tcell_activation_day2_rep1", "Tcell_activation_day0_rep2" and "Tcell_activation_day1_rep2".
  • 10X scATAC-seq data for PBMC.
  • 10X scATAC-seq data for CLL.

Fastqs

As described in the paper, we multiplexed several sampling times into the same 10X Chip Channel using the cell hashing technology. To map the fastqs to the reference genome and obtain the single-cell gene expression matrices, we followed the "Feature Barcoding Analysis" pipeline from cellranger. This is an example of a cellranger run we used to map one of the libraries:

cellranger count --libraries libraries.csv --feature-ref feature_reference.csv --id 1472_RT --chemistry SC3Pv3 --expect-cells 5000 --localcores 12 --localmem 64 --transcriptome eference/human/refdata-cellranger-GRCh38-3.0.0/;

As you can see, a key input in this command is the feature_reference.csv which, according to 10X, "declares the set of Feature Barcoding reagents in use in the experiment. For each unique Feature Barcode used, this file declares a feature name and identifier, the unique Feature Barcode sequence associated with this reagent, and a pattern indicating how to extract the Feature Barcode sequence from the read sequence". This file can be easily created from the file "GSE132065_conditions_10X.tsv", available both in this GitHub repository and in GEO.
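As a rough illustration of what such a file looks like, the following Python sketch writes a feature reference with the six columns that Cell Ranger expects (id, name, read, pattern, sequence, feature_type). The condition names and barcode sequences are placeholders, not the real hashtags, and the `read` and `pattern` defaults are assumptions that depend on the hashing reagent; the actual values should be taken from "GSE132065_conditions_10X.tsv", whose column layout is not assumed here.

```python
import csv
import io

# Column order expected by Cell Ranger's Feature Barcode reference CSV.
FIELDS = ["id", "name", "read", "pattern", "sequence", "feature_type"]


def build_feature_reference(hashtags, read="R2", pattern="5P(BC)"):
    """Return the text of a feature_reference.csv for a list of hashtags.

    `hashtags` is a list of (condition_name, barcode_sequence) pairs.
    `read` and `pattern` are assumptions; check them against the reagent used.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for name, sequence in hashtags:
        writer.writerow({
            "id": name,
            "name": name,
            "read": read,
            "pattern": pattern,
            "sequence": sequence,
            # Hashtag oligonucleotides are declared as antibody-derived features.
            "feature_type": "Antibody Capture",
        })
    return buffer.getvalue()


# Hypothetical conditions and placeholder barcodes, for illustration only.
example = [("0h", "ACGTACGTACGTACG"), ("8h", "TGCATGCATGCATGC")]
reference_csv = build_feature_reference(example)
print(reference_csv)
```

The returned text can be written to feature_reference.csv and passed to the --feature-ref argument of the command above.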

Expression matrices

A total of 3 files per library are needed to reconstruct the full expression matrix:

  1. barcodes*.tsv.gz: corresponds to the cell barcodes (column names).
  2. features*.tsv.gz: corresponds to the gene/condition identifiers (row names). Moreover, it contains a column that identifies genes ("Gene Expression") and experimental conditions ("Antibody Capture").
  3. matrix*.mtx.gz: expression matrix in sparse format.
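To show how the three files fit together, here is a minimal Python sketch that uses only the standard library. It assumes the usual Cell Ranger layout (tab-separated features with identifier, name and feature type; a Matrix Market coordinate file with 1-based indices and integer counts). The function name `load_10x_matrix` is hypothetical; in practice a dedicated reader such as Seurat's Read10X is the more convenient choice.

```python
import gzip
from collections import defaultdict


def read_lines(path):
    """Return the non-empty lines of a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def load_10x_matrix(barcodes_path, features_path, matrix_path):
    """Rebuild the counts as {feature_type: {(feature_id, barcode): count}}."""
    barcodes = read_lines(barcodes_path)
    # Each features row is: identifier, name, feature type (tab-separated).
    features = [line.split("\t") for line in read_lines(features_path)]

    # Drop Matrix Market header and comment lines, which start with "%".
    body = [line for line in read_lines(matrix_path) if not line.startswith("%")]
    n_rows, n_cols, _n_entries = (int(x) for x in body[0].split())
    if (n_rows, n_cols) != (len(features), len(barcodes)):
        raise ValueError("matrix dimensions do not match features/barcodes")

    counts = defaultdict(dict)
    for entry in body[1:]:
        row, col, value = entry.split()
        # Indices are 1-based: rows are features, columns are cell barcodes.
        feature_id, _name, feature_type = features[int(row) - 1][:3]
        counts[feature_type][(feature_id, barcodes[int(col) - 1])] = int(value)
    return dict(counts)
```

Splitting on the feature type keeps the gene counts ("Gene Expression") apart from the hashtag counts ("Antibody Capture"), which is the separation that demultiplexing relies on.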

To make our data as FAIR (findable, accessible, interoperable, reusable) as possible, we have deposited the gene expression matrices and the Seurat objects that are saved in each of the Rmarkdown notebooks in this Zenodo repository. One can download it in 3 lines of code:

wget https://zenodo.org/record/7308457/files/MassoniBadosa2020_GenomeBiol_scRNAseq_data.zip
unzip MassoniBadosa2020_GenomeBiol_scRNAseq_data.zip
cd MassoniBadosa2020_GenomeBiol_scRNAseq_data

The next step after downloading it should be reading the README.md that is inside the MassoniBadosa2020_GenomeBiol_scRNAseq_data folder.

Package versions

These are the versions of the most important packages used throughout all the analysis:

CRAN:

Bioconductor:

Note: Two months before compiling the notebooks to release them together with the paper, we updated most Bioconductor packages. Thus, some versions reported in the sessionInfo() of the notebooks might be slightly different to the ones used to produce the figures of the article.

File system and name scheme

This repository contains 4 different analysis directories (which correspond to the main blocks of the article) and 1 directory with the scripts to produce the figures of the article:

  • 1-PBMC
  • 2-CLL
  • 3-T_cell_activation
  • 4-Revision
  • figures_scripts

The first 3 have a set of similar notebooks, which match the common pre-processing steps of any single-cell expression matrix:

  1. Demultiplexing: classify each cell to its original condition based on the expression of HTO.
  2. QC and normalization: filter out poor-quality cells and genes and normalize expression counts.
  3. Dimensionality reduction, clustering and annotation of cell types.
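The first two steps can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the method implemented in the notebooks: the demultiplexing rule (top hashtag must dominate the runner-up), the QC rule and all thresholds are hypothetical, and the normalization shown is a common library-size log-normalization. The function names are invented for this example.

```python
import math


def assign_condition(hto_counts, min_total=10, min_ratio=2.0):
    """Classify one cell from its {condition: HTO count} dictionary.

    Hypothetical rule: 'negative' if the cell has too few HTO counts,
    'doublet' if the top hashtag does not clearly dominate the second one.
    Requires at least two hashtags.
    """
    if sum(hto_counts.values()) < min_total:
        return "negative"
    ranked = sorted(hto_counts.items(), key=lambda item: item[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if top < min_ratio * second:
        return "doublet"
    return best


def passes_qc(n_genes, pct_mito, min_genes=200, max_pct_mito=20.0):
    """Keep cells with enough detected genes and low mitochondrial content."""
    return n_genes >= min_genes and pct_mito <= max_pct_mito


def log_normalize(gene_counts, scale=10_000):
    """Scale counts to a fixed library size and apply log(1 + x)."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("cell has no counts; filter it out before normalizing")
    return {gene: math.log1p(count / total * scale)
            for gene, count in gene_counts.items()}


print(assign_condition({"0h": 120, "8h": 3}))   # one hashtag dominates
print(assign_condition({"0h": 60, "8h": 50}))   # two hashtags at similar levels
print(assign_condition({"0h": 2, "8h": 1}))     # too few hashtag counts
```

The thresholds actually used for each library are documented in the notebooks and their reports, together with the diagnostic plots that motivated them.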

Each notebook (*.Rmd) has an associated report (*.html). The reports are useful to visualize the results of each section as well as the diagnostic plots that we used to set the thresholds and parameters. For a quick inspection, one can paste the URL of the report into the GitHub & BitBucket HTML Preview (note that the side bar and table will not be available with it).

Finally, the figures_scripts directory contains most of the scripts needed to produce the figures as they appear in the paper. The remaining supplementary figures were created either in the notebooks in 4-Revision or by other coauthors.

Other studies

Here is a list of other important benchmarking studies: