Denoising

Removing noise in sparse single-cell RNA-sequencing count data

Repository: openproblems-bio/task_denoising

Description

A key challenge in evaluating denoising methods is the general lack of a ground truth. A recent benchmark study (Hou et al., 2020) relied on flow-sorted datasets, mixture control experiments (Tian et al., 2019), and comparisons with bulk RNA-Seq data. Because each of these approaches has its own limitations, it is difficult to combine them into a single quantitative measure of denoising accuracy. Here, we instead rely on molecular cross-validation (MCV), an approach developed specifically to quantify denoising accuracy in the absence of a ground truth (Batson et al., 2019). In MCV, the observed molecules in a given scRNA-Seq dataset are first partitioned into a training and a test dataset. Next, a denoising method is applied to the training dataset. Finally, denoising accuracy is measured by comparing the result to the test dataset. Batson et al. show, both in theory and in practice, that the accuracy measured this way is representative of the accuracy that would be obtained on a ground-truth dataset.
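
The partitioning step of MCV can be illustrated with a minimal sketch in plain Python. The function name `split_molecules`, the `train_frac` value, and the list-of-lists count matrix are assumptions made for illustration; this is not the task's actual data processor.

```python
import random


def split_molecules(counts, train_frac=0.9, seed=0):
    """Partition each observed molecule into a training or a test matrix.

    counts is a list of rows (cells) of non-negative integer counts.
    train_frac is an illustrative choice, not a value taken from the task.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Each of the c molecules lands in the training set with
        # probability train_frac, independently of all other molecules.
        train_row = [sum(rng.random() < train_frac for _ in range(c)) for c in row]
        train.append(train_row)
        # The remaining molecules form the test set.
        test.append([c - t for c, t in zip(row, train_row)])
    return train, test


counts = [[3, 0, 5], [1, 2, 0]]
train, test = split_molecules(counts)
```

Every molecule ends up in exactly one of the two matrices, so `train` and `test` sum elementwise to `counts`. A denoiser would then see only `train`, and its output would be compared against `test`.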

Authors & contributors

name	roles
Wesley Lewis	author, maintainer
Scott Gigante	author, maintainer
Robrecht Cannoodt	author
Kai Waldrant	contributor

API

flowchart LR
  file_common_dataset("Common Dataset")
  comp_data_processor[/"Data processor"/]
  file_test("Test data")
  file_train("Training data")
  comp_control_method[/"Control Method"/]
  comp_metric[/"Metric"/]
  comp_method[/"Method"/]
  file_prediction("Denoised data")
  file_score("Score")
  file_common_dataset---comp_data_processor
  comp_data_processor-->file_test
  comp_data_processor-->file_train
  file_test---comp_control_method
  file_test---comp_metric
  file_train---comp_control_method
  file_train---comp_method
  comp_control_method-->file_prediction
  comp_metric-->file_score
  comp_method-->file_prediction
  file_prediction---comp_metric

File format: Common Dataset

A subset of the common dataset.

Example file: resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad

Format:

AnnData object
 obs: 'batch'
 layers: 'counts'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

Data structure:

Slot	Type	Description
`obs["batch"]`	`string`	(Optional) Batch information.
`layers["counts"]`	`integer`	Raw counts.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) BibTeX reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
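
A quick way to check a file against the table above is to test for the non-optional slots. The sketch below is an assumption-laden illustration: `missing_slots` and `REQUIRED_SLOTS` are hypothetical names, and the checker is duck-typed so that it runs on any object exposing `layers` and `uns` containers that support `in` (which should include an AnnData object loaded from an `.h5ad` file).

```python
from types import SimpleNamespace

# Slots from the table above that are not marked "(Optional)".
REQUIRED_SLOTS = {
    "layers": ["counts"],
    "uns": ["dataset_id", "dataset_name", "dataset_summary", "dataset_description"],
}


def missing_slots(adata, required=REQUIRED_SLOTS):
    """Return the required slots that are absent from an AnnData-like object."""
    missing = []
    for attr, keys in required.items():
        container = getattr(adata, attr)
        for key in keys:
            if key not in container:
                missing.append(f'{attr}["{key}"]')
    return missing


# Stand-in object used here instead of a real AnnData file.
example = SimpleNamespace(
    layers={"counts": [[1, 0], [0, 2]]},
    uns={"dataset_id": "demo", "dataset_name": "Demo dataset"},
)
print(missing_slots(example))
```

For the stand-in object, the two missing `uns` entries (`dataset_summary` and `dataset_description`) are reported.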

Component type: Data processor

A denoising dataset processor.

Arguments:

Name	Type	Description
`--input`	`file`	A subset of the common dataset.
`--output_train`	`file`	(Output) The subset of molecules used for the training dataset.
`--output_test`	`file`	(Output) The subset of molecules used for the test dataset.

File format: Test data

The subset of molecules used for the test dataset.

Example file: resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad

Format:

AnnData object
 layers: 'counts'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'

Data structure:

Slot	Type	Description
`layers["counts"]`	`integer`	Raw counts.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) BibTeX reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["train_sum"]`	`integer`	The total number of counts in the training dataset.
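
The table only states that `uns["train_sum"]` holds the total number of counts in the training dataset. One plausible use, assumed here rather than taken from this document, is to bring a prediction made at training depth onto the scale of the test data before comparing the two. The helper name `rescale_to_test_depth` is hypothetical.

```python
def rescale_to_test_depth(denoised, train_sum, test_sum):
    """Scale a denoised matrix from training depth to test depth.

    Assumption: the prediction has roughly train_sum total counts, so
    multiplying by test_sum / train_sum makes it comparable to the test data.
    """
    if train_sum <= 0:
        raise ValueError("train_sum must be positive")
    factor = test_sum / train_sum
    return [[value * factor for value in row] for row in denoised]


denoised = [[4.0, 0.0], [2.0, 2.0]]  # toy prediction with 8 counts in total
rescaled = rescale_to_test_depth(denoised, train_sum=8, test_sum=2)
```

After rescaling, the toy prediction sums to the test total of 2.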

File format: Training data

The subset of molecules used for the training dataset.

Example file: resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad

Format:

AnnData object
 layers: 'counts'
 uns: 'dataset_id'

Data structure:

Slot	Type	Description
`layers["counts"]`	`integer`	Raw counts.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.

Component type: Control Method

A control method.

Arguments:

Name	Type	Description
`--input_train`	`file`	The subset of molecules used for the training dataset.
`--input_test`	`file`	The subset of molecules used for the test dataset.
`--output`	`file`	(Output) A denoised dataset as output by a method.
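
Unlike a regular method, a control method receives both the training and the test data. A sketch of why that is useful follows, with two hypothetical baselines: one that ignores the test data and returns the training counts unchanged, and one that returns the test counts as its prediction. The function names and the list-of-lists matrices are illustrative assumptions, not components confirmed by this document.

```python
def no_denoising(train_counts, test_counts):
    """Lower-bound baseline: return the training counts unchanged."""
    return [list(row) for row in train_counts]


def perfect_denoising(train_counts, test_counts):
    """Upper-bound baseline: return the test counts as the prediction."""
    return [list(row) for row in test_counts]


train_counts = [[2, 0], [1, 3]]
test_counts = [[1, 0], [0, 1]]
lower = no_denoising(train_counts, test_counts)
upper = perfect_denoising(train_counts, test_counts)
```

Baselines of this kind bracket the range of scores a metric can produce, which helps when interpreting the scores of real methods.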

Component type: Metric

A metric.

Arguments:

Name	Type	Description
`--input_test`	`file`	The subset of molecules used for the test dataset.
`--input_prediction`	`file`	A denoised dataset as output by a method.
`--output`	`file`	(Output) File indicating the score of a metric.
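
As an illustration of what a metric component computes, the sketch below compares a prediction with the test data using mean squared error. The choice of MSE, the function name, and the list-of-lists matrices are assumptions for illustration; this document does not specify which metrics the task uses.

```python
def mean_squared_error(test_counts, prediction):
    """Mean squared difference between two equally shaped matrices."""
    if len(test_counts) != len(prediction):
        raise ValueError("matrices differ in number of rows")
    total, n = 0.0, 0
    for test_row, pred_row in zip(test_counts, prediction):
        if len(test_row) != len(pred_row):
            raise ValueError("matrices differ in number of columns")
        for observed, predicted in zip(test_row, pred_row):
            total += (observed - predicted) ** 2
            n += 1
    if n == 0:
        raise ValueError("matrices are empty")
    return total / n


score = mean_squared_error([[1, 2], [3, 4]], [[1, 2], [3, 6]])
```

Only one of the four entries differs (by 2), so the squared error of 4 is averaged over 4 entries.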

Component type: Method

A method.

Arguments:

Name	Type	Description
`--input_train`	`file`	The subset of molecules used for the training dataset.
`--output`	`file`	(Output) A denoised dataset as output by a method.

File format: Denoised data

A denoised dataset as output by a method.

Example file: resources_test/task_denoising/cxg_mouse_pancreas_atlas/denoised.h5ad

Format:

AnnData object
 layers: 'denoised'
 uns: 'dataset_id', 'method_id'

Data structure:

Slot	Type	Description
`layers["denoised"]`	`integer`	Denoised data.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["method_id"]`	`string`	A unique identifier for the method.
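
The slots above suggest how a method's output relates to its input: the denoised matrix goes into `layers["denoised"]`, `dataset_id` is carried over from the training data, and `method_id` identifies the method. The sketch below mimics this with plain dictionaries standing in for AnnData objects; `run_method` and the identity "denoiser" are hypothetical placeholders.

```python
def run_method(train, denoise, method_id):
    """Apply a denoising function and assemble the output slots."""
    denoised = denoise(train["layers"]["counts"])
    return {
        "layers": {"denoised": denoised},
        "uns": {
            "dataset_id": train["uns"]["dataset_id"],  # carried over from the input
            "method_id": method_id,
        },
    }


def identity(counts):
    """Placeholder denoiser that returns a copy of its input."""
    return [list(row) for row in counts]


train = {"layers": {"counts": [[2, 0], [1, 3]]}, "uns": {"dataset_id": "demo"}}
output = run_method(train, identity, method_id="identity")
```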

File format: Score

File indicating the score of a metric.

Example file: resources_test/task_denoising/cxg_mouse_pancreas_atlas/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

Slot	Type	Description
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["method_id"]`	`string`	A unique identifier for the method.
`uns["metric_ids"]`	`string`	One or more unique metric identifiers.
`uns["metric_values"]`	`double`	The metric values obtained for the given prediction. Must be of the same length as `metric_ids`.
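
The length requirement on `metric_ids` and `metric_values` can be enforced when the score record is assembled. The sketch below again uses a plain dictionary in place of an AnnData object, and `build_score` is a hypothetical helper name.

```python
def build_score(dataset_id, method_id, metric_ids, metric_values):
    """Assemble the `uns` slots of a score file, checking the length constraint."""
    if len(metric_ids) != len(metric_values):
        raise ValueError("metric_ids and metric_values must have the same length")
    return {
        "uns": {
            "dataset_id": dataset_id,
            "method_id": method_id,
            "metric_ids": list(metric_ids),
            "metric_values": [float(v) for v in metric_values],
        }
    }


score_record = build_score("demo", "identity", ["mse"], [1])
```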
