
Multi-verse

A package for comparing MOFA+, Mowgli, MultiVI, and PCA on multimodal datasets, providing scIB metrics and UMAP visualizations.


Logo generated with the help of ChatGPT.

About The Project

Multi-verse is a Python package for comparing multimodal data integration methods, specifically MOFA+, Mowgli, MultiVI, and PCA. It computes scIB metrics and generates UMAP visualizations so that researchers can assess and visualize how these methods perform on their own datasets.

Key features:

- Supports comparison of four major methods: MOFA+, Mowgli, MultiVI, and PCA.
- Provides scIB metrics for integration performance evaluation.
- Generates UMAP visualizations for easy interpretation of results.

Getting Started

To get a local copy up and running, follow the steps below.

Prerequisites

It is recommended to create a new virtual environment with conda.

cmake is required for the louvain package to install properly.

Installation

Clone the repository:

```
git clone https://github.com/sifrimlab/multi-verse.git
cd multi-verse
```

Create a new conda environment:

```
conda env create -f environment.yml
conda activate multiverse
```

Usage

To run the pipeline, pass a JSON configuration file as an argument to main.py. The configuration file should include all settings for the methods and metrics you want to compare; see "Practicalities" for details and config.json for an example structure. The package also includes utilities for preprocessing data, hyperparameter tuning, and evaluating model performance.

Run the code (with the example config.json file):
```
python main.py config.json
```

Practicalities

Model Overview

| Model | Pairing Type | Methodology | Hyperparameter Evaluation Metric | Supports scIB Metrics |
| --- | --- | --- | --- | --- |
| PCA | Unpaired | Linear Dimensionality Reduction | Variance Score | Yes |
| MOFA+ | Paired | Variational Inference | Variance Score | Yes |
| MultiVI | Paired-guided | Deep Generative Model | Silhouette Score | Yes |
| Mowgli | Paired | Optimal Transport and Nonnegative Matrix Factorization (NMF) | Optimal Transport Loss | Yes |

JSON file

The JSON configuration file serves as the blueprint for the pipeline, specifying datasets, preprocessing parameters, and model configurations. Below is a breakdown of the key components of the configuration file:

Top-Level Parameters

- _run_user_params: A boolean flag; when true, the models are run with the parameters specified by the user.
- _run_gridsearch: A boolean flag that enables or disables the grid search for hyperparameter optimization.
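
As a minimal sketch of how these two flags could be read from a configuration, the snippet below parses a JSON string and reports which modes are enabled. The helper name `read_run_flags` and the choice to treat a missing flag as `False` are assumptions made for illustration; they do not describe what `main.py` actually does.

```python
import json


def read_run_flags(config_text: str) -> dict:
    """Return the two top-level run flags from a JSON config string.

    Hypothetical helper: defaulting a missing flag to False is an assumption.
    """
    config = json.loads(config_text)
    return {
        "_run_user_params": bool(config.get("_run_user_params", False)),
        "_run_gridsearch": bool(config.get("_run_gridsearch", False)),
    }


# Example configuration fragment containing only the two flags.
example = '{"_run_user_params": true, "_run_gridsearch": false}'
flags = read_run_flags(example)
print(flags)
```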

Datasets

Specifies the datasets used in the pipeline.

dataset_NAME: Represents one dataset, where NAME is the name of the dataset. Each entry needs to contain:
- data_path: Directory path where the data files are stored.
- rna, atac, and adt: One entry per modality (RNA, ATAC, and ADT data), each with:
  - file_name: Name of the data file.
  - is_preprocessed: Whether the data is already preprocessed (true or false).
  - annotation: Label for cell types or other metadata.
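
The field list above could translate into an entry like the one sketched below, written as a Python dictionary and serialized to JSON. The dataset name `dataset_Example`, the file names, the path, and the exact nesting are all hypothetical; the config.json shipped with the repository remains the reference for the real layout. The small `missing_keys` helper is likewise an illustrative assumption, not part of the package.

```python
import json

# Hypothetical dataset entry; the nesting follows the field list above,
# but the exact layout expected by the pipeline is an assumption.
dataset_entry = {
    "dataset_Example": {
        "data_path": "./data/example/",
        "rna": {
            "file_name": "example_rna.h5ad",
            "is_preprocessed": False,
            "annotation": "cell_type",
        },
        "atac": {
            "file_name": "example_atac.h5ad",
            "is_preprocessed": False,
            "annotation": "cell_type",
        },
    }
}

REQUIRED_MODALITY_KEYS = {"file_name", "is_preprocessed", "annotation"}


def missing_keys(entry: dict) -> dict:
    """Map each modality present in the entry to the required keys it lacks."""
    problems = {}
    for modality in ("rna", "atac", "adt"):
        if modality in entry:
            missing = REQUIRED_MODALITY_KEYS - entry[modality].keys()
            if missing:
                problems[modality] = sorted(missing)
    return problems


print(json.dumps(dataset_entry, indent=2))
print(missing_keys(dataset_entry["dataset_Example"]))
```

A check of this kind can catch a misspelled key before any model is trained.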

This pipeline comes preconfigured with two datasets, dataset_Pbmc10k and dataset_TEA, which serve as examples for model comparison or tutorials for getting started with the pipeline. These datasets are already integrated into the configuration file and are ready to use without additional setup.

dataset_Pbmc10k - download here
- Description: A multi-modal dataset featuring RNA and ATAC data from 10,000 Peripheral Blood Mononuclear Cells (PBMCs).
- Data Path: The data is located in the directory specified by data_path.
- Modalities:
  - RNA: 10x-Multiome-Pbmc10k-RNA.h5ad
  - ATAC: 10x-Multiome-Pbmc10k-ATAC.h5ad
- Annotation: Contains cell type annotations, useful for visualization and evaluation.
dataset_TEA - download here
- Description: A multi-modal dataset with RNA, ATAC, and ADT modalities, originating from a leukopak sample.
- Data Path: The data is located in the directory specified by data_path.
- Modalities:
  - RNA: GSM4949911_X061-AP0C1W1_leukopak_perm-cells_tea_fulldepth_cellranger-arc_filtered_feature_bc_matrix.h5
  - ATAC: Same file as RNA, as ATAC peaks are included.
  - ADT: GSM4949911_tea_fulldepth_adt_counts.csv.gz
- Annotation: This dataset does not include pre-defined annotations but is ideal for testing multi-modal capabilities.

Model

Configures the models and their hyperparameters.

The model flags select which models are run:

- is_mofa+, is_pca, is_multivi, is_mowgli: Enable or disable specific models with a boolean value (true or false).

Model-specific settings:

- Key hyperparameters differ between models and need to be specified correctly when _run_user_params is enabled.
- device: Specifies computation hardware (cpu or cuda:).
- grid_search_params: A set of user-specified hyperparameter values that are searched when _run_gridsearch is enabled.
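
To illustrate what a parameterized grid search over `grid_search_params` involves, the sketch below expands a dictionary of candidate values into every combination. It assumes an exhaustive Cartesian grid, which the source does not spell out, and the parameter names `n_factors` and `learning_rate` are hypothetical placeholders rather than names taken from the package.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return every combination of the listed hyperparameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical parameter names and values, for illustration only.
grid_search_params = {"n_factors": [5, 10], "learning_rate": [0.001, 0.01]}

combinations = expand_grid(grid_search_params)
for combo in combinations:
    print(combo)
```

The number of runs is the product of the list lengths, so each additional hyperparameter multiplies the cost of the search.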

Preprocessing of modalities

The preprocessing parameters for RNA, ATAC, and ADT data are specified in preprocess_params.

RNA and ATAC:
- min_genes_by_counts, max_genes_by_counts, normalization_target_sum, etc.: Parameters for filtering and normalization.

ADT:
- per_cell_normalization: Enables normalization for ADT data.

The device used for modality preprocessing is specified in the device section at the end of the configuration file:

- device: Specifies the default device (cpu or gpu) for training.
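
For readers unfamiliar with a parameter such as normalization_target_sum, the sketch below shows the conventional meaning of target-sum normalization in single-cell workflows: each cell's counts are rescaled so that they add up to a fixed total. Whether the pipeline applies exactly this step is an assumption, and the function `normalize_to_target_sum` is a hypothetical pure-Python illustration, not code from the package.

```python
def normalize_to_target_sum(counts, target_sum=10_000.0):
    """Scale each cell (row) so that its counts sum to target_sum.

    Rows whose total is zero are left unchanged to avoid division by zero.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 1.0
        normalized.append([value * scale for value in row])
    return normalized


# Toy count matrix: 2 cells x 3 genes (made-up values).
counts = [[1, 3, 6], [0, 0, 0]]
result = normalize_to_target_sum(counts, target_sum=100.0)
print(result)
```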

Results Format

Gridsearch

During grid search, the UMAP and latent embeddings are generated and saved only for the best model of each model-dataset combination, once the search for that combination has completed. The output is saved in the ./outputs/gridsearch_output folder. Finally, a summary of the grid search results is printed to the console, listing the best score and the corresponding parameters for each model-dataset combination.

Evaluation

The evaluation assesses each model with several scIB metrics, applied to the latent embeddings generated during training. Results are summarized for each model-dataset combination and saved in the ./outputs/results.json file.

The following metrics are calculated using the scib.metrics.metrics module:

- Adjusted Rand Index (ARI): Measures clustering accuracy compared to known annotations.
- Normalized Mutual Information (NMI): Evaluates the agreement between cluster assignments and annotations.
- Silhouette Score: Assesses the quality of clustering in terms of sample separation.
- Graph Connectivity (Graph Conn): Evaluates batch mixing and integration effectiveness.
- Isolated Labels Silhouette Score (Isolated ASW): Quantifies how well isolated clusters are preserved after integration.
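
To make the first metric concrete, the sketch below computes the Adjusted Rand Index from its standard pair-counting definition using only the Python standard library. It is an independent illustration of what the score measures, not the scIB implementation used by the pipeline, and the toy labels are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treat as trivially identical
    # Pairs of cells that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of cells that share a cluster within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy example: cluster ids differ from annotation names, but the grouping matches.
annotations = ["T cell", "T cell", "B cell", "B cell"]
clusters = [0, 0, 1, 1]
print(adjusted_rand_index(annotations, clusters))
```

Because the score depends only on which cells are grouped together, relabeling the clusters does not change it; a clustering unrelated to the annotations scores near zero or below.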

Contributing

Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the GPL-3 License. See LICENSE for more information.

Contact

Project Link: https://github.com/sifrimlab/multi-verse

Contributors

This project was developed as part of the Integrated Bioinformatics Project (B-KUL-I0U20A) course at the Faculty of Bioscience Engineering, KU Leuven.

Authors

Yuxin Qiu

Thi Hanh Nguyen Ly

Zuzanna Olga Bednarska

Supervisors

Anis Ismail

Lorenzo Venturelli

Promotor

Prof. Alejandro Sifrim

Course Coordinator

Prof. Vera van Noort
