A package for comparing MOFA+, Mowgli, MultiVI, and PCA on multimodal datasets, providing scIB metrics and UMAP visualizations.
Logo generated with the help of ChatGPT.
Multi-verse is a Python package designed to facilitate the comparison of multimodal data integration methods, specifically MOFA+, Mowgli, MultiVI, and PCA. By computing scIB metrics and generating UMAP visualizations, it enables researchers to assess and visualize how these methods perform on their own datasets.
Key features:
- Supports comparison of four major methods: MOFA+, Mowgli, MultiVI, and PCA.
- Provides scIB metrics for integration performance evaluation.
- Generates UMAP visualizations for easy interpretation of results.
To get a local copy up and running, follow the steps below.
It is recommended to create a new virtual environment with conda.
cmake is required for the louvain package to install properly.
- Clone the repository:
  git clone https://github.com/sifrimlab/multi-verse.git
  cd multi-verse
- Create a new conda environment:
  conda env create -f environment.yml
  conda activate multiverse
- To run the script, provide a JSON configuration file as an argument. The configuration file should include all necessary settings for the methods and metrics you want to compare. See "Practicalities" for more information and config.json for an example structure. The package also includes utilities for data preprocessing, hyperparameter tuning, and evaluation of model performance.
- Run the code (with the example config.json file):
  python main.py config.json
| Model | Pairing Type | Methodology | Hyperparameter Evaluation Metric | Supports scIB Metrics |
|---|---|---|---|---|
| PCA | Unpaired | Linear Dimensionality Reduction | Variance Score | Yes |
| MOFA+ | Paired | Variational Inference | Variance Score | Yes |
| MultiVI | Paired-guided | Deep Generative Model | Silhouette Score | Yes |
| Mowgli | Paired | Optimal Transport and Nonnegative Matrix Factorization (NMF) | Optimal Transport Loss | Yes |
The JSON configuration file serves as the blueprint for the pipeline, specifying datasets, preprocessing parameters, and model configurations. Below is a breakdown of the key components of the configuration file:
- _run_user_params: A boolean flag that enables running the models with the parameters specified by the user.
- _run_gridsearch: A boolean flag that enables or disables grid search for hyperparameter optimization.
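As a rough illustration of how these two flags might be consumed, the sketch below parses a configuration and reports which modes are switched on. The `enabled_modes` helper is hypothetical and not part of Multi-verse; only the flag names come from the description above, and treating a missing flag as disabled is an assumption.

```python
import json


def enabled_modes(config: dict) -> list[str]:
    """Return the run modes switched on in a parsed configuration."""
    modes = []
    # Both flags are described as booleans; a missing flag is
    # treated as False here (an assumption of this sketch).
    if config.get("_run_user_params", False):
        modes.append("user_params")
    if config.get("_run_gridsearch", False):
        modes.append("gridsearch")
    return modes


# Inline JSON stands in for the contents of a config.json file.
example = json.loads('{"_run_user_params": true, "_run_gridsearch": false}')
print(enabled_modes(example))
```

Nothing in the flag descriptions rules out enabling both at once, so the helper returns a list rather than a single mode.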
The dataset entries specify the datasets used in the pipeline.
- dataset_NAME: Represents one dataset, where NAME is the dataset's name. It must contain:
  - data_path: Directory path where the data files are stored.
  - rna, atac, and adt: The modalities (RNA, ATAC, and ADT data).
  - file_name: Name of the data file.
  - is_preprocessed: Whether the data is already preprocessed (true or false).
  - annotation: Label for cell types or other metadata.
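A dataset entry could look roughly like the sketch below, written as a Python dictionary and serialized to JSON. The nesting of file_name, is_preprocessed, and annotation under each modality is an assumption, as are the dataset name, path, file names, and annotation label; the bundled config.json remains the authoritative reference for the real layout.

```python
import json

# Hypothetical dataset entry: the nesting and every value are illustrative assumptions.
dataset_entry = {
    "dataset_Example": {
        "data_path": "data/example/",
        "rna": {
            "file_name": "example_rna.h5ad",
            "is_preprocessed": False,
            "annotation": "cell_type",
        },
        "atac": {
            "file_name": "example_atac.h5ad",
            "is_preprocessed": False,
            "annotation": "cell_type",
        },
    }
}


def modalities(entry: dict) -> list[str]:
    """List the modalities (rna, atac, adt) present in one dataset entry."""
    return [key for key in ("rna", "atac", "adt") if key in entry]


print(json.dumps(dataset_entry, indent=2))
print(modalities(dataset_entry["dataset_Example"]))
```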
This pipeline comes preconfigured with two datasets, dataset_Pbmc10k and dataset_TEA, which serve as examples for model comparison or tutorials for getting started with the pipeline. These datasets are already integrated into the configuration file and are ready to use without additional setup.
- dataset_Pbmc10k - download here
  - Description: A multi-modal dataset featuring RNA and ATAC data from 10,000 Peripheral Blood Mononuclear Cells (PBMCs).
  - Data Path: The data is located in the directory specified by data_path.
  - Modalities:
    - RNA: 10x-Multiome-Pbmc10k-RNA.h5ad
    - ATAC: 10x-Multiome-Pbmc10k-ATAC.h5ad
  - Annotation: Contains cell type annotations, useful for visualization and evaluation.
- dataset_TEA - download here
  - Description: A multi-modal dataset with RNA, ATAC, and ADT modalities, originating from a leukopak sample.
  - Data Path: The data is located in the directory specified by data_path.
  - Modalities:
    - RNA: GSM4949911_X061-AP0C1W1_leukopak_perm-cells_tea_fulldepth_cellranger-arc_filtered_feature_bc_matrix.h5
    - ATAC: Same file as RNA, as ATAC peaks are included.
    - ADT: GSM4949911_tea_fulldepth_adt_counts.csv.gz
  - Annotation: This dataset does not include pre-defined annotations but is ideal for testing multi-modal capabilities.
This part of the configuration sets up the models and their hyperparameters.
The model flags select which models are run:
- is_mofa+, is_pca, is_multivi, is_mowgli: Enable or disable each model with a boolean value.
Model-specific settings:
- Key hyperparameters differ between models and must be specified correctly when _run_user_params is enabled.
- device: Specifies computation hardware (cpu or cuda:).
- grid_search_params: The set of hyperparameter values, specified by the user, that is explored during grid search when _run_gridsearch is enabled.
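To show what a grid search over grid_search_params involves, here is a minimal sketch that expands a dictionary of candidate values into every combination. It assumes each hyperparameter maps to a list of candidate values; the `expand_grid` helper and the hyperparameter names are hypothetical and do not reflect the package's internals.

```python
from itertools import product


def expand_grid(grid_search_params: dict) -> list[dict]:
    """Expand {name: [candidates]} into one dict per parameter combination."""
    names = list(grid_search_params)
    candidate_lists = [grid_search_params[name] for name in names]
    # product() yields one tuple of values per combination, in input order.
    return [dict(zip(names, values)) for values in product(*candidate_lists)]


# Hypothetical hyperparameter names and values, used only for illustration.
grid = {"n_factors": [5, 10], "learning_rate": [0.01, 0.001]}
for combination in expand_grid(grid):
    print(combination)
```

The number of combinations is the product of the list lengths, so each additional hyperparameter multiplies the number of training runs.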
The preprocessing parameters for RNA, ATAC, and ADT data are specified in preprocess_params.
- RNA and ATAC:
  - min_genes_by_counts, max_genes_by_counts, normalization_target_sum, etc.: Parameters for filtering and normalization.
- ADT:
  - per_cell_normalization: Enables normalization for ADT data.
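To make the RNA and ATAC parameters more concrete, the pure-Python sketch below shows one plausible reading of them: a cell is kept when its number of detected features lies between min_genes_by_counts and max_genes_by_counts, and each kept cell is rescaled so its counts sum to normalization_target_sum. This interpretation is an assumption, the `filter_and_normalize` helper is hypothetical, and the package's actual preprocessing may include further steps.

```python
def filter_and_normalize(counts, min_genes_by_counts, max_genes_by_counts,
                         normalization_target_sum):
    """Filter cells by number of detected features, then scale each cell.

    counts is a list of per-cell count lists (cells x features).
    """
    processed = []
    for cell in counts:
        detected = sum(1 for value in cell if value > 0)
        if not (min_genes_by_counts <= detected <= max_genes_by_counts):
            continue  # drop cells outside the allowed range
        total = sum(cell)
        if total == 0:
            continue  # an empty cell cannot be rescaled
        processed.append([value * normalization_target_sum / total for value in cell])
    return processed


# Toy count matrix with three cells and three features (made-up values).
toy_counts = [[1, 0, 3], [0, 0, 0], [2, 2, 4]]
normalized = filter_and_normalize(toy_counts, 2, 3, 100.0)
print(normalized)
```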
The device to use for modality preprocessing is specified in the device section at the end:
- device: Specifies the default device (cpu or gpu) for training.
During grid search, the UMAP and latent embeddings are generated and saved only for the best model of each model-dataset combination, once the grid search for that combination has finished. The outputs are saved in the ./outputs/gridsearch_output folder. Finally, a summary of the grid search results is printed to the console, showing the best score and the corresponding parameters for each model-dataset combination.
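The selection of the best run per model-dataset combination can be pictured with the sketch below. The record layout (model, dataset, params, score) and the `best_per_combination` helper are hypothetical, the scores are made-up toy values, and the sketch assumes a higher score is better, which would need to be inverted for loss-based metrics.

```python
def best_per_combination(runs):
    """Keep the highest-scoring run for each (model, dataset) pair."""
    best = {}
    for run in runs:
        key = (run["model"], run["dataset"])
        if key not in best or run["score"] > best[key]["score"]:
            best[key] = run
    return best


# Made-up grid search records, for illustration only.
runs = [
    {"model": "pca", "dataset": "dataset_Pbmc10k", "params": {"n_components": 10}, "score": 0.61},
    {"model": "pca", "dataset": "dataset_Pbmc10k", "params": {"n_components": 20}, "score": 0.74},
    {"model": "multivi", "dataset": "dataset_Pbmc10k", "params": {"n_latent": 15}, "score": 0.42},
]

for (model, dataset), run in best_per_combination(runs).items():
    print(model, dataset, run["score"], run["params"])
```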
The evaluation process assesses the performance of each model with several scIB metrics, applied to the latent embeddings generated during training. Results are summarized for each model-dataset combination and saved in the ./outputs/results.json file.
The following metrics are calculated using the scib.metrics.metrics module:
- Adjusted Rand Index (ARI): Measures clustering accuracy compared to known annotations.
- Normalized Mutual Information (NMI): Evaluates the agreement between cluster assignments and annotations.
- Silhouette Score: Assesses the quality of clustering in terms of sample separation.
- Graph Connectivity (Graph Conn): Evaluates batch mixing and integration effectiveness.
- Isolated Labels Silhouette Score (Isolated ASW): Quantifies how well isolated clusters are preserved after integration.
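For intuition about what a score such as the ARI measures, the sketch below computes it from pair counts using the standard textbook definition. It is not the scIB implementation, which should be used for real evaluations; the labels are toy values.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts in the contingency table."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical
    # Pairs of samples that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy annotations and cluster assignments that agree up to renaming.
annotations = ["B", "B", "T", "T", "NK", "NK"]
clusters = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(annotations, clusters))
```

Because the index depends only on which samples are grouped together, renaming clusters does not change the score, and labelings that agree no better than chance score close to zero.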
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the GPL-3 License. See LICENSE for more information.
Project Link: https://github.com/sifrimlab/multi-verse
This project was developed as part of the Integrated Bioinformatics Project (B-KUL-I0U20A) course at the Faculty of Bioscience Engineering, KU Leuven.
Anis Ismail
Lorenzo Venturelli
Prof. Alejandro Sifrim
Prof. Vera van Noort