
Multi-verse

A package for comparing MOFA, MOWGLI, MultiVI, and PCA on multimodal datasets, providing scIB metrics and UMAP visualizations.

Report Bug · Add Feature

Logo generated with the help of ChatGPT.

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Practicalities
  5. Contributing
  6. License
  7. Contact
  8. Contributors

About The Project

Multi-verse is a Python package designed to facilitate the comparison of multimodal data integration methods, specifically MOFA, MOWGLI, MultiVI, and PCA. By leveraging scIB metrics and generating UMAP visualizations, this package enables researchers to assess and visualize the performance of these methods on their datasets.

Key features:

  • Supports comparison of four major methods: MOFA+, Mowgli, MultiVI, and PCA.
  • Provides scIB metrics for integration performance evaluation.
  • Generates UMAP visualizations for easy interpretation of results.

Getting Started

To get a local copy up and running, follow the steps below.

Prerequisites

It is recommended to create a new virtual environment with conda.

cmake is required for the louvain package to install properly.

Installation

  1. Clone the repository:

    git clone https://github.com/sifrimlab/multi-verse.git
    cd multi-verse
  2. Create a new conda environment:

    conda env create -f environment.yml
    conda activate multiverse

Usage

  1. To run the script, provide a configuration JSON file as an argument. The configuration file should include all necessary settings for the methods and metrics you want to compare; see "Practicalities" for more information and config.json for an example structure. The package includes utilities for data preprocessing, hyperparameter tuning, and evaluation of model performance.

  2. Run the code (with the example config.json file):

    python main.py config.json

Practicalities

Model Overview

| Model   | Pairing Type  | Methodology                                                  | Hyperparameter Evaluation Metric | Supports scIB Metrics |
|---------|---------------|--------------------------------------------------------------|----------------------------------|-----------------------|
| PCA     | Unpaired      | Linear Dimensionality Reduction                              | Variance Score                   | Yes                   |
| MOFA+   | Paired        | Variational Inference                                        | Variance Score                   | Yes                   |
| MultiVI | Paired-guided | Deep Generative Model                                        | Silhouette Score                 | Yes                   |
| Mowgli  | Paired        | Optimal Transport and Nonnegative Matrix Factorization (NMF) | Optimal Transport Loss           | Yes                   |

JSON file

The JSON configuration file serves as the blueprint for the pipeline, specifying datasets, preprocessing parameters, and model configurations. Below is a breakdown of the key components of the configuration file:

Top-Level Parameters

  • _run_user_params: A boolean flag to enable running the models with user-specified parameters.

  • _run_gridsearch: A boolean flag to enable or disable parameterized search for hyperparameter optimization.
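As a small sketch, these two flags might sit at the top of the configuration file like this (the key names come from the pipeline; the values are arbitrary examples):

```json
{
  "_run_user_params": true,
  "_run_gridsearch": false
}
```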

Datasets

Specifies the datasets used in the pipeline.

  • dataset_NAME: Represents a dataset. It needs to contain:
    • data_path: Directory path where data files are stored.
    • rna, atac, and adt: Different modalities (RNA, ATAC, and ADT data).
      • file_name: Name of the data file.
      • is_preprocessed: Whether the data is preprocessed (true or false).
      • annotation: Label for cell types or other metadata.
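As an illustrative sketch, a dataset entry for the bundled PBMC10k data (described below) might look like this; the field names follow the list above and the file names match the bundled dataset, while the data_path and annotation values are placeholders:

```json
"dataset_Pbmc10k": {
  "data_path": "./data/pbmc10k/",
  "rna": {
    "file_name": "10x-Multiome-Pbmc10k-RNA.h5ad",
    "is_preprocessed": false,
    "annotation": "cell_type"
  },
  "atac": {
    "file_name": "10x-Multiome-Pbmc10k-ATAC.h5ad",
    "is_preprocessed": false,
    "annotation": "cell_type"
  }
}
```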

This pipeline comes preconfigured with two datasets, dataset_Pbmc10k and dataset_TEA, which serve as examples for model comparison or tutorials for getting started with the pipeline. These datasets are already integrated into the configuration file and are ready to use without additional setup.

  • dataset_Pbmc10k - download here

    • Description: A multi-modal dataset featuring RNA and ATAC data from 10,000 Peripheral Blood Mononuclear Cells (PBMCs).
    • Data Path: The data is located in the directory specified by data_path.
    • Modalities:
      • RNA: 10x-Multiome-Pbmc10k-RNA.h5ad
      • ATAC: 10x-Multiome-Pbmc10k-ATAC.h5ad
    • Annotation: Contains cell type annotations, useful for visualization and evaluation.
  • dataset_TEA - download here

    • Description: A multi-modal dataset with RNA, ATAC, and ADT modalities, originating from a leukopak sample.
    • Data Path: The data is located in the directory specified by data_path.
    • Modalities:
      • RNA: GSM4949911_X061-AP0C1W1_leukopak_perm-cells_tea_fulldepth_cellranger-arc_filtered_feature_bc_matrix.h5
      • ATAC: Same file as RNA, as ATAC peaks are included.
      • ADT: GSM4949911_tea_fulldepth_adt_counts.csv.gz
    • Annotation: This dataset does not include pre-defined annotations but is ideal for testing multi-modal capabilities.

Model

Configures the models and their hyperparameters.

The model flags select which models are run:

  • is_mofa+, is_pca, is_multivi, is_mowgli: Boolean flags to enable or disable each model.

Model-specific settings:

  • Key hyperparameters vary between models and must be specified correctly when _run_user_params is enabled.
  • device: Specifies computation hardware (cpu or cuda:).
  • grid_search_params: A set of hyperparameter values, specified by the user, to explore during grid search when _run_gridsearch is enabled.
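A hypothetical model block might look as follows. The boolean flags, device, and grid_search_params keys come from the description above; the surrounding section name and the n_factors hyperparameter with its candidate values are illustrative assumptions only (JSON permits no comments, so the hedging lives in this sentence):

```json
"model": {
  "is_pca": true,
  "is_mofa+": true,
  "is_multivi": false,
  "is_mowgli": false,
  "device": "cpu",
  "grid_search_params": {
    "n_factors": [10, 20, 30]
  }
}
```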

Preprocessing of modalities

In preprocess_params, the preprocessing parameters need to be specified for RNA, ATAC, and ADT data.

  • RNA and ATAC:
    • min_genes_by_counts, max_genes_by_counts, normalization_target_sum, etc.: Parameters for filtering and normalization.
  • ADT:
    • per_cell_normalization: Enables normalization for ADT data.

The device to be used for modality preprocessing needs to be specified in the device section at the end:

  • device: Specifies the default device (cpu or gpu).
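Putting these together, a preprocess_params section might be sketched like this; the key names come from the list above, while the numeric values and the per-modality grouping are illustrative placeholders:

```json
"preprocess_params": {
  "rna": {
    "min_genes_by_counts": 200,
    "max_genes_by_counts": 5000,
    "normalization_target_sum": 10000
  },
  "atac": {
    "min_genes_by_counts": 500,
    "max_genes_by_counts": 15000,
    "normalization_target_sum": 10000
  },
  "adt": {
    "per_cell_normalization": true
  },
  "device": "cpu"
}
```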

Results Format

Gridsearch

For the grid search, the UMAP plots and latent embeddings are generated and saved only for the best model of each model-dataset combination, after the grid search for that combination completes. Outputs are saved in the ./outputs/gridsearch_output folder. Finally, a summary of the grid-search results, including the best score and the best parameters for each model-dataset combination, is printed to the console.

Evaluation

The evaluation process assesses the performance of each model using several scIB metrics, applied to the latent embeddings generated during training. Results are summarized for each model-dataset combination and saved in the ./outputs/results.json file.

The following metrics are calculated using the scib.metrics.metrics module:

  • Adjusted Rand Index (ARI): Measures clustering accuracy compared to known annotations.
  • Normalized Mutual Information (NMI): Evaluates the agreement between cluster assignments and annotations.
  • Silhouette Score: Assesses the quality of clustering in terms of sample separation.
  • Graph Connectivity (Graph Conn): Evaluates batch mixing and integration effectiveness.
  • Isolated Labels Silhouette Score (Isolated ASW): Quantifies how well isolated clusters are preserved after integration.
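To build intuition for the first two metrics, here is a small standalone sketch computing ARI and NMI with scikit-learn on toy labelings. Note that the pipeline itself computes its metrics via scib.metrics.metrics on the latent embeddings; the label vectors below are made up for illustration:

```python
# Toy illustration of ARI and NMI using scikit-learn (the pipeline
# itself relies on scib.metrics.metrics; these labelings are made up).
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

annotations = [0, 0, 0, 1, 1, 2]  # hypothetical known cell-type labels
clusters = [1, 1, 1, 0, 0, 2]     # the same partition, relabeled

# Both metrics are invariant to label permutation, so a relabeled
# copy of the same partition scores perfectly (approximately 1.0).
ari = adjusted_rand_score(annotations, clusters)
nmi = normalized_mutual_info_score(annotations, clusters)
print(ari, nmi)
```

Disagreements between the clustering and the annotations pull both scores below 1, with ARI dropping toward 0 (or slightly below) for random assignments.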

Contributing

Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the GPL-3 License. See LICENSE for more information.

Contact

Project Link: https://github.com/sifrimlab/multi-verse

Contributors

This project was developed as part of the Integrated Bioinformatics Project (B-KUL-I0U20A) course at the Faculty of Bioscience Engineering, KU Leuven.

Authors

Yuxin Qiu

Thi Hanh Nguyen Ly

Zuzanna Olga Bednarska

Supervisors

Anis Ismail

Lorenzo Venturelli

Promotor

Prof. Alejandro Sifrim

Course Coordinator

Prof. Vera van Noort
