A collaboration between scverse, Lamin, and anyone interested in contributing!
This repository contains benchmarking scripts & utilities for scRNA-seq data loaders and lets you collaboratively contribute new benchmarking results.
Setup:
git clone https://github.com/laminlabs/arrayloader-benchmarks
cd arrayloader-benchmarks
uv pip install -e ".[scdataset,annbatch]" # select the extras for the tools you'd like to install
lamin connect laminlabs/arrayloader-benchmarks # to contribute results to the hosted lamindb instance; or call `lamin init` to create a new lamindb instance

Typical calls of the main benchmarking script are:
cd scripts
python run_loading_benchmark_on_collection.py annbatch # run annbatch on collection Tahoe100M_tiny, n_datasets = 1
python run_loading_benchmark_on_collection.py MappedCollection # run MappedCollection
python run_loading_benchmark_on_collection.py scDataset # run scDataset
python run_loading_benchmark_on_collection.py annbatch --n_datasets -1 # run against all datasets in the collection
python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets -1 # run against the full 100M cells
python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 1 # run against the first dataset, 2M cells
python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5 # run against the first 5 datasets, 10M cells

You can choose between different benchmarking dataset collections.
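As a rough illustration of what a loading benchmark measures, here is a minimal sketch that times iteration over batches and reports throughput. This is not the script's actual implementation; the function name and the plain-iterable stand-in for a data loader are assumptions for illustration only:

```python
import time

def benchmark_loader(loader, n_batches=100):
    # Iterate over batches and measure wall-clock throughput.
    start = time.perf_counter()
    n_samples = 0
    for i, batch in enumerate(loader):
        n_samples += len(batch)
        if i + 1 >= n_batches:
            break
    elapsed = time.perf_counter() - start
    return n_samples / elapsed  # samples per second

# Stand-in "loader": a generator yielding batches of 128 dummy samples
dummy = ([0] * 128 for _ in range(100))
print(f"{benchmark_loader(dummy):.0f} samples/sec")
```

The real script additionally varies the loader backend (annbatch, MappedCollection, scDataset) and the collection size via the CLI flags shown above.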
When running the script, parameters and results are automatically tracked in a parquet file, along with source code, run environment, and input and output datasets.
Note: A previous version of this repo contained the benchmarking scripts accompanying the 2024 blog post: lamin.ai/blog/arrayloader-benchmarks.