GENERanno: A Genomic Foundation Model for Metagenomic Annotation

📰 News

📑 [2025-06-05] Our paper is now available on bioRxiv!
🤗 [2025-05-10] Our expert model for metagenomic annotation GENERanno-prokaryote-0.5b-cds-annotator is now available on HuggingFace!
🤗 [2025-02-11] Our models GENERanno-prokaryote-0.5b-base and GENERanno-eukaryote-0.5b-base are now available on HuggingFace!

🔭 Overview

In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on 386 billion base pairs of eukaryotic DNA. In our evaluations, GENERanno performs comparably to GENERator on Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks, which makes the two models the top genomic foundation models on these benchmarks as of 2025-02.
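Because the context length is fixed, genomic sequences longer than it have to be split before they are passed to the model. The following is a minimal sketch of overlapping windowing, not the repository's own preprocessing. It assumes that "8k" means 8192 bp and that the limit applies to the raw sequence length; `CONTEXT_BP`, `windows`, and the default overlap are hypothetical names and values chosen for illustration.

```python
from typing import Iterator, Tuple

# Assumption: "8k base pairs" is taken as 8192; adjust if the real limit differs.
CONTEXT_BP = 8192


def windows(seq: str, size: int = CONTEXT_BP, overlap: int = 512) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) pairs of at most `size` bp that cover `seq`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not seq:
        return
    step = size - overlap  # how far each window advances
    start = 0
    while True:
        yield start, seq[start:start + size]
        if start + size >= len(seq):
            break  # this window reached the end of the sequence
        start += step
```

The overlap keeps a gene that straddles a window boundary fully visible in at least one window, provided the gene is shorter than the overlap.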

Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model identifies gene locations, predicts gene function, and annotates gene structure efficiently and accurately, which could substantially improve the precision and efficiency of gene annotation workflows.

Please note that GENERanno is still under development. We are actively refining the model and will release more technical details soon. Stay tuned for updates!

In this repository, you will find the following model checkpoints:

Model Name	Parameters	Training Data (bp)	Category	Status
`GENERanno-eukaryote-0.5b-base`	0.5B	386B	Eukaryote	Available
`GENERanno-prokaryote-0.5b-base`	0.5B	715B	Prokaryote	Available
`GENERanno-eukaryote-1b-base`	1B	386B	Eukaryote	Awaiting sponsorship
`GENERanno-prokaryote-1b-base`	1B	715B	Prokaryote	Awaiting sponsorship
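The available checkpoints can presumably be loaded through the `transformers` Auto classes. The sketch below is illustrative only. It assumes the checkpoints are hosted under the `GenerTeam` organisation (as the dataset ids in this README are) and that they need `trust_remote_code=True`; `repo_id` and `load` are hypothetical helper names.

```python
# Assumption: the Hub organisation is the same one used for the datasets in this README.
HF_ORG = "GenerTeam"

# Checkpoints marked "Available" in the table above.
AVAILABLE = {
    "eukaryote": "GENERanno-eukaryote-0.5b-base",
    "prokaryote": "GENERanno-prokaryote-0.5b-base",
}


def repo_id(category: str) -> str:
    """Return the assumed Hub id of the available base checkpoint for a category."""
    try:
        return f"{HF_ORG}/{AVAILABLE[category.lower()]}"
    except KeyError:
        raise ValueError(f"no available checkpoint for {category!r}") from None


def load(category: str):
    """Load tokenizer and model; downloads the weights on first use."""
    # Imported lazily so that repo_id() works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    name = repo_id(category)
    # Assumption: the checkpoints ship custom modelling code, hence trust_remote_code.
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True)
    return tokenizer, model
```

Calling `load("prokaryote")` would return the tokenizer and model for the prokaryotic base checkpoint, if the assumptions above hold.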

📈 Benchmark Performance

Coding DNA Sequence (CDS) Annotation — `GENERanno-prokaryote-0.5b-cds-annotator-preview`

The detailed annotation results are provided here.

Sequence Understanding (Classification/Regression) — `GENERanno-prokaryote-0.5b-base`

Sequence Understanding (Classification/Regression) — `GENERanno-eukaryote-0.5b-base`

🎯 Quick Start

Dependencies

Clone this repo, cd into it

git clone https://github.com/GenerTeam/GENERanno.git
cd GENERanno

Install requirements with Python 3.10

pip install -r requirements.txt

If your network cannot access huggingface.co normally, we recommend using the following mirror:
export HF_ENDPOINT=https://hf-mirror.com
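If you drive the scripts from Python rather than from a shell, the same setting can be applied in-process. This is a minimal sketch. It assumes, to the best of our knowledge, that `huggingface_hub` reads `HF_ENDPOINT` when it is first imported, so the variable must be set before that import.

```python
import os

# Keep any endpoint the user has already exported; otherwise fall back to the mirror.
# This must run before `huggingface_hub` or `transformers` is imported.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

print("Using Hugging Face endpoint:", os.environ["HF_ENDPOINT"])
```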

Downstream

Coding DNA Sequence (CDS) Annotation

To run the coding sequence annotation task on our CDS annotation dataset, use one of the following commands:

# Using single GPU
python src/tasks/downstream/cds_annotation.py

# Using multiple GPUs (Data Parallel)
python src/tasks/downstream/cds_annotation.py --dp_size ${NUM_GPUS}

Sequence Understanding (Classification/Regression)

To run the sequence understanding task on Gener Tasks, Prokaryotic Gener Tasks, NT Tasks, Genomic Benchmarks, or DeepSTARR Enhancer Activity, choose the arguments for your benchmark:

Gener Tasks / Prokaryotic Gener Tasks
- --dataset_name GenerTeam/gener-tasks or --dataset_name GenerTeam/prokaryotic-gener-tasks
- --subset_name gene_classification or --subset_name taxonomic_classification or ...
NT Tasks
- --dataset_name InstaDeepAI/nucleotide_transformer_downstream_tasks_revised
- --subset_name H2AFZ or --subset_name H3K27ac or ...
Genomic Benchmarks
- --dataset_name katarinagresova/Genomic_Benchmarks_demo_human_or_worm or --dataset_name katarinagresova/Genomic_Benchmarks_human_ocr_ensembl or ...
DeepSTARR Enhancer Activity
- --dataset_name GenerTeam/DeepSTARR-enhancer-activity
- --problem_type regression

Then pass the chosen arguments to one of the following commands:

# Using single GPU
python src/tasks/downstream/sequence_understanding.py \
    --model_name GenerTeam/GENERator-eukaryote-1.2b-base \
    --dataset_name ${DATASET_NAME} \
    --subset_name ${SUBSET_NAME} \
    --batch_size ${BATCH_SIZE} \
    --problem_type ${PROBLEM_TYPE} \
    --main_metrics ${MAIN_METRICS}

# Using multiple GPUs on single node (DDP)
torchrun --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    src/tasks/downstream/sequence_understanding.py

# Using multiple GPUs on multiple nodes (DDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py

# Using DeepSpeed or Full Sharded Data Parallel (FSDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py \
    --distributed_type deepspeed # or fsdp
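The argument sets and launch modes above can be combined programmatically. The following is a rough sketch of a command builder, not part of the repository. `PRESETS` and `build_command` are hypothetical names, and it assumes the task arguments are appended after the script path in the `torchrun` variants, which is how `torchrun` normally forwards script arguments.

```python
from typing import List, Optional

SCRIPT = "src/tasks/downstream/sequence_understanding.py"

# Argument sets copied from the list above; the subset is chosen by the caller.
PRESETS = {
    "gener": ["--dataset_name", "GenerTeam/gener-tasks"],
    "prokaryotic-gener": ["--dataset_name", "GenerTeam/prokaryotic-gener-tasks"],
    "nt": ["--dataset_name",
           "InstaDeepAI/nucleotide_transformer_downstream_tasks_revised"],
    "deepstarr": ["--dataset_name", "GenerTeam/DeepSTARR-enhancer-activity",
                  "--problem_type", "regression"],
}


def build_command(preset: str,
                  subset: Optional[str] = None,
                  num_gpus: int = 1,
                  num_nodes: int = 1,
                  endpoint: Optional[str] = None,
                  distributed_type: Optional[str] = None) -> List[str]:
    """Return an argv list for a single-GPU or torchrun launch."""
    task_args = list(PRESETS[preset])
    if subset is not None:
        task_args += ["--subset_name", subset]
    if distributed_type is not None:  # "deepspeed" or "fsdp"
        task_args += ["--distributed_type", distributed_type]

    # Plain single-GPU run: no launcher needed.
    if num_gpus == 1 and num_nodes == 1 and distributed_type is None:
        return ["python", SCRIPT, *task_args]

    launcher = ["torchrun", f"--nnodes={num_nodes}",
                f"--nproc_per_node={num_gpus}", "--rdzv_backend=c10d"]
    if endpoint is not None:
        launcher.append(f"--rdzv_endpoint={endpoint}")
    elif num_nodes > 1:
        raise ValueError("multi-node runs need endpoint='MASTER_ADDR:MASTER_PORT'")
    return [*launcher, SCRIPT, *task_args]
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.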

📚 Datasets

📜 Citation

@article{li2025generanno,
	author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
	title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
	elocation-id = {2025.06.04.656517},
	year = {2025},
	doi = {10.1101/2025.06.04.656517},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
	journal = {bioRxiv}
}
