ViT-2SPN

Vision Transformer-Based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification

Mohammadreza Saraei¹, Dr. Igor Kozak (Website), Dr. Eung-Joo Lee (Website)

Code [GitHub] | Data [MedMNISTv2] | Preprint [ArXiv] | Publication [Under Review in MIDL 2025]

SSP Approach

Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, the development of OCT-based diagnostic tools faces challenges such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has advanced OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework featuring a dual-stream network, feature concatenation, and a pretraining mechanism designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining, and Supervised Fine-Tuning. The pretraining phase leverages the unlabeled OCTMNIST dataset with data augmentation to create dual-augmented views, enabling effective feature learning through a ViT backbone and a contrastive loss. Fine-tuning is then performed on a small annotated subset of OCTMNIST using cross-validation. ViT-2SPN-T achieves a mean AUC of 0.936, an accuracy of 0.80, a precision of 0.81, a recall of 0.80, and an F1-score of 0.79, outperforming baseline self-supervised learning-based methods. These results highlight the robustness and clinical potential of ViT-2SPN in retinal OCT classification.
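
To make the dual-stream pretraining idea concrete, the snippet below is a minimal PyTorch sketch: two augmented views of the same OCT image pass through a shared ImageNet-pretrained ViT backbone and are compared with a contrastive loss. The backbone name (timm's vit_tiny_patch16_224), the projection-head sizes, and the NT-Xent-style loss are illustrative assumptions, and the feature-concatenation step is omitted for brevity; the repository's ssp_vit2spn.py defines the actual model.

# Minimal sketch of dual-stream self-supervised pretraining with a ViT backbone.
# Backbone name, projection-head sizes, and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm

class DualStreamSSP(nn.Module):
    """Shared ViT backbone applied to two augmented views of the same image."""

    def __init__(self, backbone_name="vit_tiny_patch16_224", proj_dim=128):
        super().__init__()
        # ImageNet-pretrained ViT; num_classes=0 makes the model return pooled features.
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        # Projection head mapping backbone features into the contrastive space.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, view1, view2):
        # Both augmented views pass through the same backbone (dual stream).
        z1 = self.projector(self.backbone(view1))
        z2 = self.projector(self.backbone(view2))
        return F.normalize(z1, dim=1), F.normalize(z2, dim=1)

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss: the two views of the same image are positives."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)              # (2n, d) normalized embeddings
    sim = z @ z.t() / temperature               # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))           # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

In a training loop, loss = contrastive_loss(*model(view1, view2)) can then drive a standard optimizer over the unlabeled images.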

Image Samples (Classes: Normal, Drusen, DME, CNV)

ViT-2SPN Architecture

Experimental Setup

During the self-supervised pretraining phase, the model uses the unlabeled OCTMNIST dataset, which consists of 97k training samples. Training is conducted with a mini-batch size of 128, a learning rate of 0.0001, and a momentum rate of 0.999, over a total of 50 epochs. For this phase, the model employs a ViT backbone pretrained on the ImageNet dataset.

In the fine-tuning phase, the model uses 5k labeled samples from the OCTMNIST dataset with a 10-fold cross-validation strategy. This strategy was chosen to promote a more stable and generalized learning process while maximizing the utility of the limited labeled data. Each fold consists of 4.5k training samples and 0.5k validation samples, with an additional 0.5k samples reserved for testing; reserving a fixed test set ensures consistency across folds while keeping the test data independent, allowing a robust yet computationally feasible assessment of the model's generalization performance. Fine-tuning uses a batch size of 16, keeps the same learning rate as the pretraining phase, applies a dropout rate of 0.5, and also spans 50 epochs.
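
For reference, the settings above can be collected into a small configuration sketch; the dictionary keys, variable names, and the use of scikit-learn's StratifiedKFold are assumptions for illustration rather than the repository's exact code.

from sklearn.model_selection import StratifiedKFold

# Self-supervised pretraining on the unlabeled OCTMNIST images (~97k samples).
PRETRAIN_CFG = {
    "batch_size": 128,
    "learning_rate": 1e-4,
    "momentum": 0.999,
    "epochs": 50,
}

# Supervised fine-tuning on 5k labeled OCTMNIST samples.
FINETUNE_CFG = {
    "batch_size": 16,
    "learning_rate": 1e-4,   # carried over from pretraining
    "dropout": 0.5,
    "epochs": 50,
    "num_folds": 10,         # 4.5k train / 0.5k validation per fold, 0.5k held-out test
}

def cross_validation_folds(features, labels, seed=0):
    """Yield (train_idx, val_idx) index pairs for the 10-fold protocol."""
    skf = StratifiedKFold(n_splits=FINETUNE_CFG["num_folds"], shuffle=True, random_state=seed)
    yield from skf.split(features, labels)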

Results

Performance and Efficiency Assessment on Imbalanced OCTMNIST Dataset

Performance and Efficiency Assessment on Imbalanced OCT2017v2 Dataset

Performance Improvement

Commands

  • ssp_vit2spn.py: Trains the self-supervised model using unlabeled images to extract meaningful features.
  • finetune_vit2spn.py: Fine-tunes the pretrained model for classification tasks using labeled data.
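
A typical workflow runs the two scripts in sequence: python ssp_vit2spn.py for self-supervised pretraining on the unlabeled images, followed by python finetune_vit2spn.py for supervised fine-tuning on the labeled subset. Any command-line arguments (dataset paths, checkpoint locations, and so on) are assumed to be handled inside the scripts; check each script for its actual options.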

Citation (BibTeX)

@article{saraei2025vit,
  title={ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification},
  author={Saraei, Mohammadreza and Kozak, Igor and Lee, Eung-Joo},
  journal={arXiv preprint arXiv:2501.17260},
  year={2025}
}

Footnotes

1. Please feel free to reach out if you have any questions: [email protected]