## Abstract

Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further identify that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.
We provide code and data for:
- Data Curation: Based on the code from https://github.com/facebookresearch/ssl-data-curation, with our added configuration files in `orig-ssl-data-curation/configs`. We ran the clustering on 350M tiles from TCGA and GTEx whole-slide images (WSIs) embedded with the foundation model UNI. However, due to the large size of the full tile embeddings (~670GB), we are not able to provide the embeddings file. Instead, we provide a pipeline for tile extraction and UNI feature embedding in `tile_extraction_embedding_generation.py`, which requires the downloaded TCGA and GTEx WSIs. It consists of the following steps (a minimal sketch follows after this list):
  - Tile extraction: The slides can be downloaded from the TCGA and GTEx websites. The slide IDs and tile (x, y) coordinates are specified in `clustering_results/clustering_{t1,t2}.csv`. Tile coordinates are given at the highest available pyramid level of the slide; tiles are 224px x 224px at 20x magnification (= 112um x 112um).
  - Feature embedding: Generate the feature embeddings using the UNI model on the extracted tiles and combine them into one large NumPy file. With this file available, you can run the data curation by following the instructions from the Meta repository for k-means clustering and sampling from the tree.
- Clustering results: Our clustering results are available as CSV files at our HuggingFace repo. Structure of `clustering_results/clustering_{t1,t2}.csv` (a loading example follows after this list):

  | slide_id | tile_x | tile_y | level_1 | level_2 | level_3 | level_4 |
  |---|---|---|---|---|---|---|
  | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 32406 | 10621 | 1301309 | 17404 | 2 | 24 |
  | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 32850 | 10621 | 3481104 | 17557 | 343 | 8 |
  | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 30630 | 11064 | 2269415 | 34147 | 2 | 24 |
  | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 31074 | 11064 | 3352403 | 3486 | 2 | 24 |
  | TCGA-22-1017-01Z-00-DX1.9562FE79-A261-42D3-B39... | 31519 | 11064 | 3352388 | 11187 | 2 | 24 |

  - `slide_id`: Unique identifier for the slide image.
  - `tile_x`, `tile_y`: Coordinates of the tile within the slide at level 0, the highest pyramid level. Tiles are 224px x 224px at 20x magnification (= 112um x 112um).
  - `level_1` to `level_4`: Hierarchical cluster labels the tile is associated with.
- Metadata-based sampling: metadata_based_sampling.
- Batch stratification based on curated trees: batch_stratification (an illustrative sampler sketch follows after this list).
- Vision self-supervised training based on DINOv2: dinov2.
- Visualization Tool: visualization_tool.
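For orientation, here is a minimal sketch of the tile-extraction and feature-embedding steps described above. It assumes OpenSlide-readable slides stored under a hypothetical `wsis/` directory and the publicly released UNI weights on the Hugging Face Hub (access approval required); it mirrors `tile_extraction_embedding_generation.py` only in spirit, not in its exact interface.

```python
import numpy as np
import pandas as pd
import openslide
import timm
import torch
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# UNI (ViT-L/16) via the Hugging Face Hub; requires prior access approval
# and a logged-in huggingface_hub session.
model = timm.create_model(
    "hf-hub:MahmoodLab/UNI", pretrained=True,
    init_values=1e-5, dynamic_img_size=True,
).to(device).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

tiles = pd.read_csv("clustering_results/clustering_t1.csv")
embeddings = []
with torch.inference_mode():
    for slide_id, group in tiles.groupby("slide_id", sort=False):
        # Hypothetical path layout; adjust to wherever the WSIs were downloaded.
        slide = openslide.OpenSlide(f"wsis/{slide_id}.svs")
        for row in group.itertuples():
            # (tile_x, tile_y) are level-0 coordinates. This assumes slides
            # scanned at 20x; for 40x slides, read 448px and resize to 224px.
            tile = slide.read_region((row.tile_x, row.tile_y), 0, (224, 224)).convert("RGB")
            feat = model(preprocess(tile).unsqueeze(0).to(device))  # (1, 1024)
            embeddings.append(feat.squeeze(0).cpu().numpy())
        slide.close()

# One large numpy file, as expected by the ssl-data-curation pipeline.
np.save("uni_embeddings.npy", np.stack(embeddings))
```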
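The clustering CSVs can also be used directly for cluster-balanced subsampling. A small example with pandas, where the per-cluster budget `n_per_cluster` is an illustrative parameter:

```python
import pandas as pd

df = pd.read_csv("clustering_results/clustering_t1.csv")

# Draw up to n_per_cluster tiles from each level_4 cluster to obtain a
# subset that is balanced across the finest hierarchy level.
n_per_cluster = 100
balanced = df.groupby("level_4", group_keys=False).apply(
    lambda g: g.sample(n=min(len(g), n_per_cluster), random_state=0)
)
print(balanced[["slide_id", "tile_x", "tile_y", "level_4"]].head())
```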
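The actual batch sampling strategies live in `batch_stratification`. Purely as an illustration of the idea of stratifying batches over curated cluster labels (not necessarily the strategy implemented there), a hypothetical PyTorch batch sampler might look like this:

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class ClusterStratifiedBatchSampler(Sampler):
    """Illustrative sketch: each batch round-robins over clusters so that
    rare clusters are not drowned out by frequent ones."""

    def __init__(self, cluster_labels, batch_size, seed=0):
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self.by_cluster = defaultdict(list)
        for idx, label in enumerate(cluster_labels):
            self.by_cluster[label].append(idx)
        self.num_samples = len(cluster_labels)

    def __iter__(self):
        # Shuffle each cluster's indices, then take one tile per cluster
        # in turn until all pools are exhausted.
        pools = {c: self.rng.sample(v, len(v)) for c, v in self.by_cluster.items()}
        batch = []
        while pools:
            for c in list(pools):
                batch.append(pools[c].pop())
                if not pools[c]:
                    del pools[c]
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
        if batch:
            yield batch

    def __len__(self):
        return (self.num_samples + self.batch_size - 1) // self.batch_size
```

Such a sampler would be passed to a `DataLoader` via its `batch_sampler` argument, e.g. with `df["level_4"].tolist()` as the cluster labels.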
Please cite our publication if you use the provided code or data.
@misc{chen2025revisitingautomaticdatacuration,
title={Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology},
author={Boqi Chen and Cédric Vincent-Cuaz and Lydia A. Schoenpflug and Manuel Madeira and Lisa Fournier and Vaishnavi Subramanian and Sonali Andani and Samuel Ruiperez-Campillo and Julia E. Vogt and Raphaëlle Luisier and Dorina Thanou and Viktor H. Koelzer and Pascal Frossard and Gabriele Campanella and Gunnar Rätsch},
year={2025},
eprint={2503.18709},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.18709},
}
Our work builds on the following public repositories:
@article{vo2024automatic,
title={Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach},
author={Vo, Huy V. and Khalidov, Vasil and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Smetanin, Nikita and Szafraniec, Marc and Touvron, Hugo and Couprie, Camille and Oquab, Maxime and Joulin, Armand and Jégou, Hervé and Labatut, Patrick and Bojanowski, Piotr},
journal={arXiv:2405.15613},
year={2024},
}
@article{QuirosCoudray2024,
author = {Claudio Quiros, Adalberto and Coudray, Nicolas and Yeaton, Anna and Yang, Xinyu and Liu, Bojing and Le, Hortense and Chiriboga, Luis and Karimkhan, Afreen and Narula, Navneet and Moore, David A. and Park, Christopher Y. and Pass, Harvey and Moreira, Andre L. and Le Quesne, John and Tsirigos, Aristotelis and Yuan, Ke},
journal = {Nature Communications},
number = {1},
pages = {4596},
title = {Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unannotated pathology slides},
volume = {15},
year = {2024}
}
@misc{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
journal={arXiv:2304.07193},
year={2023}
}
@misc{darcet2023vitneedreg,
title={Vision Transformers Need Registers},
author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
journal={arXiv:2309.16588},
year={2023}
}
- UNI:
@article{chen2024uni,
title={Towards a General-Purpose Foundation Model for Computational Pathology},
author={Chen, Richard J and Ding, Tong and Lu, Ming Y and Williamson, Drew FK and Jaume, Guillaume and Chen, Bowen and Zhang, Andrew and Shao, Daniel and Song, Andrew H and Shaban, Muhammad and others},
journal={Nature Medicine},
publisher={Nature Publishing Group},
year={2024}
}