MrGiovanni/Touchstone

Touchstone Benchmark

Participate - Touchstone 1.0 Participate - Touchstone 2.0

We present Touchstone, a large-scale medical segmentation benchmark based on 5,195 annotated CT volumes from 76 hospitals for training and 6,933 CT volumes from 8 additional hospitals for testing. We invite AI inventors to train their models on AbdomenAtlas, and we independently evaluate their algorithms. We have already collaborated with 14 influential research teams, and we are still accepting new submissions.

Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Pedro R. A. S. Bassi1, Wenxuan Li1, Yucheng Tang2, Fabian Isensee3, ..., Alan Yuille1, Zongwei Zhou1
1Johns Hopkins University, 2NVIDIA, 3DKFZ
NeurIPS 2024
JHU CS News

YouTube

Touchstone 1.0 Leaderboard

rank model organization average DSC paper github
🏆 MedNeXt DKFZ 89.2 arXiv GitHub stars
🏆 MedFormer Rutgers 89.0 arXiv GitHub stars
3 STU-Net-B Shanghai AI Lab 89.0 arXiv GitHub stars
4 nnU-Net U-Net DKFZ 88.9 arXiv GitHub stars
5 nnU-Net ResEncL DKFZ 88.8 arXiv GitHub stars
6 UniSeg NPU 88.8 arXiv GitHub stars
7 Diff-UNet HKUST 88.5 arXiv GitHub stars
8 LHU-Net UR 88.0 arXiv GitHub stars
9 NexToU HIT 87.8 arXiv GitHub stars
10 SegVol BAAI 87.1 arXiv GitHub stars
11 U-Net & CLIP CityU 87.1 arXiv GitHub stars
12 Swin UNETR & CLIP CityU 86.7 arXiv GitHub stars
13 UNesT NVIDIA 84.9 arXiv GitHub stars
14 Swin UNETR NVIDIA 84.8 arXiv GitHub stars
15 UNETR NVIDIA 83.3 arXiv GitHub stars
16 UCTransNet Northeastern University 81.1 arXiv GitHub stars
17 SAM-Adapter Duke 73.4 arXiv GitHub stars
Aorta - NexToU & UCTransNet 🏆
rank model organization DSC paper github
🏆 UCTransNet Northeastern University 86.5 arXiv GitHub stars
🏆 NexToU HIT 86.4 arXiv GitHub stars
3 MedNeXt DKFZ 83.1 arXiv GitHub stars
4 nnU-Net U-Net DKFZ 82.8 arXiv GitHub stars
5 UniSeg NPU 82.3 arXiv GitHub stars
6 MedFormer Rutgers 82.1 arXiv GitHub stars
7 STU-Net-B Shanghai AI Lab 82.1 arXiv GitHub stars
8 nnU-Net ResEncL DKFZ 81.4 arXiv GitHub stars
9 Diff-UNet HKUST 81.2 arXiv GitHub stars
10 SegVol BAAI 80.2 arXiv GitHub stars
11 LHU-Net UR 79.5 arXiv GitHub stars
12 Swin UNETR & CLIP CityU 78.1 arXiv GitHub stars
13 UNesT NVIDIA 77.7 arXiv GitHub stars
14 Swin UNETR NVIDIA 77.2 arXiv GitHub stars
15 U-Net & CLIP CityU 77.1 arXiv GitHub stars
16 UNETR NVIDIA 76.5 arXiv GitHub stars
17 SAM-Adapter Duke 62.8 arXiv GitHub stars
Gallbladder - STU-Net-B & MedFormer 🏆
rank model organization DSC paper github
🏆 STU-Net-B Shanghai AI Lab 85.5 arXiv GitHub stars
🏆 MedFormer Rutgers 85.3 arXiv GitHub stars
3 MedNeXt DKFZ 85.3 arXiv GitHub stars
4 nnU-Net ResEncL DKFZ 84.9 arXiv GitHub stars
5 nnU-Net U-Net DKFZ 84.7 arXiv GitHub stars
6 UniSeg NPU 84.7 arXiv GitHub stars
7 LHU-Net UR 83.9 arXiv GitHub stars
8 Diff-UNet HKUST 83.8 arXiv GitHub stars
9 NexToU HIT 82.3 arXiv GitHub stars
10 U-Net & CLIP CityU 82.1 arXiv GitHub stars
11 Swin UNETR & CLIP CityU 80.2 arXiv GitHub stars
12 SegVol BAAI 79.3 arXiv GitHub stars
13 UCTransNet Northeastern University 77.8 arXiv GitHub stars
14 Swin UNETR NVIDIA 76.9 arXiv GitHub stars
15 UNesT NVIDIA 75.1 arXiv GitHub stars
16 UNETR NVIDIA 74.7 arXiv GitHub stars
17 SAM-Adapter Duke 49.4 arXiv GitHub stars
KidneyL - Diff-UNet 🏆
rank model organization DSC paper github
🏆 Diff-UNet HKUST 91.9 arXiv GitHub stars
2 MedFormer Rutgers 91.9 arXiv GitHub stars
3 nnU-Net ResEncL DKFZ 91.9 arXiv GitHub stars
4 STU-Net-B Shanghai AI Lab 91.9 arXiv GitHub stars
5 nnU-Net U-Net DKFZ 91.9 arXiv GitHub stars
6 LHU-Net UR 91.8 arXiv GitHub stars
7 MedNeXt DKFZ 91.8 arXiv GitHub stars
8 SegVol BAAI 91.8 arXiv GitHub stars
9 UniSeg NPU 91.5 arXiv GitHub stars
10 U-Net & CLIP CityU 91.1 arXiv GitHub stars
11 Swin UNETR & CLIP CityU 91.0 arXiv GitHub stars
12 UNesT NVIDIA 90.1 arXiv GitHub stars
13 Swin UNETR NVIDIA 89.7 arXiv GitHub stars
14 NexToU HIT 89.6 arXiv GitHub stars
15 UNETR NVIDIA 89.2 arXiv GitHub stars
16 SAM-Adapter Duke 87.3 arXiv GitHub stars
17 UCTransNet Northeastern University 86.9 arXiv GitHub stars
KidneyR - Diff-UNet 🏆
rank model organization DSC paper github
🏆 Diff-UNet HKUST 92.8 arXiv GitHub stars
2 MedFormer Rutgers 92.8 arXiv GitHub stars
3 nnU-Net U-Net DKFZ 92.7 arXiv GitHub stars
4 MedNeXt DKFZ 92.6 arXiv GitHub stars
5 nnU-Net ResEncL DKFZ 92.6 arXiv GitHub stars
6 LHU-Net UR 92.5 arXiv GitHub stars
7 STU-Net-B Shanghai AI Lab 92.5 arXiv GitHub stars
8 SegVol BAAI 92.5 arXiv GitHub stars
9 UniSeg NPU 92.2 arXiv GitHub stars
10 U-Net & CLIP CityU 91.9 arXiv GitHub stars
11 Swin UNETR & CLIP CityU 91.7 arXiv GitHub stars
12 UNesT NVIDIA 90.9 arXiv GitHub stars
13 SAM-Adapter Duke 90.4 arXiv GitHub stars
14 NexToU HIT 90.1 arXiv GitHub stars
15 UNETR NVIDIA 90.1 arXiv GitHub stars
16 Swin UNETR NVIDIA 89.8 arXiv GitHub stars
17 UCTransNet Northeastern University 86.5 arXiv GitHub stars
Liver - MedFormer 🏆
rank model organization DSC paper github
🏆 MedFormer Rutgers 96.4 arXiv GitHub stars
2 MedNeXt DKFZ 96.3 arXiv GitHub stars
3 nnU-Net ResEncL DKFZ 96.3 arXiv GitHub stars
4 LHU-Net UR 96.2 arXiv GitHub stars
5 nnU-Net U-Net DKFZ 96.2 arXiv GitHub stars
6 Diff-UNet HKUST 96.2 arXiv GitHub stars
7 STU-Net-B Shanghai AI Lab 96.2 arXiv GitHub stars
8 UniSeg NPU 96.1 arXiv GitHub stars
9 U-Net & CLIP CityU 96.0 arXiv GitHub stars
10 SegVol BAAI 96.0 arXiv GitHub stars
11 Swin UNETR & CLIP CityU 95.8 arXiv GitHub stars
12 NexToU HIT 95.7 arXiv GitHub stars
13 SAM-Adapter Duke 94.1 arXiv GitHub stars
14 UNesT NVIDIA 95.3 arXiv GitHub stars
15 Swin UNETR NVIDIA 95.2 arXiv GitHub stars
16 UNETR NVIDIA 95.0 arXiv GitHub stars
17 UCTransNet Northeastern University 93.6 arXiv GitHub stars
Pancreas - MedNeXt 🏆
rank model organization DSC paper github
🏆 MedNeXt DKFZ 83.3 arXiv GitHub stars
2 STU-Net-B Shanghai AI Lab 83.2 arXiv GitHub stars
3 MedFormer Rutgers 83.1 arXiv GitHub stars
4 nnU-Net ResEncL DKFZ 82.9 arXiv GitHub stars
5 UniSeg NPU 82.7 arXiv GitHub stars
6 nnU-Net U-Net DKFZ 82.3 arXiv GitHub stars
7 Diff-UNet HKUST 81.9 arXiv GitHub stars
8 LHU-Net UR 81.0 arXiv GitHub stars
9 U-Net & CLIP CityU 80.8 arXiv GitHub stars
10 Swin UNETR & CLIP CityU 80.2 arXiv GitHub stars
11 NexToU HIT 80.2 arXiv GitHub stars
12 SegVol BAAI 79.1 arXiv GitHub stars
13 UNesT NVIDIA 76.2 arXiv GitHub stars
14 Swin UNETR NVIDIA 75.6 arXiv GitHub stars
15 UNETR NVIDIA 72.3 arXiv GitHub stars
16 UCTransNet Northeastern University 59.0 arXiv GitHub stars
17 SAM-Adapter Duke 50.2 arXiv GitHub stars
Postcava - STU-Net-B & MedNeXt 🏆
rank model organization DSC paper github
🏆 STU-Net-B Shanghai AI Lab 81.3 arXiv GitHub stars
🏆 MedNeXt DKFZ 81.3 arXiv GitHub stars
3 UniSeg NPU 81.2 arXiv GitHub stars
4 nnU-Net U-Net DKFZ 81.0 arXiv GitHub stars
5 Diff-UNet HKUST 80.8 arXiv GitHub stars
6 MedFormer Rutgers 80.7 arXiv GitHub stars
7 nnU-Net ResEncL DKFZ 80.5 arXiv GitHub stars
8 LHU-Net UR 79.4 arXiv GitHub stars
9 U-Net & CLIP CityU 78.5 arXiv GitHub stars
10 NexToU HIT 78.1 arXiv GitHub stars
11 SegVol BAAI 77.8 arXiv GitHub stars
12 Swin UNETR & CLIP CityU 76.8 arXiv GitHub stars
13 Swin UNETR NVIDIA 75.4 arXiv GitHub stars
14 UNesT NVIDIA 74.4 arXiv GitHub stars
15 UNETR NVIDIA 71.5 arXiv GitHub stars
16 UCTransNet Northeastern University 68.1 arXiv GitHub stars
17 SAM-Adapter Duke 48.0 arXiv GitHub stars
Spleen - MedFormer 🏆
rank model organization DSC paper github
🏆 MedFormer Rutgers 95.5 arXiv GitHub stars
2 nnU-Net ResEncL DKFZ 95.2 arXiv GitHub stars
3 MedNeXt DKFZ 95.2 arXiv GitHub stars
4 nnU-Net U-Net DKFZ 95.1 arXiv GitHub stars
5 STU-Net-B Shanghai AI Lab 95.1 arXiv GitHub stars
6 Diff-UNet HKUST 95.0 arXiv GitHub stars
7 LHU-Net UR 94.9 arXiv GitHub stars
8 UniSeg NPU 94.9 arXiv GitHub stars
9 SegVol BAAI 94.5 arXiv GitHub stars
10 NexToU HIT 94.7 arXiv GitHub stars
11 U-Net & CLIP CityU 94.3 arXiv GitHub stars
12 Swin UNETR & CLIP CityU 94.1 arXiv GitHub stars
13 UNesT NVIDIA 93.2 arXiv GitHub stars
14 Swin UNETR NVIDIA 92.7 arXiv GitHub stars
15 UNETR NVIDIA 91.7 arXiv GitHub stars
16 SAM-Adapter Duke 90.5 arXiv GitHub stars
17 UCTransNet Northeastern University 90.2 arXiv GitHub stars
Stomach - STU-Net-B 🏆
rank model organization DSC paper github
🏆 STU-Net-B Shanghai AI Lab 93.5 arXiv GitHub stars
2 MedNeXt DKFZ 93.5 arXiv GitHub stars
3 nnU-Net ResEncL DKFZ 93.4 arXiv GitHub stars
4 MedFormer Rutgers 93.4 arXiv GitHub stars
5 UniSeg NPU 93.3 arXiv GitHub stars
6 nnU-Net U-Net DKFZ 93.3 arXiv GitHub stars
7 Diff-UNet HKUST 93.1 arXiv GitHub stars
8 LHU-Net UR 93.0 arXiv GitHub stars
9 NexToU HIT 92.7 arXiv GitHub stars
10 SegVol BAAI 92.5 arXiv GitHub stars
11 U-Net & CLIP CityU 92.4 arXiv GitHub stars
12 Swin UNETR & CLIP CityU 92.2 arXiv GitHub stars
13 UNesT NVIDIA 90.9 arXiv GitHub stars
14 Swin UNETR NVIDIA 90.5 arXiv GitHub stars
15 UNETR NVIDIA 88.8 arXiv GitHub stars
16 SAM-Adapter Duke 88.0 arXiv GitHub stars
17 UCTransNet Northeastern University 81.9 arXiv GitHub stars
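
The "average DSC" column in the main leaderboard appears consistent with an unweighted mean of the nine per-structure DSC values listed in the tables above. The sketch below checks this for MedNeXt, using scores copied from those tables. The averaging scheme is an assumption inferred from the numbers, and the names `mednext_dsc` and `average_dsc` are illustrative, not part of the repository.

```python
# Per-structure DSC values for MedNeXt, copied from the leaderboard tables.
mednext_dsc = {
    "aorta": 83.1,
    "gallbladder": 85.3,
    "kidney_left": 91.8,
    "kidney_right": 92.6,
    "liver": 96.3,
    "pancreas": 83.3,
    "postcava": 81.3,
    "spleen": 95.2,
    "stomach": 93.5,
}


def average_dsc(per_structure):
    """Unweighted mean over structures (assumed averaging scheme)."""
    return sum(per_structure.values()) / len(per_structure)


# Rounds to 89.2, the value reported for MedNeXt in the main leaderboard.
print(round(average_dsc(mednext_dsc), 1))
```

Because many models differ by only 0.1 to 0.2 points on this average, the per-structure tables are often more informative than the overall rank.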

Touchstone 1.0 Dataset

Training set

Test set

Figure 1. Metadata distribution in the test set.

Touchstone 1.0 Model

Note

We are releasing the trained AI models evaluated in Touchstone right here. Stay tuned!

rank model average DSC parameter infer. speed download
🏆 MedNeXt 89.2 61.8M ★☆☆☆☆
🏆 MedFormer 89.0 38.5M ★★★☆☆
3 STU-Net-B 89.0 58.3M ★★☆☆☆ checkpoint
4 nnU-Net U-Net 88.9 102.0M ★★★★☆ checkpoint
5 nnU-Net ResEncL 88.8 102.0M ★★★★☆ checkpoint
6 UniSeg 88.8 31.0M ☆☆☆☆☆
7 Diff-UNet 88.5 434.0M ★★★☆☆
8 LHU-Net 88.0 8.6M ★★★★★ checkpoint
9 NexToU 87.8 81.9M ★★★★☆ checkpoint
10 SegVol 87.1 181.0M ★★★★☆ checkpoint
11 U-Net & CLIP 87.1 19.1M ★★★☆☆
12 Swin UNETR & CLIP 86.7 62.2M ★★★☆☆
13 UNesT 84.9 87.2M ★★★★★
14 Swin UNETR 84.8 72.8M ★★★★★
15 UNETR 83.3 101.8M ★★★★★
16 UCTransNet 81.1 68.0M ★★★★☆
17 SAM-Adapter 73.4 11.6M ★★★★☆ checkpoint

Evaluation Code

1. Clone the GitHub repository

git clone https://github.com/MrGiovanni/Touchstone
cd Touchstone

2. Create environments

conda env create -f environment.yml
conda activate touchstone
python -m ipykernel install --user --name touchstone --display-name "touchstone"

3. Reproduce analysis figures in our paper

Figure 1 - Dataset statistics:

cd notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone TotalSegmentatorMetadata.ipynb
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone DAPAtlasMetadata.ipynb
#results: plots are saved inside Touchstone/outputs/plotsTotalSegmentator/ and Touchstone/outputs/plotsDAPAtlas/

Figure 2 - Potential confounders significantly impact AI performance:

cd ../plot
python AggregatedBoxplot.py --stats
#results: Touchstone/outputs/summary_groups.pdf

If you are including a new segmentation model in the evaluation, organize its results following the structure in the CSV files inside the folders totalsegmentator_results and dapatlas_results (see below). Also, include its name in the model_ranking list in plot/PlotGroup.py.

File structure
totalsegmentator_results
    ├── Diff-UNet
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── LHU-Net
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── MedNeXt
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── ...
dapatlas_results
    ├── Diff-UNet
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── LHU-Net
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── MedNeXt
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── ...
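
Before running the plotting scripts on a newly added model, it can help to confirm that its folder matches the layout above. The following is a minimal sketch, assuming only what the tree shows (one folder per model, each containing dsc.csv and nsd.csv). The helper `missing_result_files` is hypothetical and not part of the repository, and it does not check the CSV contents.

```python
from pathlib import Path

# Per-model files shown in the file structure above.
REQUIRED_FILES = ("dsc.csv", "nsd.csv")


def missing_result_files(results_root):
    """Map each model folder to the required CSV files it lacks."""
    problems = {}
    for model_dir in sorted(Path(results_root).iterdir()):
        if not model_dir.is_dir():
            continue  # ignore stray files next to the model folders
        missing = [name for name in REQUIRED_FILES
                   if not (model_dir / name).is_file()]
        if missing:
            problems[model_dir.name] = missing
    return problems


if __name__ == "__main__":
    # Run from the repository root; folders that do not exist are skipped.
    for root in ("totalsegmentator_results", "dapatlas_results"):
        if Path(root).is_dir():
            print(root, missing_result_files(root))
```

An empty dictionary for both result folders means every model folder contains both files.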

Appendix D.2.3 - Statistical significance maps:

#statistical significance maps (Appendix D.2.3):
python PlotAllSignificanceMaps.py
python PlotAllSignificanceMaps.py --organs second_half
python PlotAllSignificanceMaps.py --nsd
python PlotAllSignificanceMaps.py --organs second_half --nsd
#results: Touchstone/outputs/heatmaps

Appendix D.4 and D.5 - Box-plots for per-group and per-organ results, with statistical tests:

cd ../notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone GroupAnalysis.ipynb
#results: Touchstone/outputs/box_plots

4. Custom Analysis

Define custom demographic groups (e.g., Hispanic men aged 20-25) and compare AI performance on them

The CSV result files in totalsegmentator_results/ and dapatlas_results/ contain per-sample dsc and nsd scores. Rich metadata for each of those samples (sex, age, scanner, diagnosis, ...) is available in metaTotalSeg.csv and 'Clinical Metadata FDG PET_CT Lesions.csv', for TotalSegmentator and DAP Atlas, respectively. The code in TotalSegmentatorMetadata.ipynb and DAPAtlasMetadata.ipynb extracts this metadata into simplified group lists (e.g., a list of all samples representing male patients) and saves these lists in the folders plotsTotalSegmentator/ and plotsDAPAtlas/. You can modify the code to generate custom sample lists (e.g., all men aged 30-35). To compare a set of groups, the filenames of all lists in the set should begin with the same name. For example, comp1_list_a.pt, comp1_list_b.pt, comp1_list_C.pt can represent a set of 3 groups. Then, PlotGroup.py can draw boxplots and perform statistical tests comparing the AI algorithms' results (dsc and nsd) for the samples inside the different custom lists you created. In our example, you just need to specify --group_name comp1 when running PlotGroup.py:

python utils/PlotGroup.py --ckpt_root totalsegmentator_results/ --group_root outputs/plotsTotalSegmentator/ --group_name comp1 --organ liver --stats
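
The naming convention above (all lists in a comparison share a filename prefix) can be checked before plotting. This is a minimal sketch, assuming the lists are stored as .pt files as in the example filenames. The helper `find_group_lists` is hypothetical, and how PlotGroup.py itself matches filenames is not shown here.

```python
from pathlib import Path


def find_group_lists(group_root, group_name):
    """Return the .pt files in group_root whose names start with group_name."""
    return sorted(Path(group_root).glob(f"{group_name}*.pt"))


if __name__ == "__main__":
    # Folder and prefix taken from the example command above.
    for path in find_group_lists("outputs/plotsTotalSegmentator", "comp1"):
        print(path.name)
```

Note that a plain prefix match would also pick up files such as comp10_list_a.pt when searching for comp1, so it is safer to choose group names that are not prefixes of one another.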

Citation

Please cite the following papers if you find our study helpful.

@article{bassi2024touchstone,
  title={Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?},
  author={Bassi, Pedro RAS and Li, Wenxuan and Tang, Yucheng and Isensee, Fabian and Wang, Zifu and Chen, Jieneng and Chou, Yu-Cheng and Kirchhoff, Yannick and Rokuss, Maximilian and Huang, Ziyan and Ye, Jin and He, Junjun and Wald, Tassilo and Ulrich, Constantin and Baumgartner, Michael and Roy, Saikat and Maier-Hein, Klaus H. and Jaeger, Paul and Ye, Yiwen and Xie, Yutong and Zhang, Jianpeng and Chen, Ziyang and Xia, Yong and Xing, Zhaohu and Zhu, Lei and Sadegheih, Yousef and Bozorgpour, Afshin and Kumari, Pratibha and Azad, Reza and Merhof, Dorit and Shi, Pengcheng and Ma, Ting and Du, Yuxin and Bai, Fan and Huang, Tiejun and Zhao, Bo and Wang, Haonan and Li, Xiaomeng and Gu, Hanxue and Dong, Haoyu and Yang, Jichen and Mazurowski, Maciej A. and Gupta, Saumya and Wu, Linshan and Zhuang, Jiaxin and Chen, Hao and Roth, Holger and Xu, Daguang and Blaschko, Matthew B. and Decherchi, Sergio and Cavalli, Andrea and Yuille, Alan L. and Zhou, Zongwei},
  journal={Conference on Neural Information Processing Systems},
  year={2024},
  url={https://github.com/MrGiovanni/Touchstone}
}

@article{li2024abdomenatlas,
  title={AbdomenAtlas: A large-scale, detailed-annotated, \& multi-center dataset for efficient transfer learning and open algorithmic benchmarking},
  author={Li, Wenxuan and Qu, Chongyu and Chen, Xiaoxi and Bassi, Pedro RAS and Shi, Yijia and Lai, Yuxiang and Yu, Qian and Xue, Huimin and Chen, Yixiong and Lin, Xiaorui and others},
  journal={Medical Image Analysis},
  pages={103285},
  year={2024},
  publisher={Elsevier}
}

@inproceedings{li2024well,
  title={How Well Do Supervised Models Transfer to 3D Image Segmentation?},
  author={Li, Wenxuan and Yuille, Alan and Zhou, Zongwei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}

@article{qu2023abdomenatlas,
  title={Abdomenatlas-8k: Annotating 8,000 CT volumes for multi-organ segmentation in three weeks},
  author={Qu, Chongyu and Zhang, Tiezheng and Qiao, Hualin and Tang, Yucheng and Yuille, Alan L and Zhou, Zongwei and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2023}
}

Acknowledgement

This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and the McGovern Foundation. Paper content is covered by patents pending.

About

[NeurIPS 2024] Touchstone - Benchmarking AI on 5,172 o.o.d. CT volumes and 9 anatomical structures