Touchstone Benchmark

Subscribe to our Google Group: https://groups.google.com/u/2/g/bodymaps

We present Touchstone, a large-scale medical segmentation benchmark based on 5,195 annotated CT volumes from 76 hospitals for training and 6,933 CT volumes from 8 additional hospitals for testing. We invite AI inventors to train their models on AbdomenAtlas, and we independently evaluate their algorithms. We have already collaborated with 14 influential research teams, and we are still accepting new submissions.

Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Pedro R. A. S. Bassi¹, Wenxuan Li¹, Yucheng Tang², Fabian Isensee³, ..., Alan Yuille¹, Zongwei Zhou¹
¹Johns Hopkins University, ²NVIDIA, ³DKFZ
NeurIPS 2024
JHU CS News

Touchstone 1.0 Leaderboard

rank	model	organization	average DSC
🏆	MedNeXt	DKFZ	89.2
🏆	MedFormer	Rutgers	89.0
3	STU-Net-B	Shanghai AI Lab	89.0
4	nnU-Net U-Net	DKFZ	88.9
5	nnU-Net ResEncL	DKFZ	88.8
6	UniSeg	NPU	88.8
7	Diff-UNet	HKUST	88.5
8	LHU-Net	UR	88.0
9	NexToU	HIT	87.8
10	SegVol	BAAI	87.1
11	U-Net & CLIP	CityU	87.1
12	Swin UNETR & CLIP	CityU	86.7
13	UNesT	NVIDIA	84.9
14	Swin UNETR	NVIDIA	84.8
15	UNETR	NVIDIA	83.3
16	UCTransNet	Northeastern University	81.1
17	SAM-Adapter	Duke	73.4
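
For readers unfamiliar with the metric, DSC (Dice similarity coefficient) measures the overlap between a predicted mask P and a reference mask G as 2|P ∩ G| / (|P| + |G|), shown in these tables on a 0-100 scale. The following is a minimal, stdlib-only sketch of that formula on flattened binary masks. The function name `dice_percent` and the convention of returning 100 when both masks are empty are assumptions made for illustration; this is not the benchmark's evaluation code.

```python
from typing import Sequence


def dice_percent(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks, in percent."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Foreground voxel counts of each mask, added together.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        # Assumption: two empty masks count as a perfect match.
        return 100.0
    return 100.0 * 2 * intersection / total


# Toy example: 4 predicted voxels, 4 reference voxels, 3 of them shared.
pred = [1, 1, 1, 1, 0, 0]
ref = [0, 1, 1, 1, 1, 0]
print(dice_percent(pred, ref))  # 75.0
```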

Aorta - NexToU & UCTransNet 🏆

rank	model	organization	DSC
🏆	UCTransNet	Northeastern University	86.5
🏆	NexToU	HIT	86.4
3	MedNeXt	DKFZ	83.1
4	nnU-Net U-Net	DKFZ	82.8
5	UniSeg	NPU	82.3
6	MedFormer	Rutgers	82.1
7	STU-Net-B	Shanghai AI Lab	82.1
8	nnU-Net ResEncL	DKFZ	81.4
9	Diff-UNet	HKUST	81.2
10	SegVol	BAAI	80.2
11	LHU-Net	UR	79.5
12	Swin UNETR & CLIP	CityU	78.1
13	UNesT	NVIDIA	77.7
14	Swin UNETR	NVIDIA	77.2
15	U-Net & CLIP	CityU	77.1
16	UNETR	NVIDIA	76.5
17	SAM-Adapter	Duke	62.8

Gallbladder - STU-Net-B & MedFormer 🏆

rank	model	organization	DSC
🏆	STU-Net-B	Shanghai AI Lab	85.5
🏆	MedFormer	Rutgers	85.3
3	MedNeXt	DKFZ	85.3
4	nnU-Net ResEncL	DKFZ	84.9
5	nnU-Net U-Net	DKFZ	84.7
6	UniSeg	NPU	84.7
7	LHU-Net	UR	83.9
8	Diff-UNet	HKUST	83.8
9	NexToU	HIT	82.3
10	U-Net & CLIP	CityU	82.1
11	Swin UNETR & CLIP	CityU	80.2
12	SegVol	BAAI	79.3
13	UCTransNet	Northeastern University	77.8
14	Swin UNETR	NVIDIA	76.9
15	UNesT	NVIDIA	75.1
16	UNETR	NVIDIA	74.7
17	SAM-Adapter	Duke	49.4

KidneyL - Diff-UNet 🏆

rank	model	organization	DSC
🏆	Diff-UNet	HKUST	91.9
2	MedFormer	Rutgers	91.9
3	nnU-Net ResEncL	DKFZ	91.9
4	STU-Net-B	Shanghai AI Lab	91.9
5	nnU-Net U-Net	DKFZ	91.9
6	LHU-Net	UR	91.8
7	MedNeXt	DKFZ	91.8
8	SegVol	BAAI	91.8
9	UniSeg	NPU	91.5
10	U-Net & CLIP	CityU	91.1
11	Swin UNETR & CLIP	CityU	91.0
12	UNesT	NVIDIA	90.1
13	Swin UNETR	NVIDIA	89.7
14	NexToU	HIT	89.6
15	UNETR	NVIDIA	89.2
16	SAM-Adapter	Duke	87.3
17	UCTransNet	Northeastern University	86.9

KidneyR - Diff-UNet 🏆

rank	model	organization	DSC
🏆	Diff-UNet	HKUST	92.8
2	MedFormer	Rutgers	92.8
3	nnU-Net U-Net	DKFZ	92.7
4	MedNeXt	DKFZ	92.6
5	nnU-Net ResEncL	DKFZ	92.6
6	LHU-Net	UR	92.5
7	STU-Net-B	Shanghai AI Lab	92.5
8	SegVol	BAAI	92.5
9	UniSeg	NPU	92.2
10	U-Net & CLIP	CityU	91.9
11	Swin UNETR & CLIP	CityU	91.7
12	UNesT	NVIDIA	90.9
13	SAM-Adapter	Duke	90.4
14	NexToU	HIT	90.1
15	UNETR	NVIDIA	90.1
16	Swin UNETR	NVIDIA	89.8
17	UCTransNet	Northeastern University	86.5

Liver - MedFormer 🏆

rank	model	organization	DSC
🏆	MedFormer	Rutgers	96.4
2	MedNeXt	DKFZ	96.3
3	nnU-Net ResEncL	DKFZ	96.3
4	LHU-Net	UR	96.2
5	nnU-Net U-Net	DKFZ	96.2
6	Diff-UNet	HKUST	96.2
7	STU-Net-B	Shanghai AI Lab	96.2
8	UniSeg	NPU	96.1
9	U-Net & CLIP	CityU	96.0
10	SegVol	BAAI	96.0
11	Swin UNETR & CLIP	CityU	95.8
12	NexToU	HIT	95.7
13	SAM-Adapter	Duke	94.1
14	UNesT	NVIDIA	95.3
15	Swin UNETR	NVIDIA	95.2
16	UNETR	NVIDIA	95.0
17	UCTransNet	Northeastern University	93.6

Pancreas - MedNeXt 🏆

rank	model	organization	DSC
🏆	MedNeXt	DKFZ	83.3
2	STU-Net-B	Shanghai AI Lab	83.2
3	MedFormer	Rutgers	83.1
4	nnU-Net ResEncL	DKFZ	82.9
5	UniSeg	NPU	82.7
6	nnU-Net U-Net	DKFZ	82.3
7	Diff-UNet	HKUST	81.9
8	LHU-Net	UR	81.0
9	U-Net & CLIP	CityU	80.8
10	Swin UNETR & CLIP	CityU	80.2
11	NexToU	HIT	80.2
12	SegVol	BAAI	79.1
13	UNesT	NVIDIA	76.2
14	Swin UNETR	NVIDIA	75.6
15	UNETR	NVIDIA	72.3
16	UCTransNet	Northeastern University	59.0
17	SAM-Adapter	Duke	50.2

Postcava - STU-Net-B & MedNeXt 🏆

rank	model	organization	DSC
🏆	STU-Net-B	Shanghai AI Lab	81.3
🏆	MedNeXt	DKFZ	81.3
3	UniSeg	NPU	81.2
4	nnU-Net U-Net	DKFZ	81.0
5	Diff-UNet	HKUST	80.8
6	MedFormer	Rutgers	80.7
7	nnU-Net ResEncL	DKFZ	80.5
8	LHU-Net	UR	79.4
9	U-Net & CLIP	CityU	78.5
10	NexToU	HIT	78.1
11	SegVol	BAAI	77.8
12	Swin UNETR & CLIP	CityU	76.8
13	Swin UNETR	NVIDIA	75.4
14	UNesT	NVIDIA	74.4
15	UNETR	NVIDIA	71.5
16	UCTransNet	Northeastern University	68.1
17	SAM-Adapter	Duke	48.0

Spleen - MedFormer 🏆

rank	model	organization	DSC
🏆	MedFormer	Rutgers	95.5
2	nnU-Net ResEncL	DKFZ	95.2
3	MedNeXt	DKFZ	95.2
4	nnU-Net U-Net	DKFZ	95.1
5	STU-Net-B	Shanghai AI Lab	95.1
6	Diff-UNet	HKUST	95.0
7	LHU-Net	UR	94.9
8	UniSeg	NPU	94.9
9	SegVol	BAAI	94.5
10	NexToU	HIT	94.7
11	U-Net & CLIP	CityU	94.3
12	Swin UNETR & CLIP	CityU	94.1
13	UNesT	NVIDIA	93.2
14	Swin UNETR	NVIDIA	92.7
15	UNETR	NVIDIA	91.7
16	SAM-Adapter	Duke	90.5
17	UCTransNet	Northeastern University	90.2

Stomach - STU-Net-B 🏆

rank	model	organization	DSC
🏆	STU-Net-B	Shanghai AI Lab	93.5
2	MedNeXt	DKFZ	93.5
3	nnU-Net ResEncL	DKFZ	93.4
4	MedFormer	Rutgers	93.4
5	UniSeg	NPU	93.3
6	nnU-Net U-Net	DKFZ	93.3
7	Diff-UNet	HKUST	93.1
8	LHU-Net	UR	93.0
9	NexToU	HIT	92.7
10	SegVol	BAAI	92.5
11	U-Net & CLIP	CityU	92.4
12	Swin UNETR & CLIP	CityU	92.2
13	UNesT	NVIDIA	90.9
14	Swin UNETR	NVIDIA	90.5
15	UNETR	NVIDIA	88.8
16	SAM-Adapter	Duke	88.0
17	UCTransNet	Northeastern University	81.9
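
The average DSC column in the main leaderboard appears consistent with an unweighted mean of the nine per-organ scores. The sketch below checks this for the top two models using the rounded values copied from the tables above. This is an observation from the published tables, not a statement of how the authors compute the average (they may, for instance, average per-sample scores before rounding); the helper name `average_dsc` is hypothetical.

```python
from statistics import mean

# Per-organ DSC values copied from the tables above, in the order:
# aorta, gallbladder, kidneyL, kidneyR, liver, pancreas, postcava, spleen, stomach.
per_organ_dsc = {
    "MedNeXt": [83.1, 85.3, 91.8, 92.6, 96.3, 83.3, 81.3, 95.2, 93.5],
    "MedFormer": [82.1, 85.3, 91.9, 92.8, 96.4, 83.1, 80.7, 95.5, 93.4],
}


def average_dsc(scores):
    """Unweighted mean over organs, rounded to one decimal like the leaderboard."""
    return round(mean(scores), 1)


for model, scores in per_organ_dsc.items():
    print(model, average_dsc(scores))
# MedNeXt 89.2
# MedFormer 89.0
```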

Touchstone 1.0 Dataset

Training set

Touchstone 1.0: AbdomenAtlas1.0Mini (N=5,195)
Touchstone 2.0: AbdomenAtlas1.1Mini (N=9,262)

Test set

Proprietary JHH dataset (N=5,172)
Public TotalSegmentator V2 dataset (N=1,228)

Figure 1. Metadata distribution in the test set.

Touchstone 1.0 Model

Note

We are releasing the trained AI models evaluated in Touchstone right here. Stay tuned!

rank	model	average DSC	parameters	inference speed
🏆	MedNeXt	89.2	61.8M	★☆☆☆☆
🏆	MedFormer	89.0	38.5M	★★★☆☆
3	STU-Net-B	89.0	58.3M	★★☆☆☆
4	nnU-Net U-Net	88.9	102.0M	★★★★☆
5	nnU-Net ResEncL	88.8	102.0M	★★★★☆
6	UniSeg	88.8	31.0M	☆☆☆☆☆
7	Diff-UNet	88.5	434.0M	★★★☆☆
8	LHU-Net	88.0	8.6M	★★★★★
9	NexToU	87.8	81.9M	★★★★☆
10	SegVol	87.1	181.0M	★★★★☆
11	U-Net & CLIP	87.1	19.1M	★★★☆☆
12	Swin UNETR & CLIP	86.7	62.2M	★★★☆☆
13	UNesT	84.9	87.2M	★★★★★
14	Swin UNETR	84.8	72.8M	★★★★★
15	UNETR	83.3	101.8M	★★★★★
16	UCTransNet	81.1	68.0M	★★★★☆
17	SAM-Adapter	73.4	11.6M	★★★★☆

Evaluation Code

1. Clone the GitHub repository

git clone https://github.com/MrGiovanni/Touchstone
cd Touchstone

2. Create environments

conda env create -f environment.yml
conda activate touchstone
python -m ipykernel install --user --name touchstone --display-name "touchstone"

3. Reproduce analysis figures in our paper

Figure 1 - Dataset statistics:

cd notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone TotalSegmentatorMetadata.ipynb
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone DAPAtlasMetadata.ipynb
#results: plots are saved inside Touchstone/outputs/plotsTotalSegmentator/ and Touchstone/outputs/plotsDAPAtlas/

Figure 2 - Potential confounders significantly impact AI performance:

cd ../plot
python AggregatedBoxplot.py --stats
#results: Touchstone/outputs/summary_groups.pdf

If you are including a new segmentation model in the evaluation, organize its results following the structure in the CSV files inside the folders totalsegmentator_results and dapatlas_results (see below). Also, include its name in the model_ranking list in plot/PlotGroup.py.

File structure

totalsegmentator_results
    ├── Diff-UNet
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── LHU-Net
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── MedNeXt
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── ...
dapatlas_results
    ├── Diff-UNet
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── LHU-Net
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── MedNeXt
    │   ├── dsc.csv
    │   └── nsd.csv
    ├── ...
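
Before running the plotting scripts on a newly added model, it can help to confirm that its folder follows the layout shown above. The following is a small sketch that only checks for the presence of dsc.csv and nsd.csv in each model folder; it does not inspect the CSV contents, whose column layout should be copied from the existing files. The function name `missing_result_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the directory tree above.
REQUIRED_FILES = ("dsc.csv", "nsd.csv")


def missing_result_files(results_root):
    """Map each model folder under results_root to the required CSVs it lacks."""
    problems = {}
    for model_dir in sorted(Path(results_root).iterdir()):
        if not model_dir.is_dir():
            continue
        missing = [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
        if missing:
            problems[model_dir.name] = missing
    return problems


if __name__ == "__main__":
    # Run from the repository root; folders that do not exist are skipped.
    for root in ("totalsegmentator_results", "dapatlas_results"):
        if Path(root).is_dir():
            print(root, missing_result_files(root) or "OK")
```

An empty dictionary means every model folder contains both files.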

Appendix D.2.3 - Statistical significance maps:

#statistical significance maps (Appendix D.2.3):
python PlotAllSignificanceMaps.py
python PlotAllSignificanceMaps.py --organs second_half
python PlotAllSignificanceMaps.py --nsd
python PlotAllSignificanceMaps.py --organs second_half --nsd
#results: Touchstone/outputs/heatmaps
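
The four invocations above are every combination of two optional flags: --organs second_half and --nsd. As a sketch, the same command lines can be generated in Python, which is convenient when scripting the full set. Only the script name and flags shown above are used; what the default (flag-free) run covers is not stated here, so the variable names `organ_flags` and `metric_flags` are descriptive guesses.

```python
from itertools import product

# Each option is either absent (empty list) or present with its arguments.
organ_flags = ([], ["--organs", "second_half"])
metric_flags = ([], ["--nsd"])

# The metric flag varies slowest, which reproduces the order listed above.
commands = [
    ["python", "PlotAllSignificanceMaps.py", *organs, *metric]
    for metric, organs in product(metric_flags, organ_flags)
]

for cmd in commands:
    print(" ".join(cmd))
```

Each entry in `commands` is an argument list, so it can be passed directly to `subprocess.run(cmd, check=True)` from the directory that contains the script.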

Appendix D.4 and D.5 - Box-plots for per-group and per-organ results, with statistical tests:

cd ../notebooks
jupyter nbconvert --to notebook --execute --ExecutePreprocessor.kernel_name=touchstone GroupAnalysis.ipynb
#results: Touchstone/outputs/box_plots

4. Custom Analysis

Define custom demographic groups (e.g., Hispanic men aged 20-25) and compare AI performance on them

The CSV result files in totalsegmentator_results/ and dapatlas_results/ contain per-sample DSC and NSD scores. Rich metadata for each of those samples (sex, age, scanner, diagnosis, ...) is available in metaTotalSeg.csv and 'Clinical Metadata FDG PET_CT Lesions.csv', for TotalSegmentator and DAP Atlas, respectively. The code in TotalSegmentatorMetadata.ipynb and DAPAtlasMetadata.ipynb extracts this metadata into simplified group lists (e.g., a list of all samples representing male patients) and saves these lists in the folders plotsTotalSegmentator/ and plotsDAPAtlas/. You can modify the code to generate custom sample lists (e.g., all men aged 30-35). To compare a set of groups, the filenames of all lists in the set should begin with the same name. For example, comp1_list_a.pt, comp1_list_b.pt, and comp1_list_C.pt can represent a set of 3 groups. PlotGroup.py can then draw boxplots and perform statistical tests comparing the AI algorithms' results (DSC and NSD) for the samples in the custom lists you created. In our example, you just need to specify --group_name comp1 when running PlotGroup.py:

python utils/PlotGroup.py --ckpt_root totalsegmentator_results/ --group_root outputs/plotsTotalSegmentator/ --group_name comp1 --organ liver --stats
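
To preview which list files a given --group_name would pick up, the shared-prefix convention described above can be sketched as follows. This is an illustration of the naming rule only: the function `lists_for_group` is hypothetical, and the exact matching logic inside PlotGroup.py is assumed rather than confirmed here.

```python
from pathlib import Path


def lists_for_group(group_root, group_name, suffix=".pt"):
    """Return the list files in group_root whose filenames start with group_name."""
    root = Path(group_root)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.name.startswith(group_name) and p.name.endswith(suffix)
    )


if __name__ == "__main__":
    folder = Path("outputs/plotsTotalSegmentator")
    if folder.is_dir():
        print(lists_for_group(folder, "comp1"))
```

Under simple prefix matching, a group named comp1 would also match files that begin with comp10, so it is safer to choose group names that are not prefixes of one another.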

Citation

Please cite the following papers if you find our study helpful.

@article{bassi2024touchstone,
  title={Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?},
  author={Bassi, Pedro RAS and Li, Wenxuan and Tang, Yucheng and Isensee, Fabian and Wang, Zifu and Chen, Jieneng and Chou, Yu-Cheng and Kirchhoff, Yannick and Rokuss, Maximilian and Huang, Ziyan and Ye, Jin and He, Junjun and Wald, Tassilo and Ulrich, Constantin and Baumgartner, Michael and Roy, Saikat and Maier-Hein, Klaus H. and Jaeger, Paul and Ye, Yiwen and Xie, Yutong and Zhang, Jianpeng and Chen, Ziyang and Xia, Yong and Xing, Zhaohu and Zhu, Lei and Sadegheih, Yousef and Bozorgpour, Afshin and Kumari, Pratibha and Azad, Reza and Merhof, Dorit and Shi, Pengcheng and Ma, Ting and Du, Yuxin and Bai, Fan and Huang, Tiejun and Zhao, Bo and Wang, Haonan and Li, Xiaomeng and Gu, Hanxue and Dong, Haoyu and Yang, Jichen and Mazurowski, Maciej A. and Gupta, Saumya and Wu, Linshan and Zhuang, Jiaxin and Chen, Hao and Roth, Holger and Xu, Daguang and Blaschko, Matthew B. and Decherchi, Sergio and Cavalli, Andrea and Yuille, Alan L. and Zhou, Zongwei},
  journal={Conference on Neural Information Processing Systems},
  year={2024},
  url={https://github.com/MrGiovanni/Touchstone}
}

@article{li2024abdomenatlas,
  title={AbdomenAtlas: A large-scale, detailed-annotated, \& multi-center dataset for efficient transfer learning and open algorithmic benchmarking},
  author={Li, Wenxuan and Qu, Chongyu and Chen, Xiaoxi and Bassi, Pedro RAS and Shi, Yijia and Lai, Yuxiang and Yu, Qian and Xue, Huimin and Chen, Yixiong and Lin, Xiaorui and others},
  journal={Medical Image Analysis},
  pages={103285},
  year={2024},
  publisher={Elsevier}
}

@inproceedings{li2024well,
  title={How Well Do Supervised Models Transfer to 3D Image Segmentation?},
  author={Li, Wenxuan and Yuille, Alan and Zhou, Zongwei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}

@article{qu2023abdomenatlas,
  title={Abdomenatlas-8k: Annotating 8,000 CT volumes for multi-organ segmentation in three weeks},
  author={Qu, Chongyu and Zhang, Tiezheng and Qiao, Hualin and Tang, Yucheng and Yuille, Alan L and Zhou, Zongwei and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2023}
}

Acknowledgement

This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and the McGovern Foundation. Paper content is covered by patents pending.
