This repository is a centralized list of breast imaging and pathology datasets commonly used in academic research, clinical training, and machine learning applications. It summarizes each dataset's key characteristics, typical uses, and how to obtain it, all in one place. If you know of a dataset that is missing and want to contribute, please submit a pull request and I'll happily approve it.
This repository provides a curated list of breast imaging and histopathology datasets, aiming to streamline access for researchers, clinicians, and students. The datasets are grouped by modality for easier navigation. The list comprises 35 publicly available datasets.
The histogram and pie chart below show how the datasets are distributed across modalities.
Histogram of Datasets by Modality:
Pie Chart of Datasets Distribution by Modality:
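The per-modality counts behind these charts can be tallied from the summary lists in this README. The sketch below is one way to reproduce them as text; the modality labels and the grouping (for example, counting the three QIN collections separately) are assumptions of this sketch and may not match how the original charts were built.

```python
# Counts tallied by hand from the per-modality summary lists in this README.
# The modality labels are this sketch's own naming, not official identifiers.
DATASETS_PER_MODALITY = {
    "Ultrasound": 5,
    "Digital breast tomosynthesis": 3,
    "Mammography": 7,
    "MRI": 12,  # assumes QIN counts as three datasets, ACRIN and ISPY as two each
    "Histopathology": 8,
}


def modality_shares(counts):
    """Return {modality: percentage of all datasets}, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def text_histogram(counts):
    """Render one line per modality with a bar of '#' characters."""
    width = max(len(name) for name in counts)
    return "\n".join(
        f"{name:<{width}} | {'#' * n} {n}" for name, n in counts.items()
    )


if __name__ == "__main__":
    print(text_histogram(DATASETS_PER_MODALITY))
    for name, share in modality_shares(DATASETS_PER_MODALITY).items():
        print(f"{name}: {share}%")
```

Under this grouping the counts add up to the 35 datasets stated above.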
Dataset | Subjects | Nº Samples | Format | Size | Year | Cite | Access data |
---|---|---|---|---|---|---|---|
Breast Ultrasound Images (BUSI) | 600 | 780 | PNG | 204MB | 2020 | Dataset of breast ultrasound images | Download here |
Breast Lesions USG | 256 | 522 | PNG | 66.67MB | 2024 | Curated benchmark dataset for ultrasound based breast lesion analysis | Download here |
UDIAT Breast Ultrasound Dataset B | 163 | 163 | N/A | N/A | 2017 | Automated Breast Ultrasound Lesions Detection Using Convolutional Neural Networks | Request Permission |
OASBUD | 78 | 200 | Matlab | 296.8MB | 2017 | Open access database of raw ultrasonic signals acquired from malignant and benign breast lesions | Download here |
BUS Synthetic Dataset | 0 | 500 | PNG | 9.7MB | 2023 | PDF-UNet: A semi-supervised method for segmentation of breast tumor images using a U-shaped pyramid-dilated network | Download here |
Summaries:
- BUSI: Small (around 500×500 px) ultrasound images suitable for classification of benign vs. malignant lesions and segmentation tasks.
- Breast Lesions USG: Ultrasound images capturing various lesions; ideal for lesion detection, classification, and segmentation.
- UDIAT Dataset B: Ultrasound scans for lesion analysis; can be used to develop detection and classification methods.
- OASBUD: Provides raw ultrasound signals, enabling advanced signal processing, segmentation, and classification methods.
- BUS Synthetic Dataset: Synthetic ultrasound images generated for model training and data augmentation, useful in classification and segmentation tasks.
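For the PNG-based ultrasound sets used in segmentation, a common first step is pairing each image with its mask files. The following is a minimal sketch, assuming a layout with one sub-folder per class and masks named after the image stem plus a `_mask` suffix. That layout, the `MASK_SUFFIX` constant, and the `pair_images_with_masks` helper are assumptions for illustration, so check them against the files you actually download.

```python
from pathlib import Path

MASK_SUFFIX = "_mask"  # assumed naming convention; verify against your download


def pair_images_with_masks(root):
    """Return a list of (label, image_path, mask_paths) for every image under root.

    Assumes root holds one sub-folder per class, and that each mask file shares
    the image's stem followed by a suffix that starts with MASK_SUFFIX.
    """
    pairs = []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        pngs = sorted(class_dir.glob("*.png"))
        # Anything without the mask suffix in its stem is treated as an image.
        images = [p for p in pngs if MASK_SUFFIX not in p.stem]
        for image in images:
            prefix = image.stem + MASK_SUFFIX
            masks = [p for p in pngs if p.stem.startswith(prefix)]
            pairs.append((class_dir.name, image, masks))
    return pairs
```

Returning a list of masks per image, rather than a single path, keeps the helper usable when one image has several annotated lesions.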
Dataset | Subjects | Nº Samples | Format | Size | Year | Cite | Download |
---|---|---|---|---|---|---|---|
Breast Cancer Screening DBT | 5060 | 22 032 | DICOM | 1.63TB | 2024 | A Data Set and Deep Learning Algorithm for the Detection of Masses and Architectural Distortions in Digital Breast Tomosynthesis Images | Download here |
EA1141 | 1444 | 500 | DICOM | 2.82TB | 2023 | Abbreviated Breast MRI and Digital Tomosynthesis Mammography in Screening Women With Dense Breasts (EA1141) (Version 1) (dataset) | Download here |
VICTRE | 2994 | 2994 | DICOM | 1.03TB | 2019 | The VICTRE Trial: Open-Source, In-Silico Clinical Trial for Evaluating Digital Breast Tomosynthesis | Download here |
Summaries:
- Breast Cancer Screening DBT: High-resolution DBT volumes suitable for lesion detection and 3D reconstruction tasks.
- EA1141: High-quality DBT data, also associated with abbreviated MRI; supports multimodal analysis, lesion detection, and screening optimization.
- VICTRE: Simulated DBT data for evaluating algorithms in a controlled setting, useful for CAD development and comparative studies.
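Because the DBT collections are measured in terabytes, it helps to estimate storage and transfer time before downloading. The sketch below uses the sizes and sample counts from the table above; the decimal unit conversion and the idealised link speed are assumptions, and the helper names are hypothetical.

```python
# Sizes and sample counts copied from the DBT table in this README.
DBT_DATASETS = {
    "Breast Cancer Screening DBT": {"size_tb": 1.63, "samples": 22032},
    "EA1141": {"size_tb": 2.82, "samples": 500},
    "VICTRE": {"size_tb": 1.03, "samples": 2994},
}

MB_PER_TB = 1_000_000  # decimal units; an assumption, the table does not specify


def average_mb_per_sample(size_tb, samples):
    """Average on-disk size of one sample, in megabytes."""
    return size_tb * MB_PER_TB / samples


def download_hours(size_tb, megabits_per_second):
    """Idealised transfer time at a sustained link speed, ignoring overhead."""
    megabytes = size_tb * MB_PER_TB
    seconds = megabytes * 8 / megabits_per_second
    return seconds / 3600


if __name__ == "__main__":
    total_tb = sum(d["size_tb"] for d in DBT_DATASETS.values())
    for name, d in DBT_DATASETS.items():
        avg = average_mb_per_sample(d["size_tb"], d["samples"])
        print(f"{name}: about {avg:.0f} MB per sample")
    print(f"All three: {total_tb:.2f} TB, "
          f"about {download_hours(total_tb, 100):.0f} h at 100 Mbit/s")
```

Real transfers are usually slower than this estimate, so treat the result as a lower bound.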
Summaries:
- CBIS-DDSM: A large set of annotated mammograms, excellent for classification, detection of calcifications, and mass segmentation tasks.
- CMMD: Mammograms from a Chinese cohort, useful for cross-population studies, lesion detection, and classification.
- CDD-CESM: Contrast-enhanced spectral mammography images supporting advanced analysis of vascularized lesions, aiding classification and differentiation tasks.
- VinDr-Mammo: Large-scale dataset with diverse annotations for robust AI model training in detection and classification.
- INBreast: High-quality full-field digital mammograms with detailed annotations, ideal for algorithm benchmarking.
- MIAS: Classic mammography dataset widely used for initial model training, testing basic classification and detection algorithms.
- Breast Tumor Mammography Dataset: A smaller dataset well-suited for entry-level experiments in tumor detection and basic classification.
Summaries:
- ACRIN-6667 & ACRIN-6698: Rich MRI datasets for assessing neoadjuvant chemotherapy response, ideal for detection, segmentation of lesions, and longitudinal analysis.
- ISPY1 & ISPY2: Multiparametric MRI for evaluating early response to therapy; supports predictive modeling, segmentation, and classification of treatment outcomes.
- Duke Breast Cancer MRI: High-quality DCE-MRI scans enabling lesion characterization, radiogenomics analysis, and segmentation tasks.
- Breast Cancer Patients MRI’s: JPG-format MRI slices suited for basic classification and proof-of-concept tasks.
- Breast MRI NACT Pilot: Focused on patients undergoing neoadjuvant chemotherapy, enabling treatment response analysis and lesion segmentation.
- QIN (Breast DCE-MRI, QIN-BREAST, QIN-BREAST-02): Small, high-quality sets for benchmarking quantitative imaging biomarkers, segmentation, and modeling treatment response.
- Advanced MRI Breast Lesions: Large, detailed dataset for evaluating complex MRI models, including advanced lesion segmentation and classification.
- BREAST DIAGNOSIS: DCE-MRI aimed at diagnostic feature extraction, supporting classification and lesion characterization tasks.
Summaries:
- Post NAT BRCA: High-resolution WSIs post-neoadjuvant therapy, ideal for quantifying residual disease, segmentation, and treatment response analysis.
- Breast Histopathology Images: Smaller PNG patches well-suited for basic classification and validation of histopathology models.
- BreakHis: Large-scale histopathology dataset, excellent for classification and detection of tumor subtypes at various magnifications.
- Breast Cancer Cell Segmentation: Focused on cell-level segmentation tasks, useful for training models that identify and count cells in tissues.
- BCSS: Annotated crowdsourced dataset enabling segmentation and classification tasks at the tissue level.
- TUPAC16: Whole-slide images for quantifying tumor proliferation; used in classification and mitosis detection challenges.
- CAMELYON: WSIs for lymph node metastasis detection and segmentation tasks.
- BACH: Balanced dataset of four histopathological classes (normal, benign, in situ, invasive) suitable for classification and region-of-interest detection.
- Contributions: Suggestions for new datasets, updates, or corrections are welcome. Please open an issue or submit a pull request.
- Contact: For access to a specific dataset, follow the provided links or contact the dataset maintainers directly. For questions about this repository, feel free to email me at [email protected].