A validation tool for DICOM files used by the Laboratory Catalog and Archive Service (LabCAS) of the Early Detection Research Network (EDRN). This program ensures that DICOM files:
- Contain little-to-no PHI/PII: Scans both DICOM headers and pixel data for protected health information (PHI) and personally identifiable information (PII)
- Adhere to EDRN requirements: Validates DICOM tags against the EDRN core and MR requirements
This tool was developed in response to EDRN/EDRN-metadata#160.
This program has features described in the following subsections.
- Header-based detection: Scans DICOM metadata tags for identifiers including:
- Patient names, birth dates, addresses
- Physician and operator names
- Email addresses, phone numbers, SSNs
- Medical record numbers (MRNs)
- Pixel-based detection: Uses OCR (Tesseract) to detect text embedded in DICOM images
- Multiple recognizers: Choose between different PHI/PII detection algorithms:
- simple-scoring (default): Pattern-based detection with configurable scoring
- accepting: Accepts all files (testing only)
- rejecting: Rejects all files (testing only)
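To illustrate what pattern-based detection with scoring can look like, here is a minimal sketch. It is not the tool's actual implementation: the patterns, the weights, and the names `PATTERNS`, `score_text`, and `scan_headers` are all hypothetical, and headers are modeled as a plain dictionary rather than a parsed DICOM dataset.

```python
import re

# Hypothetical patterns and weights, chosen only for illustration.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.9),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 1.0),
    "phone": (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), 0.7),
}


def score_text(text: str) -> float:
    """Return the highest weight among the patterns that match, or 0.0."""
    return max((w for rx, w in PATTERNS.values() if rx.search(text)), default=0.0)


def scan_headers(headers: dict[str, str]) -> dict[str, float]:
    """Map each tag name to its score, keeping only tags with a nonzero score."""
    scores = {name: score_text(value) for name, value in headers.items()}
    return {name: s for name, s in scores.items() if s > 0.0}
```

Taking the maximum weight, rather than a sum, keeps every score inside the 0.0 to 1.0 range used by the `--score` threshold.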
Validates over 40 DICOM tags against EDRN requirements including:
- Study/Series/Image Identification: UIDs, instance numbers, SOP class
- Acquisition Modality and Equipment: Modality codes, manufacturer info, device details
- Temporal Data: Dates and times in proper format
- Image Data: Dimensions, pixel data, display parameters
- MR-specific: Spacing between slices validation
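As a rough sketch of what tag validation involves, the following checks a handful of tags held in a plain dictionary. The choice of tags in `REQUIRED_TAGS` and the function name `validate_tags` are assumptions made for illustration; the actual EDRN requirements cover far more tags and the tool's own validators may work differently. The format rules used here (UIDs are dot-separated numeric components of at most 64 characters, dates are `YYYYMMDD`) come from the DICOM standard.

```python
import re
from datetime import datetime

# Hypothetical subset of required tags; the real EDRN list has over 40 entries.
REQUIRED_TAGS = ("StudyInstanceUID", "SeriesInstanceUID", "Modality", "StudyDate")

UID_RE = re.compile(r"^[0-9]+(\.[0-9]+)*$")  # dot-separated numeric components


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems found; an empty list means the tags passed."""
    problems = []
    for name in REQUIRED_TAGS:
        if not tags.get(name):
            problems.append(f"{name}: missing or empty")
    for name in ("StudyInstanceUID", "SeriesInstanceUID"):
        value = tags.get(name)
        if value and (len(value) > 64 or not UID_RE.match(value)):
            problems.append(f"{name}: not a valid UID")
    date = tags.get("StudyDate")
    if date:
        try:
            # strptime alone is lenient about digit counts, so check the shape first.
            if len(date) != 8 or not date.isdigit():
                raise ValueError(date)
            datetime.strptime(date, "%Y%m%d")
        except ValueError:
            problems.append("StudyDate: not in YYYYMMDD format")
    return problems
```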
Generates detailed Markdown reports organized by:
- Site ID
- Event ID
- File name
- Finding type and severity score
Details on installing this software follow in this section.
Requires Python 3.12 or higher and Tesseract OCR for pixel-based PHI/PII detection.
Tesseract provides optical character recognition features for this program and must be installed separately.
macOS:
brew install tesseract
Linux (Ubuntu/Debian):
sudo apt-get install tesseract-ocr
Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
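Before installing, it can help to confirm that both prerequisites are in place. The following is a small sketch using only the Python standard library; the function name `check_prerequisites` is hypothetical and is not part of this package. It assumes the Tesseract executable is named `tesseract` and is reachable through `PATH`.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 12), executable="tesseract"):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(n) for n in min_python)
        problems.append(f"Python {wanted} or higher is required")
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which(executable) is None:
        problems.append(f"'{executable}' was not found on PATH")
    return problems
```

Calling `check_prerequisites()` with no arguments checks for Python 3.12 and the `tesseract` executable.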
It's best to set up a Python virtual environment and use pip to install this package into that environment:
pip install jpl.labcas.validation
Or install from source:
git clone https://github.com/EDRN/jpl.labcas.validation.git
cd jpl.labcas.validation
pip install --editable .
The following describes how to use this program.
The easiest way to run this is:
validate-dicom-files <directory>
The <directory> should contain, at some depth beneath it, the following directory hierarchy:
<directory>
  … (sub-directories)
    collection-folder (such as Prostate_MRI)
      event-ID-folder (such as 1234567)
        … (sub-folders)
          DICOM file 1
          DICOM file 2
          …
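The hierarchy above can be traversed with a short directory walk. This is a sketch under assumptions, not the tool's own discovery logic: it treats the first folder with a seven-digit name as the event ID folder and the folder directly above it as the collection folder. The names `EVENT_ID_RE` and `find_event_files` are hypothetical.

```python
import re
from pathlib import Path

EVENT_ID_RE = re.compile(r"^\d{7}$")  # seven-digit event ID folder name


def find_event_files(root: Path) -> dict[tuple[str, str], list[Path]]:
    """Map (collection, event ID) to the files found anywhere beneath that event folder."""
    found: dict[tuple[str, str], list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Scan the folder components (everything except the file name itself).
        for i, part in enumerate(parts[:-1]):
            if i > 0 and EVENT_ID_RE.match(part):
                found.setdefault((parts[i - 1], part), []).append(path)
                break
    return found
```

Files that do not sit beneath an event ID folder are skipped rather than reported.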
Use --help to get more details, but summarizing:
-s, --score <value>: Maximum PHI/PII score threshold (0.0-1.0, default: 0.8)
-c, --concurrency <num>: Number of concurrent processes (default: CPU count)
-r, --recognizer <name>: PHI/PII recognizer to use:
- simple-scoring (default): Pattern-based detection
- accepting: Accept all files
- rejecting: Reject all files
-o, --output <file>: Output file for report (default: report.md)
-v, --verbose: Verbose logging
-q, --quiet: Quiet logging
Validate a directory with default settings:
validate-dicom-files /path/to/dicom/files
Use a different PHI/PII threshold (a lower threshold includes more findings in the report):
validate-dicom-files --score 0.5 /path/to/dicom/files
Generate a custom report filename:
validate-dicom-files --output validation_results.md /path/to/dicom/files
Use a specific number of workers:
validate-dicom-files --concurrency 4 /path/to/dicom/files
In general, use a --concurrency equal to at least the number of CPU cores available. Some recommend using twice that number.
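When the validator is driven from a script, the options shown above can be assembled programmatically. The sketch below uses only the documented command-line options; the helper names `build_command` and `run_validation` are hypothetical, and the meaning of the exit status is not documented here, so it is returned unchanged.

```python
import os
import subprocess


def build_command(directory, score=0.8, concurrency=None, output="report.md"):
    """Assemble an argument list using only the options documented above."""
    # Fall back to the CPU count, mirroring the documented default.
    workers = concurrency if concurrency is not None else (os.cpu_count() or 1)
    return [
        "validate-dicom-files",
        "--score", str(score),
        "--concurrency", str(workers),
        "--output", output,
        directory,
    ]


def run_validation(directory, **options):
    """Run the validator and return its exit status without interpreting it."""
    return subprocess.run(build_command(directory, **options), check=False).returncode
```

Passing an argument list instead of a shell string avoids quoting problems with paths that contain spaces.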
The tool generates a Markdown report with findings organized hierarchically:
- By Site ID: Grouped by blinded site identifier
- By Event ID: Grouped by 7-digit event ID
- By File: Individual DICOM files within each event
- By Finding: Each finding includes:
- Score: Severity from 0.0 (low) to 1.0 (high)
- Kind: Type of finding:
- Header: PHI/PII found in DICOM metadata
- Pixels: PHI/PII found in image data via OCR
- Validation: Tag compliance issue
- Error: File reading or processing error
- Details: Specific information about the finding
Only findings with scores above the threshold are included in the report.
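The filtering and grouping described above can be sketched as follows. The `Finding` fields, the function names, and the exact Markdown layout are assumptions made for illustration; the real report format may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    site_id: str
    event_id: str
    file_name: str
    kind: str     # "Header", "Pixels", "Validation", or "Error"
    score: float  # 0.0 (low) to 1.0 (high)
    details: str


def group_findings(findings, threshold=0.8):
    """Keep findings scoring above the threshold, nested as site -> event -> file."""
    report = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for f in findings:
        if f.score > threshold:
            report[f.site_id][f.event_id][f.file_name].append(f)
    return report


def render_markdown(report):
    """Render the nested findings as Markdown text (hypothetical layout)."""
    lines = []
    for site, events in sorted(report.items()):
        lines.append(f"# Site {site}")
        for event, files in sorted(events.items()):
            lines.append(f"## Event {event}")
            for name, items in sorted(files.items()):
                lines.append(f"### {name}")
                for f in items:
                    lines.append(f"- {f.kind} (score {f.score:.2f}): {f.details}")
    return "\n".join(lines)
```

Note the strict comparison: a finding whose score equals the threshold is left out, matching "above the threshold".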
The validation framework is modular and extensible:
- PHI/PII Recognizers: Plug-in system for different detection algorithms
- Validators: Individual validators for each DICOM tag requirement
- Findings: Structured representation of all issues discovered
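One common way to build such a plug-in system is a name-to-class registry, sketched below. This is not the package's actual architecture: the `Recognizer` protocol, the `register` decorator, and `get_recognizer` are hypothetical, and the sketch assumes that "accepting" means scoring everything 0.0 and "rejecting" means scoring everything 1.0.

```python
from typing import Protocol


class Recognizer(Protocol):
    def score(self, text: str) -> float:
        """Return a PHI/PII score from 0.0 to 1.0 for the given text."""
        ...


REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator that records a recognizer under a command-line name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("accepting")
class Accepting:
    def score(self, text: str) -> float:
        return 0.0  # assumed: nothing is ever flagged


@register("rejecting")
class Rejecting:
    def score(self, text: str) -> float:
        return 1.0  # assumed: everything is flagged


def get_recognizer(name: str) -> Recognizer:
    """Instantiate the recognizer registered under the given name."""
    try:
        return REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown recognizer: {name!r}") from None
```

With this shape, adding a new detection algorithm means writing one class and decorating it; the lookup code does not change.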
Development Status: Pre-Alpha
CT requirements may be added in the future, pending completion of the spreadsheet's CT tab.
Apache 2.0 - See LICENSE.md for details
Issues and pull requests welcome on GitHub: https://github.com/EDRN/jpl.labcas.validation/issues. See also the EDRN Code of Conduct and Contributors' Guide.
- Sean Kelly
@nutjob4life
Copyright © 2025 California Institute of Technology. U.S. Government sponsorship acknowledged.