The PDF Deid Dataset is a fully synthetic collection of medical-style PDF documents created for de-identification tasks. All content is artificially generated using the Faker library and the Gemini API, so no real patient data is used. The dataset is intended for training, benchmarking, and validating OCR and NLP models that identify and redact personally identifiable information (PII) in realistic medical document formats.
The dataset is divided into three difficulty levels: Easy, Medium, and Hard. The Easy set contains 30 PDF files with clean layouts and consistent formatting. The Medium level builds on this by adding 10 more PDF files that introduce varied formatting and noise. The Hard level includes all previous files plus 10 more PDFs with more complex document formats and heavier noise.
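Because each level builds on the previous one, the number of files evaluated at a level is a running total of the files introduced so far. A minimal sketch of that bookkeeping, using only the counts stated above (the names `NEW_FILES` and `cumulative_counts` are illustrative, not part of the dataset):

```python
# New files introduced at each level, as stated in the description.
NEW_FILES = {"Easy": 30, "Medium": 10, "Hard": 10}


def cumulative_counts(new_files):
    """Return the running total of files available at each level."""
    totals = {}
    running = 0
    for level, count in new_files.items():
        running += count
        totals[level] = running
    return totals


print(cumulative_counts(NEW_FILES))
```

This gives 30 files for Easy, 40 for Medium, and 50 for Hard.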
The dataset contains the following Personally Identifiable Information (PII) entities, designed to simulate real-world clinical records:
- Patient Name
- Patient Date of Birth (DOB)
- Patient Age
- Patient Social Security Number (SSN)
- Hospital ID
- Doctor Name
- Doctor ID
- Hospital Name
- Hospital Contact
- Other Dates
Each document is organized into the following sections:
- Header
- Patient Summary (Paragraph)
- Patient Demographics (Form)
- Patient Lifestyle (Form)
- Patient Vitals (Form)
- Doctor Information (Form)
- Doctor Notes Section (Paragraph)
- Past Hospital Visits (Table)
- Current Medications (Table)
- Medical Tests (Table)
- Footer
- PHI in sections such as Patient Demographics, Patient Lifestyle, Patient Vitals, and Doctor Information can span multiple lines, which adds complexity.
- The Patient Summary section may include PHI (Name, DOB, Age) in either form fields or free-text format.
- Tables in Past Hospital Visits, Current Medications, and Medical Tests can appear either with borders or without.
- Medium PDF files feature a different layout and include added noise such as ink bleed, dirty screen effects, and subtle background textures.
- Hard PDF files contain densely packed information on single lines, which can challenge OCR systems. Additional noise includes punch holes on the left, binder clips on the right, noise texturisation, and moiré patterns.
Level | Files Included | Avg. PHI Entities | % of Total PHI |
---|---|---|---|
🟢 Easy | 30 | 41 | DATE [50%], NAME [15%], HOSPITAL/ORGANISATION [15%], Phone Number [7.5%], AGE [5%], IDNUM [7.5%] |
🟡 Medium | 10 | 52 | DATE [38.8%], NAME [24.5%], ADDRESS [20.4%], Phone Number [4.1%], AGE [4.1%], IDNUM [8.1%] |
🔴 Hard | 10 | 46 | DATE [43.9%], NAME [22%], ADDRESS [12.2%], Phone Number [7.3%], AGE [4.9%], IDNUM [9.8%] |
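As a rough reading aid, the percentages in the table can be turned into approximate per-file counts for each PHI type. The sketch below assumes the percentages apply to the average entity count per file, which the table does not state explicitly. The dictionary and function names are illustrative.

```python
# Values copied from the table above.
AVG_ENTITIES = {"Easy": 41, "Medium": 52, "Hard": 46}
DISTRIBUTION = {
    "Easy": {"DATE": 50.0, "NAME": 15.0, "HOSPITAL/ORGANISATION": 15.0,
             "Phone Number": 7.5, "AGE": 5.0, "IDNUM": 7.5},
    "Medium": {"DATE": 38.8, "NAME": 24.5, "ADDRESS": 20.4,
               "Phone Number": 4.1, "AGE": 4.1, "IDNUM": 8.1},
    "Hard": {"DATE": 43.9, "NAME": 22.0, "ADDRESS": 12.2,
             "Phone Number": 7.3, "AGE": 4.9, "IDNUM": 9.8},
}


def approx_counts(level):
    """Estimate entities per file for each PHI type (assumption: share * average)."""
    avg = AVG_ENTITIES[level]
    return {label: round(avg * pct / 100, 1)
            for label, pct in DISTRIBUTION[level].items()}


for level in AVG_ENTITIES:
    # Shares should add up to roughly 100%, allowing for rounding in the table.
    total = sum(DISTRIBUTION[level].values())
    print(level, f"(shares sum to {total:.1f}%)", approx_counts(level))
```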
- `Visual_NLP_Metrics.ipynb`
  - Guide for calculating NLP metrics using JSL packages.
  - No pretrained pipeline included.
- `Visual_NLP_Pretrained.ipynb`
  - Guide for performing deidentification/obfuscation using pretrained pipelines.
- `Visual_NLP_ZeroShot_Metrics.ipynb`
  - Guide for calculating NLP metrics using JSL packages.
  - Uses zero-shot stages for specific PHI detection.
  - Patient Name, Patient DOB and Patient SSN are included for de-identification.
- PDF Original:
  - Easy [ Contains 30 Easy PDF Files ]
  - Medium [ Contains 10 Medium PDF Files ]
  - Hard [ Contains 10 Hard PDF Files ]
- Mapping:
  - Ground Truth Files [ JSON files `pdf_deid_gts_*.json` contain ground truth data. ]
  - Predicted Mapping Files [ JSON files `*_result_mapping.json` have predicted values, ground truth, precision, and recall. ]
- Sample Result: [ Contains PDF example outputs with black bbox in place of PHI. ]
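The ground truth and predicted mapping files can be located by their filename patterns. The following is a minimal sketch using only the Python standard library. The helper names and the root directory passed in are assumptions, and the JSON schema is not described here, so the loader simply returns whatever the file contains.

```python
import json
from pathlib import Path


def find_mapping_files(root):
    """Collect ground truth and predicted mapping files under `root`."""
    root = Path(root)
    ground_truth = sorted(root.rglob("pdf_deid_gts_*.json"))
    predicted = sorted(root.rglob("*_result_mapping.json"))
    return ground_truth, predicted


def load_json(path):
    """Load one JSON file; no particular schema is assumed."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `find_mapping_files` on the local copy of the dataset returns two sorted lists of paths, each of which can then be passed to `load_json` for inspection.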
Difficulty Level | Precision | Recall | F1-Score | Total Files |
---|---|---|---|---|
🟢 Easy | 0.9851 | 0.9799 | 0.9825 | 30 |
🟡 Medium | 0.9800 | 0.9575 | 0.9686 | 40 |
🟡 Zero Shot Medium | 0.9861 | 1.0000 | 0.9930 | 10 |
🔴 Hard | 0.9561 | 0.9290 | 0.9424 | 50 |
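The F1-Score column is consistent with the harmonic mean of the Precision and Recall columns, F1 = 2PR / (P + R). A short sketch that recomputes it from the table values (the function and dictionary names are illustrative; how precision and recall were aggregated across files is not specified here):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
RESULTS = {
    "Easy": (0.9851, 0.9799),
    "Medium": (0.9800, 0.9575),
    "Zero Shot Medium": (0.9861, 1.0),
    "Hard": (0.9561, 0.9290),
}

for level, (precision, recall) in RESULTS.items():
    print(f"{level}: F1 = {f1_score(precision, recall):.4f}")
```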