Not understanding why DICOM redaction does not detect Patient Name on example data #1309

Open
parataaito opened this issue Feb 23, 2024 · 14 comments

Comments

@parataaito

parataaito commented Feb 23, 2024

Hello!

First, thanks for this tool. It looks very promising, so congrats on the idea!

I have a question, though.
I followed the walkthrough from here:
I used the "0_ORIGINAL.dcm" file from the test files.

Here is my code to show it seems identical to the tutorial:

import pydicom
from presidio_image_redactor import DicomImageRedactorEngine
import matplotlib.pyplot as plt

def compare_dicom_images(
    instance_original: pydicom.dataset.FileDataset,
    instance_redacted: pydicom.dataset.FileDataset,
    figsize: tuple = (11, 11)
) -> None:
    """Display the DICOM pixel arrays of both original and redacted as images.

    Args:
        instance_original (pydicom.dataset.FileDataset): A single DICOM instance (with text PHI).
        instance_redacted (pydicom.dataset.FileDataset): A single DICOM instance (redacted PHI).
        figsize (tuple): Figure size in inches (width, height).
    """
    _, ax = plt.subplots(1, 2, figsize=figsize)
    ax[0].imshow(instance_original.pixel_array, cmap="gray")
    ax[0].set_title('Original')
    ax[1].imshow(instance_redacted.pixel_array, cmap="gray")
    ax[1].set_title('Redacted')
    plt.show()
    
# Set input and output paths
input_path = "0_ORIGINAL.dcm"
output_dir = "./output"

# Initialize the engine
engine = DicomImageRedactorEngine()

# Option 1: Redact from a loaded DICOM image
dicom_image = pydicom.dcmread(input_path)
redacted_dicom_image = engine.redact(dicom_image, use_metadata=True, fill="contrast")

compare_dicom_images(dicom_image, redacted_dicom_image)

However, my output is this:
image

I don't understand why the Patient Name is not redacted like it is in your example:
image

For additional info, I am using Python 3.11.2 (but I tried with 3.9 too).

PS: I did not file this as a bug since I am not 100% sure it is one. The problem is probably on my side, but I have no idea where it comes from...

Thanks in advance :)

@parataaito parataaito changed the title No understanding how DICOM redaction works No understanding why DICOM redaction does not detect Patient Name Feb 23, 2024
@parataaito parataaito changed the title No understanding why DICOM redaction does not detect Patient Name No understanding why DICOM redaction does not detect Patient Name on example data Feb 23, 2024
@parataaito parataaito changed the title No understanding why DICOM redaction does not detect Patient Name on example data Not understanding why DICOM redaction does not detect Patient Name on example data Feb 23, 2024
@parataaito
Author

I just want to add that I also followed the example_dicom_image_redactor.ipynb notebook.
Here are my results:
image
image
image
image

@parataaito
Author

Hello!
It's been a month now and no news :'(
Has anybody had the same problem and managed to solve it?

@omri374
Contributor

omri374 commented Mar 28, 2024

Apologies for the delay. We will look into this soon and report back.

@omri374
Contributor

omri374 commented Mar 29, 2024

@parataaito a hotfix was created and a new version released. Could you please check again? Apologies for the late resolution on this!

@omri374
Contributor

omri374 commented Mar 29, 2024

Closing for now, please re-open if needed.

@omri374 omri374 closed this as completed Mar 29, 2024
@parataaito
Author

Thanks for the (very) quick reply!
Going to check right away!

@parataaito
Author

parataaito commented Mar 29, 2024

Works like a charm on all the demo files! So that's perfect!

I also tested it on random data I generated, and I was wondering if you understand why it does not work specifically on this one: sample_data.zip

image

Is it because the data I burnt into the pixel array does not match any value in the DICOM tags?

@omri374
Contributor

omri374 commented May 1, 2024

The DICOM redactor either takes values from the tags or uses different text-based approaches to identify entities such as names. In this case, the default spaCy model used by Presidio is not able to detect "ez OY" as a name, but a different model can. I would suggest experimenting with changing Presidio's configuration. For example:

import pydicom

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import TransformersNlpEngine
from presidio_image_redactor import ImageAnalyzerEngine, DicomImagePiiVerifyEngine

# Use a transformers-based NER model instead of relying on spaCy alone
model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "StanfordAIMI/stanford-deidentifier-base",
        },
    }
]

# Chain the engines: NLP engine -> text analyzer -> image analyzer -> DICOM verify engine
nlp_engine = TransformersNlpEngine(models=model_config)
text_analyzer_engine = AnalyzerEngine(nlp_engine=nlp_engine)
image_analyzer_engine = ImageAnalyzerEngine(analyzer_engine=text_analyzer_engine)
dicom_engine = DicomImagePiiVerifyEngine(image_analyzer_engine=image_analyzer_engine)

# Placeholder path (not from the original snippet): point this at your own DICOM file
file_of_interest = "path/to/file.dcm"
instance = pydicom.dcmread(file_of_interest)
verify_image, ocr_results, analyzer_results = dicom_engine.verify_dicom_instance(instance, padding_width=25, show_text_annotation=True)

Running this version with the spaCy model does not identify the bounding box with a name as PII, whereas this transformers model (StanfordAIMI/stanford-deidentifier-base) does. I would suggest looking further into ways to improve and customize the PII detection flows with Presidio: https://microsoft.github.io/presidio/tutorial/
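To illustrate the first route mentioned above (taking values from the tags), here is a minimal, library-free sketch of why burnt-in text that matches no tag value is missed by the metadata path. It is not Presidio's actual implementation. The helper names `build_deny_list` and `match_ocr_words` and the sample tag values are hypothetical, chosen only for illustration.

```python
import re


def build_deny_list(tag_values):
    """Split metadata values (e.g. a PatientName) into lowercase words."""
    words = set()
    for value in tag_values:
        # DICOM person names separate components with '^', e.g. "DOE^JOHN"
        for part in re.split(r"[\^\s,]+", str(value)):
            if part:
                words.add(part.lower())
    return words


def match_ocr_words(ocr_words, deny_list):
    """Return the OCR words that also appear in the metadata deny list."""
    return [word for word in ocr_words if word.lower() in deny_list]


# Hypothetical tag values and OCR output, for illustration only
deny_list = build_deny_list(["DOE^JOHN", "19800101"])
matched = match_ocr_words(["John", "Doe", "ez", "OY"], deny_list)
```

Here "John" and "Doe" are matched because they appear in the tags, while "ez" and "OY" are not. Text like that can only be caught by the text-based (NLP) route, which is why the choice of model matters.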

@jhssilva

jhssilva commented May 9, 2024

Hi @omri374.
I have a problem: the DICOM redaction doesn't detect the text in the header. Please refer to the following image. (I've redacted the patient data and blurred the image, as this is an official image.)
2024-05-09_14-54-01
This is the code that I'm currently using:

import matplotlib.pyplot as plt
import pydicom
from presidio_analyzer import Pattern, PatternRecognizer
from presidio_image_redactor import DicomImageRedactorEngine

input_path = "./test"
output_dir = "./output"

engine = DicomImageRedactorEngine()

# Catch-all recognizer: treat any text returned by the OCR as an entity to redact
pattern_all_text = Pattern(name="any_text", regex=r"(?s).*", score=0.5)
custom_recognizer = PatternRecognizer(
    supported_entity="TEXT",
    patterns=[pattern_all_text]
)

dicom_image = pydicom.dcmread(input_path)
redacted_dicom_image = engine.redact(dicom_image, fill="background", use_metadata=False, ad_hoc_recognizers=[custom_recognizer], allow_list=[])
redacted_dicom_image.save_as(f"{output_dir}/redacted_dicom.dcm")

# Reload the saved file and display its pixel data
redact_image = pydicom.dcmread(output_dir + "/redacted_dicom.dcm")
redact_image = redact_image.pixel_array
plt.imshow(redact_image, cmap='gray')
plt.show()

It redacts all the information except the header.

@omri374
Contributor

omri374 commented May 9, 2024

It could be an OCR issue, where the OCR just can't detect the bounding box. Have you looked into the bounding boxes returned by the OCR?
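One way to check this is to filter the returned boxes by image region. The sketch below assumes each box is a dict with `left`, `top`, `width` and `height` keys in pixels. That layout is an assumption, as are the helper name `boxes_in_region` and the sample values, so adapt them to whatever your OCR step actually returns.

```python
def boxes_in_region(bboxes, top, bottom):
    """Return the boxes whose vertical extent overlaps rows [top, bottom)."""
    return [
        box for box in bboxes
        if box["top"] < bottom and box["top"] + box["height"] > top
    ]


# Hypothetical OCR output: two boxes, neither of them in the top 60 rows
sample_boxes = [
    {"left": 10, "top": 120, "width": 80, "height": 14},
    {"left": 300, "top": 400, "width": 60, "height": 12},
]

header_hits = boxes_in_region(sample_boxes, top=0, bottom=60)
```

An empty `header_hits` would suggest that the OCR never produced a box for the header, so no recognizer (not even a catch-all pattern) gets the chance to flag it.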

@omri374 omri374 reopened this May 9, 2024
@omri374
Contributor

omri374 commented May 9, 2024

Adding @niwilso and @ayabel in case they have any recommendations here as DICOM experts.

@jhssilva

jhssilva commented May 11, 2024

Thank you for the answer, @omri374.
Should I look for something in particular in the bounding boxes?

This is the output of the simple program.
2024-05-11_10-44-59
2024-05-11_11-25-34

I followed the following documentation. The header doesn't seem to be picked up in the bounding boxes.

Regarding the image, this is a DICOM ultrasound image. Even if I save it as a normal image and then use Presidio, the issue persists.

@ayabel
Collaborator

ayabel commented May 12, 2024

Hi @jhssilva, it might be because the contrast between the text and the background is relatively low. In this case, you might want to consider preprocessing the image before feeding it to the redactor. Ideas for such preprocessing functions can be found here:

presidio-image-redactor/presidio_image_redactor/image_processing_engine.py
Specifically, applying the cv2.adaptiveThreshold function could help increase the contrast
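To show the idea behind mean-based adaptive thresholding, here is a NumPy-only sketch. It is a stand-in written for illustration, not `cv2.adaptiveThreshold` itself, and its border handling and rounding may differ from OpenCV's. The function name `adaptive_mean_threshold` and its defaults are assumptions.

```python
import numpy as np


def adaptive_mean_threshold(gray, block_size=15, c=2.0):
    """Binarize a 2-D grayscale array against the mean of each pixel's window."""
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    pad = block_size // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")

    # Integral image with a leading row and column of zeros,
    # so that every window sum is four lookups
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)

    h, w = gray.shape
    k = block_size
    window_sum = (
        integral[k:k + h, k:k + w]
        - integral[:h, k:k + w]
        - integral[k:k + h, :w]
        + integral[:h, :w]
    )
    local_mean = window_sum / (k * k)

    # A pixel becomes white when it is brighter than (local mean - c)
    return np.where(gray > local_mean - c, 255, 0).astype(np.uint8)


# Low-contrast toy image: one slightly darker pixel on a uniform background
gray = np.full((20, 20), 100, dtype=np.uint8)
gray[10, 10] = 90
binary = adaptive_mean_threshold(gray)
```

In this toy case the single darker pixel becomes 0 and the uniform background becomes 255, so a 10-level difference turns into full contrast. For bright text on a dark background, inverting the image first may be needed.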

@jhssilva

jhssilva commented May 15, 2024

Hey @ayabel. Thank you for your input and guidance.

I've tested adaptiveThreshold as suggested.
However, in my case it creates a problem, as I need the images to keep their original contrast (for now; this may change in the future).

That being said, I've decided to take a different approach:
select the top part of the image, redact it, and then stitch the parts back together. This approach seems to work.
Example:

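import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from presidio_analyzer import Pattern, PatternRecognizer
from presidio_image_redactor import ImageRedactorEngine

# Assumed setup: the original snippet used `redactor_image` without defining it
redactor_image = ImageRedactorEngine()

# Catch-all recognizer: treat any text returned by the OCR as an entity to redact
pattern_all_text = Pattern(name="any_text", regex=r"(?s).*", score=0.5)
custom_recognizer = PatternRecognizer(
    supported_entity="TEXT",
    patterns=[pattern_all_text]
)
dicom_image = Image.open("new_image.png")

top_height = 60

# Convert the original image to a numpy array
image = np.array(dicom_image)

top_part = image[0:top_height, :]

rest_of_image = image[top_height:, :]

# Convert the top part of the image back to a PIL Image
top_part_image = Image.fromarray(top_part)

redacted_image = redactor_image.redact(top_part_image, fill="black", ad_hoc_recognizers=[custom_recognizer], allow_list=[])

# Convert the redacted PIL Image back to an array before stitching
final_image = np.concatenate((np.array(redacted_image), rest_of_image), axis=0)

plt.imshow(final_image)
plt.show()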

Note: In this example I didn't redact the bottom part of the image.
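The same split-and-stitch idea can be sketched for any band of rows, which would also cover the bottom part. This is a sketch under assumptions: `redact_rows` and `blackout` are hypothetical helpers, and `blackout` is only a stand-in for a real redaction call (which would convert the strip to a PIL image, redact it, and convert it back).

```python
import numpy as np


def redact_rows(image, row_start, row_stop, redact_fn):
    """Apply redact_fn to image[row_start:row_stop] and stitch the result back."""
    strip = image[row_start:row_stop]
    redacted_strip = np.asarray(redact_fn(strip))
    if redacted_strip.shape != strip.shape:
        raise ValueError("redact_fn must preserve the strip's shape")
    return np.concatenate(
        (image[:row_start], redacted_strip, image[row_stop:]), axis=0
    )


def blackout(strip):
    """Stand-in redaction: paint the whole strip black."""
    return np.zeros_like(strip)


# Toy image: uniform gray, 100 rows by 50 columns
image = np.full((100, 50), 200, dtype=np.uint8)
height = image.shape[0]

redacted = redact_rows(image, 0, 60, blackout)                    # top band
redacted = redact_rows(redacted, height - 20, height, blackout)  # bottom band
```

Because only the selected bands pass through `redact_fn`, the rest of the pixel data keeps its original contrast.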

Suggestion: It would be nice to have examples for such cases in the documentation, such as using the adaptive threshold or, for specific cases, the approach I've suggested.

Image Output
2024-05-15_22-50-48
