
Chart review doc gathering status #299

Draft · wants to merge 1 commit into base: main

Conversation

dogversioning (Contributor)

This PR makes the following changes:

  • Adds a status indicator for gathering documents
  • Backs off the strict type enforcement for documents in favor of skipping and reporting an error.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Comment on lines +204 to +210
# We've found a real-world case where missing data prevented note download; for now,
# we just notify about it and return an empty string.
# If we find we actually need this data, we can change this later.
# But note that if we do, we'll need to handle downloading Binary FHIR objects, in addition to arbitrary URLs.
raise ValueError("No textual mimetype found")

print(f"Found unexpected mime type '{mimetype}', skipping.")
return ""
Contributor Author (dogversioning):

The type in question that caused this error was application/xml - I wanted to talk about this change before committing to this approach.

Contributor:

We can talk about this during our meeting today, but I'm guessing this was a Cerner thing they do - they wrap the doc in a Cerner-specific XML format, with an RDF doc base64-encoded inside. Usually they generate the HTML version of the doc later in the doc lifecycle, like when it becomes status=final or whatever the status is.

So I'm leery of trying to support application/xml - we already (very optimistically) support application/xhtml+xml. But with arbitrary XML it's harder to know what to do (i.e. how to extract the note text).

Contributor:

OK after talking about this in person, I think your beef was more that unhandled mimetypes were stopping upload-notes in its tracks, rather than you wanting to handle application/xml. That makes sense.

So I might vote for upload-notes catching the ValueError here and printing something sensible there. This method is also used by the ETL and I believe it catches errors and prints the message itself. So that's the established error-processing approach for this anyway.

Contributor Author (dogversioning):

I'm fine with that - is it cool if I keep the erroring mimetype info in the ValueError?

Contributor:

Sure - I was having vibes that it might be too low-level info, but 🤷 give the people what they crave - mimetypes

@@ -318,6 +319,13 @@ def print_header(name: str | None = None) -> None:
if name:
print(name)

def get_transient_progress()-> (progress.Progress):
Contributor:

nit: this might more naturally fit in cli_utils.py (which is where make_progress_bar() lives). I'm trying to roughly group misc functions. (Maybe even convert this into a transient=False kwarg on that method?)

How do you feel about whether a TimeElapsed column fits here? Helpful or noisy?

Also, why the parens on the return type hint? Is that something your IDE or ruff encourages?

Contributor Author (dogversioning):

Oh, at one point I was passing back a tuple - I'll remove those.

TimeElapsed is a fine addition.
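Putting the suggestions together, a sketch of what a `transient` kwarg plus a `TimeElapsedColumn` on `make_progress_bar()` could look like (the column choices here are assumptions, not the actual `cli_utils.py` implementation):

```python
from rich import progress


def make_progress_bar(transient: bool = False) -> progress.Progress:
    """Build a progress bar; transient=True clears it from the screen when done."""
    columns = [
        progress.TextColumn("[progress.description]{task.description}"),
        progress.BarColumn(),
        progress.TimeElapsedColumn(),  # the suggested elapsed-time column
    ]
    return progress.Progress(*columns, transient=transient)
```

Callers that want the old behavior pass nothing; the doc-gathering status would call `make_progress_bar(transient=True)`.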


Comment on lines +67 to +68
scanned += 1
progress.update(task, description = f"Unmatched IDs: {len(real_docref_ids)}, found: {found}, scanned: {scanned}")
Contributor:

minor nit: you'll miss the last progress update here if we early exit because we found all the target docrefs.

if not real_docref_ids:
break
scanned += 1
progress.update(task, description = f"Unmatched IDs: {len(real_docref_ids)}, found: {found}, scanned: {scanned}")

def _filter_fake_docrefs(codebook: deid.Codebook, anon_docrefs_csv: str, docrefs: Iterable[dict]) -> Iterator[dict]:
Contributor:

These two filter methods were already very similar (and maybe should have been sharing more code between them). With the extra progress reporting, this is triggering my code sense that they should share more of their guts.

The only real difference is how to calculate the id to check for each docref, right? That could be a lambda arg or something, I dunno.
