
Chart review doc gathering status #299

Draft · wants to merge 1 commit into base: main

Conversation

dogversioning (Contributor)

This PR makes the following changes:

  • Adds a status indicator for gathering documents
  • Backs off the strict type enforcement for documents in favor of skipping and reporting an error.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Comment on lines +204 to +210
# We've found a real-world case where missing data prevented note download; for now,
# we just notify about it and return an empty string.
# If we find we actually need this data, we can change this later.
# But note that if we do, we'll need to handle downloading Binary FHIR objects, in addition to arbitrary URLs.
raise ValueError("No textual mimetype found")

print(f"Found unexpected mime type '{mimetype}', skipping.")
return ""
Contributor Author (dogversioning):

The type in question that caused this error was application/xml - I wanted to talk about this change before committing to this approach.

Contributor:

We can talk about this during our meeting today, but I'm guessing this was a Cerner thing they do - they wrap the doc in a Cerner-specific XML format, with an RDF doc base64-encoded inside. Usually they generate the HTML version of the doc later in the doc lifecycle, like when it becomes status=final or whatever the status is.

So I'm leery of trying to support application/xml - we already (very optimistically) support application/xhtml+xml. But with arbitrary XML it's harder to know what to do (i.e. how to extract the note text).

Contributor:

OK after talking about this in person, I think your beef was more that unhandled mimetypes were stopping upload-notes in its tracks, rather than you wanting to handle application/xml. That makes sense.

So I might vote for upload-notes catching the ValueError here and printing something sensible there. This method is also used by the ETL and I believe it catches errors and prints the message itself. So that's the established error-processing approach for this anyway.

Contributor Author (dogversioning):

I'm fine with that - is it cool if I keep the erroring mimetype info in the ValueError?

Contributor:

Sure - I was having vibes that it might be too low-level info, but 🤷 give the people what they crave - mimetypes

@@ -318,6 +319,13 @@ def print_header(name: str | None = None) -> None:
if name:
print(name)

def get_transient_progress()-> (progress.Progress):
Contributor:

nit: this might more naturally fit in cli_utils.py (which is where make_progress_bar() lives). I'm trying to roughly group misc functions. (Maybe even convert this into a transient=False kwarg on that method?)

How do you feel about whether a TimeElapsed column fits here? Helpful or noisy?

Also, why the parens on the return type hint? Is that something your IDE or ruff encourages?

Contributor Author (dogversioning):

Oh, at one point I was passing back a tuple - I'll remove those.

TimeElapsed is a fine addition.
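Putting the suggestions together, a sketch of what a `transient` kwarg plus a `TimeElapsedColumn` on `make_progress_bar()` could look like (the column choices here are assumptions, not the actual `cli_utils.py` implementation):

```python
from rich import progress


def make_progress_bar(transient: bool = False) -> progress.Progress:
    """Build a progress bar; transient=True clears it from the screen when done."""
    columns = [
        progress.TextColumn("[progress.description]{task.description}"),
        progress.BarColumn(),
        progress.TimeElapsedColumn(),  # the suggested elapsed-time column
    ]
    return progress.Progress(*columns, transient=transient)
```

Callers that want the old behavior pass nothing; the doc-gathering status would call `make_progress_bar(transient=True)`.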


Comment on lines +67 to +68
scanned += 1
progress.update(task, description = f"Unmatched IDs: {len(real_docref_ids)}, found: {found}, scanned: {scanned}")
Contributor:

minor nit: you'll miss the last progress update here if we early exit because we found all the target docrefs.

if not real_docref_ids:
break
scanned += 1
progress.update(task, description = f"Unmatched IDs: {len(real_docref_ids)}, found: {found}, scanned: {scanned}")

def _filter_fake_docrefs(codebook: deid.Codebook, anon_docrefs_csv: str, docrefs: Iterable[dict]) -> Iterator[dict]:
Contributor:

These two filter methods were already very similar (and maybe should have been sharing more code between them). With the extra progress reporting, this is triggering my code sense that they should share more of their guts.

The only real difference is how to calculate the id to check for each docref, right? That could be a lambda arg or something, I dunno.
