-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More aggressively mask user code errors when masking enabled #27183
More aggressively mask user code errors when masking enabled #27183
Conversation
333ee23
to
c6aef4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a summary is warranted for this PR given its complexity
error_id_by_exception: ContextVar[Mapping[int, str]] = ContextVar( | ||
"error_id_by_exception", default={} | ||
) | ||
|
||
|
||
@contextlib.contextmanager | ||
def redact_user_stacktrace_if_enabled(): | ||
"""Context manager which, if a user has enabled redacting user code errors, logs exceptions raised from within, | ||
and clears the stacktrace from the exception. It also marks the exception to be redacted if it was to be persisted | ||
or otherwise serialized to be sent to Dagster Plus. This is useful for preventing sensitive information from | ||
being leaked in error messages. | ||
""" | ||
if not _should_redact_user_code_error(): | ||
yield | ||
else: | ||
try: | ||
yield | ||
except BaseException as e: | ||
exc_info = sys.exc_info() | ||
|
||
# Generate a unique error ID for this error, or re-use an existing one | ||
# if this error has already been seen | ||
existing_error_id = error_id_by_exception.get().get(id(e)) | ||
|
||
if not existing_error_id: | ||
error_id = str(uuid.uuid4()) | ||
|
||
# Track the error ID for this exception so we can redact it later | ||
error_id_by_exception.set({**error_id_by_exception.get(), id(e): error_id}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way you are using the context var here is equivalent to just having a process global dict. What exactly is the intention here and do any of the existing tests validate that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe the goal is to increase the set of cases that this logic can handle (all kinds of exceptions besides DagsterUserCodeExecutionError can be emitted within user code, like KeyboardInterrupt or SystemExit or other DagsterError subclasses) while still only triggering the redaction if the exception was actually raised within a op_execution_error_boundary or user_code_error_boundary. I don't have a strong opinion about global dict vs. contextvar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped contextvar - as Daniel mentioned the main goal is to track which exceptions come out of a user code context, and to link them to a user-visible ID so we can render that in the UI when the redacted error is shown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be out next week so may lean on you alex to do the final accept once you have full context, but this broadly makes sense to me
error_id_by_exception: ContextVar[Mapping[int, str]] = ContextVar( | ||
"error_id_by_exception", default={} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name here should probably have 'redacted' in it: redacted_user_code_error_id_by_exception?
error_id_by_exception: ContextVar[Mapping[int, str]] = ContextVar( | ||
"error_id_by_exception", default={} | ||
) | ||
|
||
|
||
@contextlib.contextmanager | ||
def redact_user_stacktrace_if_enabled(): | ||
"""Context manager which, if a user has enabled redacting user code errors, logs exceptions raised from within, | ||
and clears the stacktrace from the exception. It also marks the exception to be redacted if it was to be persisted | ||
or otherwise serialized to be sent to Dagster Plus. This is useful for preventing sensitive information from | ||
being leaked in error messages. | ||
""" | ||
if not _should_redact_user_code_error(): | ||
yield | ||
else: | ||
try: | ||
yield | ||
except BaseException as e: | ||
exc_info = sys.exc_info() | ||
|
||
# Generate a unique error ID for this error, or re-use an existing one | ||
# if this error has already been seen | ||
existing_error_id = error_id_by_exception.get().get(id(e)) | ||
|
||
if not existing_error_id: | ||
error_id = str(uuid.uuid4()) | ||
|
||
# Track the error ID for this exception so we can redact it later | ||
error_id_by_exception.set({**error_id_by_exception.get(), id(e): error_id}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe the goal is to increase the set of cases that this logic can handle (all kinds of exceptions besides DagsterUserCodeExecutionError can be emitted within user code, like KeyboardInterrupt or SystemExit or other DagsterError subclasses) while still only triggering the redaction if the exception was actually raised within a op_execution_error_boundary or user_code_error_boundary. I don't have a strong opinion about global dict vs. contextvar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to your queue to dial this in for final review
c6aef4f
to
ff77f7b
Compare
Summary added + lots of comments dropped into code |
@@ -94,6 +97,106 @@ def _should_redact_user_code_error() -> bool: | |||
) | |||
|
|||
|
|||
redacted_exception_id_to_user_facing_identifier: dict[int, str] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.python.org/3/library/weakref.html#weakref.WeakKeyDictionary i think might be better
("Failure", True, lambda: Failure("asdf")), | ||
], | ||
) | ||
def test_masking_op_execution( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this test verify the error id stuff is working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added explicit check that the error ID shows up in the right spots
## Summary Follow-on to the functionality introduced in #18688. The original PR ensures that if the `DAGSTER_REDACT_USER_CODE_ERRORS` env var is set to `1`, instances of `DagsterUserCodeExecutionError` which are serialized are masked so that their potentially sensitive contents are not leaked. However, there is a class of errors that can still expose elements of user code: system/framework errors or interrupts, for example, will include their stack trace and snippets of user code. This PR attempts to handle this second class of errors, in a few ways: - Any error which is raised in a user code context has its stacktrace logged and then cleared, so if the error makes it to serialization, it will only include the error message. - User code exceptions have their `original_exc_info` purged. - A global mapping stores the error, so that a unique error ID can be raised in the UI, which can be used to locate the original error in logs. ## Test Plan New unit tests which specifically handle the intterrupt/system error cases.
Summary
Follow-on to the functionality introduced in #18688. The original PR ensures that if the
DAGSTER_REDACT_USER_CODE_ERRORS
env var is set to1
, instances ofDagsterUserCodeExecutionError
which are serialized are masked so that their potentially sensitive contents are not leaked. However, there is a class of errors that can still expose elements of user code: system/framework errors or interrupts, for example, will include their stack trace and snippets of user code.This PR attempts to handle this second class of errors, in a few ways:
original_exc_info
purged.Test Plan
New unit tests which specifically handle the intterrupt/system error cases.