Add torch_xla_graph_execution_check_level flag (default: disabled) that emits a warning (1) or throws an error (2) during tensor sync and outputs the Python frame #9057
Revision from #9050
This PR introduces a new configuration flag, `torch_xla_graph_execution_check_level`, to provide better visibility into tensor synchronization operations during XLA graph execution. The AWS Neuron team will use this flag during HLO conversion to help developers catch cases where an input tensor's value is evaluated during compilation.

Key changes:
`torch_xla_graph_execution_check_level` (default: disabled). The checking levels supported are:
Example usage:
This enhancement helps developers:
The team saw issues when a tensor value is used as the condition of an if-else statement.
For example,
The example above can compile and run. However, it may lead developers to believe that tensors can be evaluated on the fly, which leads to unexpected behavior.
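To make the failure mode concrete, here is a minimal, standard-library-only sketch of the idea behind the flag. It is a toy model, not the torch_xla implementation: `ToyLazyTensor`, `CHECK_DISABLED`/`CHECK_WARN`/`CHECK_ERROR`, `_user_frame`, and `branch_on_value` are all hypothetical names invented for illustration, and the message format is an assumption. The sketch only shows how branching on a lazy value forces a sync, and how a check level could turn that sync into a warning (1) or an error (2) that names the Python frame responsible.

```python
import traceback
import warnings

# Hypothetical stand-ins for the check levels; these are NOT torch_xla names.
CHECK_DISABLED, CHECK_WARN, CHECK_ERROR = 0, 1, 2

# Function names that belong to the toy tensor machinery, not to user code.
_INTERNAL = {"_user_frame", "sync", "__bool__"}


def _user_frame():
    """Return the innermost stack frame that is not part of the toy tensor."""
    for frame in reversed(traceback.extract_stack()):
        if frame.name not in _INTERNAL:
            return frame
    return None


class ToyLazyTensor:
    """Toy lazy value: the number is only computed when sync() runs."""

    check_level = CHECK_DISABLED  # mirrors the flag's default of disabled

    def __init__(self, thunk):
        self._thunk = thunk
        self._value = None
        self._synced = False

    def sync(self):
        if not self._synced:
            level = ToyLazyTensor.check_level
            if level != CHECK_DISABLED:
                frame = _user_frame()
                msg = (
                    f"tensor sync triggered at "
                    f"{frame.filename}:{frame.lineno} in {frame.name}"
                )
                if level == CHECK_ERROR:
                    raise RuntimeError(msg)
                warnings.warn(msg, RuntimeWarning)
            self._value = self._thunk()
            self._synced = True
        return self._value

    def __bool__(self):
        # Using the tensor as a condition needs a concrete value, so it syncs.
        return bool(self.sync())


def branch_on_value(t):
    # The if-statement calls t.__bool__(), which forces the value to exist.
    if t:
        return "positive branch"
    return "other branch"
```

With `ToyLazyTensor.check_level` left at `CHECK_DISABLED`, `branch_on_value` runs silently, which is the situation that can mislead developers. Setting it to `CHECK_WARN` reports the offending frame and continues, while `CHECK_ERROR` stops at the first sync.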
With the change and some other future changes during graph tracing, we can
Testing:
Added unit tests for different check levels
Verified log output format and stack trace information
Tested with tensor sync scenarios
Documentation (To do):
Add usage examples and checking level descriptions
Include troubleshooting guide for common tensor sync issues