
Add torch_xla_graph_execution_check_level flag (default: disabled) that emits a warning (1) or throws an error (2) during tensor sync and outputs the Python frame #9057


Open · wants to merge 4 commits into master

Conversation


@aws-yyjau commented Apr 29, 2025

Revision from #9050

This PR introduces a new configuration flag, torch_xla_graph_execution_check_level, to provide better visibility into tensor synchronization operations during XLA graph execution. The AWS Neuron team will use this flag during HLO conversion to help developers catch cases where an input tensor's value is evaluated during compilation.

Key changes:

  • Added new configuration flag torch_xla_graph_execution_check_level (default: disabled).
  • Implemented warning/error logging during tensor synchronization events
  • Added Python stack trace output for debugging tensor sync operations
  • Log messages include relevant context, such as the shape of the tensor being synced
  • Throw an error when the level is set to 2 (ERROR)

The checking levels supported are:

  • DISABLED (default): No logging
  • WARNING (value: 1): Check and log tensor sync operations as warnings
  • ERROR (value: 2): Check and log tensor sync operations as warnings, then throw an XLA error

Example usage:

import torch_xla
torch_xla._XLAC._set_torch_xla_graph_execution_check_level(1)
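
The snippet below extends this usage into a minimal, hedged sketch of what actually triggers the check. It assumes only the binding shown above, plus the standard torch_xla behavior that reading a tensor's value (e.g. via .item()) forces the pending lazy graph to execute:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

# WARNING level (1): tensor syncs are logged together with the Python frame.
torch_xla._XLAC._set_torch_xla_graph_execution_check_level(1)

t = torch.ones(4, device=xm.xla_device()) * 2
# Reading the value forces the pending lazy graph to execute (a tensor sync),
# which is the event this flag is meant to surface.
print(t[0].item())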

This enhancement helps developers:

The team has seen issues when a tensor's value is used to drive an if-else statement. For example,

def forward(self, tensor):
    if tensor[0] == 1:
        return tensor
    else:
        return tensor * 2

The example above can compile and run. However, it may lead developers to believe that tensors can be evaluated on the fly during tracing, resulting in unexpected behavior.
With this change, and further changes to graph tracing in the future, we can:

  1. Identify potential code-path issues in XLA graph execution and prevent users from relying on tensor values inside the graph (see the sketch after this list)
  2. Debug and trace tensor synchronization issues more effectively
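
The sketch below applies the check to the if-else pattern above. It is illustrative only: the exact Python exception type raised at level 2 is an assumption, since the PR only states that an XLA error is thrown.

import torch
import torch_xla
import torch_xla.core.xla_model as xm

# ERROR level (2): a data-dependent branch should fail loudly during tracing
# instead of silently forcing a graph execution.
torch_xla._XLAC._set_torch_xla_graph_execution_check_level(2)

def forward(tensor):
    # Evaluating tensor[0] == 1 requires the tensor's value, i.e. a tensor sync.
    if tensor[0] == 1:
        return tensor
    else:
        return tensor * 2

try:
    forward(torch.ones(4, device=xm.xla_device()))
except Exception as e:  # exception type assumed; the PR describes it as an XLA error
    print("tensor sync during tracing rejected:", e)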

Testing:

  • Added unit tests for different check levels (a sketch of what such a test might look like is shown after this list)
  • Verified the log output format and stack trace information
  • Tested tensor sync scenarios
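
A hedged sketch of what such a unit test could look like; the test class, the use of 0 to mean DISABLED, and the expected exception type are assumptions, not the PR's actual test code:

import unittest
import torch
import torch_xla
import torch_xla.core.xla_model as xm


class GraphExecutionCheckLevelTest(unittest.TestCase):

    def tearDown(self):
        # Reset the flag (0 assumed to be DISABLED) so other tests are unaffected.
        torch_xla._XLAC._set_torch_xla_graph_execution_check_level(0)

    def test_error_level_raises_on_tensor_sync(self):
        torch_xla._XLAC._set_torch_xla_graph_execution_check_level(2)
        t = torch.ones(2, device=xm.xla_device()) + 1
        # Reading the value forces a graph execution, which should be rejected at level 2.
        with self.assertRaises(Exception):
            t[0].item()


if __name__ == "__main__":
    unittest.main()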

Documentation (To do):

  • Add usage examples and checking-level descriptions
  • Include a troubleshooting guide for common tensor sync issues

root and others added 2 commits April 29, 2025 19:20
…t emits warning(1) or throw error(2) during tensor sync and output the python frame
@tengyifei (Collaborator) left a comment


Hi, thanks for the contribution!

What is the relationship of this feature with PT_XLA_DEBUG_LEVEL c.f. https://github.com/pytorch/xla/blob/master/docs/source/learn/troubleshoot.md#pytorchxla-debugging-tool ?

Would it be possible to improve the existing feature and/or maybe introduce a Pythonic API for it?

@aws-yyjau (Author) commented

Hi, thanks for the contribution!

What is the relationship of this feature with PT_XLA_DEBUG_LEVEL c.f. https://github.com/pytorch/xla/blob/master/docs/source/learn/troubleshoot.md#pytorchxla-debugging-tool ?

It is different from PT_XLA_DEBUG_LEVEL: this new flag is meant to be enabled during model tracing to catch unexpected code paths, whereas PT_XLA_DEBUG_LEVEL is an XLA-level debugging setting intended for general debugging purposes.

Would it be possible to improve the existing feature and/or maybe introduce a Pythonic API for it?

If there are suggestions on where this feature could be merged into existing flags/features, please let me know. Thanks.
