Contributor

@zxd1997066 zxd1997066 commented Sep 19, 2025

This PR adds more ported distributed test cases to the torch-xpu-ops CI and enables pytest-xdist for the distributed UT.

With 2 worker groups, the distributed UT run time increases to 1h22min: https://github.com/intel/torch-xpu-ops/actions/runs/18632171421/job/53144659640
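As a rough illustration of how a pytest-xdist worker could pick its own device group, here is a minimal sketch. It relies on the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets in each worker process (`gw0`, `gw1`, ...). The helper names and the list of per-group device masks are hypothetical, and whether this PR restricts devices through `ZE_AFFINITY_MASK` is an assumption, not something stated in the description.

```python
import os


def xdist_worker_index(env=None):
    """Return the numeric index of the current pytest-xdist worker.

    pytest-xdist exports PYTEST_XDIST_WORKER as "gw0", "gw1", ... inside
    each worker. Outside xdist the variable is unset, so index 0 is used.
    """
    env = os.environ if env is None else env
    worker_id = env.get("PYTEST_XDIST_WORKER", "gw0")
    if not worker_id.startswith("gw") or not worker_id[2:].isdigit():
        raise ValueError(f"unexpected worker id: {worker_id!r}")
    return int(worker_id[2:])


def mask_for_worker(group_masks, env=None):
    """Pick the device mask for this worker (hypothetical helper).

    group_masks holds one comma-separated device list per worker group,
    for example ["0,1,2,3", "4,5,6,7"].
    """
    index = xdist_worker_index(env)
    if index >= len(group_masks):
        raise RuntimeError("more xdist workers than device groups")
    return group_masks[index]


if __name__ == "__main__":
    # Assumption: the mask is exported before any device runtime starts.
    masks = ["0,1,2,3", "4,5,6,7"]
    os.environ["ZE_AFFINITY_MASK"] = mask_for_worker(masks)
    print(os.environ["ZE_AFFINITY_MASK"])
```

With two groups the run would be started with `pytest -n 2`, so that each worker maps to exactly one group.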

disable_e2e
disable_ut

@zxd1997066 zxd1997066 force-pushed the xiangdong/ported_cases branch 13 times, most recently from 0d9b54f to 85fa6f1 Compare September 25, 2025 14:29
Contributor

@chuanqi129 chuanqi129 left a comment

Please split the test scope into a CI scope and a full nightly scope.


inputs:
  ut_name:
    required: true
Contributor

Suggested change
-    required: true
+    required: false

ze = xpu_list[i+1];
} else {
ze = i;
if [ "${{ inputs.ut_name }}" == "xpu_distributed" ];then
Contributor

Are there any assumptions here? Can we detect the topology directly and dynamically on the test node?
Please consider the scenarios below:

  • No Xelink group: report a failure
  • 1 Xelink group: launch 1 worker
  • 2 Xelink groups: launch 2 workers
  • ...
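The scenarios above can be sketched as a small planning function. This is only an illustration under stated assumptions: the Xelink groups are taken as already detected (how the topology is queried on the node is not shown), `plan_workers` is a hypothetical name, and treating a group as usable only when it has at least two devices is an assumption rather than something the review states.

```python
def plan_workers(xelink_groups):
    """Return one comma-separated device list per worker to launch.

    xelink_groups: list of lists of device indices, one inner list per
    detected Xelink group (detection itself is outside this sketch).
    """
    # Assumption: a group needs at least two devices to run distributed tests.
    usable = [sorted(group) for group in xelink_groups if len(group) >= 2]
    if not usable:
        # No Xelink group: fail instead of silently running with no workers.
        raise RuntimeError("no Xelink group detected on this node")
    # One worker per group: 1 group gives 1 worker, 2 groups give 2 workers.
    return [",".join(str(device) for device in group) for group in usable]


if __name__ == "__main__":
    masks = plan_workers([[0, 1, 2, 3], [4, 5, 6, 7]])
    print(f"launch {len(masks)} workers: {masks}")
```

The length of the returned list is the worker count, so the same code path covers nodes with one, two, or more groups without hard-coding a number.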

runner:
runs-on: ${{ inputs.runner }}
name: get-runner
name: get-runner
Contributor

Why do we have this change?

@zxd1997066 zxd1997066 force-pushed the xiangdong/ported_cases branch 14 times, most recently from 3ac75d3 to 82b8ed3 Compare October 17, 2025 06:28
@zxd1997066 zxd1997066 force-pushed the xiangdong/ported_cases branch 5 times, most recently from 63799f6 to 1bcbce2 Compare October 19, 2025 15:05
@zxd1997066 zxd1997066 requested a review from chuanqi129 October 20, 2025 08:52
@zxd1997066
Contributor Author

Please split the test scope as CI scope and nightly full scope

This PR adds the CI cases first; the nightly tests will be enabled in a separate PR.

@zxd1997066 zxd1997066 force-pushed the xiangdong/ported_cases branch from 1bcbce2 to 87e933c Compare October 20, 2025 14:16
Contributor Author

zxd1997066 commented Oct 22, 2025

Some nodes seem to have issues when running the distributed tests on the second worker group. For example, on dut7362 (https://github.com/intel/torch-xpu-ops/actions/runs/18654925169/job/53261876278#step:4:741) there are extra failures, and the run stops due to an unknown issue.

A node that previously ran without these problems is gm03pappvc004: https://github.com/intel/torch-xpu-ops/actions/runs/18632171421/job/53144659640
