[CI] Add more ported distributed cases #2082
Conversation
0d9b54f to 85fa6f1 (force-push)
Please split the test scope into a CI scope and a nightly full scope.
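One possible shape for that split, as a sketch only (the `TEST_SCOPE` variable and the list file names below are illustrative, not part of this PR):

```bash
# Pick the distributed test list by scope; CI stays small, nightly runs the full set.
# TEST_SCOPE, ci_list.txt and nightly_list.txt are hypothetical names.
if [ "${TEST_SCOPE:-ci}" == "nightly" ]; then
  TEST_LIST="test/xpu/distributed/nightly_list.txt"
else
  TEST_LIST="test/xpu/distributed/ci_list.txt"
fi
echo "Running distributed cases from ${TEST_LIST}"
```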
inputs:
  ut_name:
    required: true
Suggested change:
-    required: true
+    required: false
ze = xpu_list[i+1];
} else {
  ze = i;
if [ "${{ inputs.ut_name }}" == "xpu_distributed" ];then
Are there any assumptions in here? Can we detect the topology directly and dynamically on the test node? Please consider the scenarios below (see the sketch after this list):
- No XeLink group: fail the job
- 1 XeLink group: launch 1 worker
- 2 XeLink groups: launch 2 workers
- ...
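A rough sketch of the dynamic detection being asked for (the `xpu-smi topology -m` output format, the `XL` parsing, and `run_distributed_ut.sh` are assumptions here, not the actual workflow code):

```bash
#!/bin/bash
# ASSUMPTION: the number of XeLink groups can be derived from the node
# topology, e.g. by parsing `xpu-smi topology -m`; the grep below is only
# a placeholder for that parsing.
detect_xelink_groups() {
  xpu-smi topology -m 2>/dev/null | grep -c "XL" || true
}

groups=$(detect_xelink_groups)
if [ "${groups}" -eq 0 ]; then
  echo "No XeLink group detected on this node, failing the job" >&2
  exit 1
fi

# One worker per detected group: 1 group -> 1 worker, 2 groups -> 2 workers, ...
for ((g = 0; g < groups; g++)); do
  # run_distributed_ut.sh is illustrative; each worker is pinned to its group.
  ZE_AFFINITY_MASK="${g}" bash run_distributed_ut.sh &
done
wait
```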
.github/workflows/_linux_ut.yml (Outdated)
runner:
runs-on: ${{ inputs.runner }}
name: get-runner
name: get-runner
Why do we have this change?
3ac75d3 to 82b8ed3 (force-push)
63799f6 to 1bcbce2 (force-push)
For now this PR adds cases for the CI scope; the nightly test will be enabled in another PR.
1bcbce2 to 87e933c (force-push)
It seems some nodes have issues when running the distributed test on the second group worker, e.g. dut7362 (https://github.com/intel/torch-xpu-ops/actions/runs/18654925169/job/53261876278#step:4:741): there are extra failures, and the run stops due to an unknown issue. A previously good node is gm03pappvc004 (https://github.com/intel/torch-xpu-ops/actions/runs/18632171421/job/53144659640).
This PR intends to add more ported distributed cases to the torch-xpu-ops CI, and adds pytest-xdist for the distributed UT.
The distributed UT time will increase to 1h22min with 2 work groups: https://github.com/intel/torch-xpu-ops/actions/runs/18632171421/job/53144659640
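For context on the pytest-xdist part: xdist spreads the collected tests across worker processes, so a distributed UT run with two workers would look roughly like this (the test path and worker count are illustrative, not the exact command in the workflow):

```bash
pip install pytest-xdist
# -n sets the number of worker processes; --dist=load balances tests across them.
pytest -n 2 --dist=load test/xpu/distributed/
```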
disable_e2e
disable_ut