NVIDIA #20210102.1 Pipeline Failure #441

Open
xpillons opened this issue Jan 4, 2021 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@xpillons
Collaborator

xpillons commented Jan 4, 2021

xpillons added the "bug" label Jan 4, 2021
xpillons changed the title from "NVIDIA Pipeline Failure" to "NVIDIA #20210102.1 Pipeline Failure" Jan 4, 2021
@xpillons
Collaborator Author

xpillons commented Jan 4, 2021

Manually reran the pipeline. Gen2 passed.
Gen1 failed with this error:
Resource : gpumaster - OSProvisioningTimedOut
Message : OS Provisioning for VM 'gpumaster' did not finish in the
allotted time. The VM may still finish provisioning
successfully. Please check provisioning state later. For
details on how to check current provisioning state of
Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and
Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.
None
Allocating NV12s_v3 is taking too long

@garvct
Collaborator

garvct commented Jan 25, 2021

@xpillons, I got a similar failure today running the NVIDIA pipeline.
https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10563&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=280
The time difference between "build install scripts" and the rsync error was only 2 seconds. The error is "connection refused". I believe we already check that sshd is running before trying to connect, but this does not fix the problem. If there is no quick fix for this (e.g. some additional flag), then maybe it would be worth the time to re-architect this (e.g. replace rsync with something else?). This type of error is occurring too often.

@xpillons
Collaborator Author

@edwardsp can you have a look and check why the prsync is failing? I can see in the code that ssh is tested upfront, but I'm not 100% sure about the sequence. Otherwise, maybe we should add a retry in the rsync Python wrapper function.
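
For illustration, the retry idea could look roughly like the sketch below. This is not the project's actual wrapper: `run_with_retry`, its parameters, and the injectable `runner` and `sleep` hooks are hypothetical names chosen for the example. It assumes that any non-zero exit code from the command is worth retrying, which may be too broad for a real pipeline.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=2.0, backoff=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with a growing delay while it exits non-zero.

    Hypothetical helper: returns the result of the last attempt, so the
    caller can still inspect returncode and stderr after a final failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    wait = delay
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Give the remote side time to come up before the next try.
            sleep(wait)
            wait *= backoff
    return result
```

A call might look like `run_with_retry(["rsync", "-a", "scripts/", "user@gpumaster:scripts/"])`, where the paths and user name are placeholders. A retry hides the symptom rather than fixing the ordering problem, so it fits best as a safety net next to an explicit ssh check.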

@edwardsp
Collaborator

ssh isn't tested before the initial rsync so I have just added a PR to add a test for ssh.
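
A minimal sketch of what such a check could look like, assuming it is enough to wait until the ssh port accepts TCP connections. This is not the code from the PR: `wait_for_port` and its parameters are hypothetical. A successful connect only shows that something is listening on the port, not that sshd is ready to authenticate, so it addresses the "connection refused" case specifically.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or timeout expires.

    Hypothetical helper: returns True once a connection is accepted,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed again as soon as it succeeds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers connection refused, unreachable hosts and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("gpumaster")` before the first rsync would block until the port opens or five minutes pass; the timeout and interval values here are arbitrary.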
