Commit 77ec968
fix timeout error on TCPStore creation (#147)
Summary:
Pull Request resolved: #147
Async torchsnapshotting was causing a timeout error when creating the TCPStore within PendingSnapshot on an ondemand
Using `get_socket_with_port()` from `torch.distributed.elastic.utils.distributed` resolves the issue.
Reviewed By: daniellepintz
Differential Revision: D48072665
fbshipit-source-id: a573a146f33ecec5f91ed800984e9b0f95cb2974
1 parent 71965ba commit 77ec968
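The fix described in the summary relies on a helper that hands back a socket already bound to a free port. The sketch below illustrates that general technique using only the standard library. It is not the torch implementation, and `reserve_free_port` is a hypothetical name introduced here for illustration; the assumption is that binding to port 0 lets the OS pick an unused port, which the caller then reads back from the socket.

```python
import socket


def reserve_free_port(host: str = "127.0.0.1") -> socket.socket:
    """Hypothetical helper: return a TCP socket bound to an OS-chosen free port.

    Binding to port 0 asks the operating system to pick any unused port.
    The caller keeps the socket open until the real server is about to bind,
    which narrows the window in which another process could take the port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))
    except OSError:
        sock.close()
        raise
    return sock


def pick_port(host: str = "127.0.0.1") -> int:
    """Reserve a port, read its number, then release the socket."""
    sock = reserve_free_port(host)
    try:
        return sock.getsockname()[1]
    finally:
        sock.close()
```

A caller would pass the returned port number to whatever server it starts next. Releasing the socket immediately before the server binds keeps the race window small, but it does not eliminate it.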
1 file changed, +3 -5 lines changed
Hunk 1 (original lines 5-15): original line 8 removed; new line 12 added.
Hunk 2 (original lines 63-72): original lines 66-69 removed; new lines 66-67 added.