🐛[bug] Jupyter lab / Tensorboard stuck at Waiting for ... #8198
Comments
I was able to reproduce this problem on my local machine.
|
hello, the most common cause for this symptom is a problem with the master container connecting to the task (notebook/tensorboard) container to proxy the incoming request. this is often caused by firewalls or other networking setup issues.
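One way to test this kind of connectivity is a plain TCP connect from wherever the master runs to the task container's address. The sketch below is a minimal illustration, not part of Determined: `can_connect` is a hypothetical helper, and the host and port have to be taken from whatever address the master registered for the task.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Running `can_connect(host, port)` inside the master container and again on the host would show whether the problem is specific to the master's network.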
it's peculiar that you see the issue on three different OSes, and that you both see it. are you working with one shared deployment? or have you each deployed determined separately on all of these OSes? can you tell me more about how you deployed determined / which guide you followed? (is it if you happen to share a corporate firewall / proxy setup, I'd recommend temporarily disabling it to see if it helps. |
I tried it on my private machine, with no proxy activated. I started the determined cluster using and also started an agent by running the docker-container When I start "Jupyter lab" using the UI, I see the container getting started and all pip libs being downloaded, but it hangs once "Running" is printed on the screen, as shown by @jokokojote. I'm root on my machine too. |
Hello, I tested it on different machines with different setups to isolate and understand the problem. At first, I did indeed try it on an Ubuntu machine inside a corporate network and ran determined using the master, agent and db docker containers directly (and passed proxy environment variables to the containers). The core functionalities like experiment initialization, (GPU) training, tuning, etc. worked like a charm; jupyter and tensorboard did not, yielding the same logs I added in the issue description. Firewall or proxy settings could indeed be the issue here, even though I do not understand why the agent itself worked and no errors were shown in the logs for tensorboard and jupyter. Since jupyter and tensorboard did not work on this machine, I tried it on my corporate laptop (Mac) but outside of the corporate network and set up determined just with Then I asked @KevinHubert-Dev to try it at home with a private machine and private network, and he got the same results, as he described. |
It is highly unusual to see this happen on so many different setups. I'll need your help debugging it. When you start a notebook, there'd be a "registering service" log line in the master logs, e.g.
you can |
I did what you suggested on my corporate laptop in my private network: Start up with:
Master logs:
Curl inside master gets timeout:
Jupyter container logs:
Docker containers running:
Agent logs:
|
do you have any insight why this does not work? |
Verbose mode did not yield any more information using curl:
I am not a docker expert, so maybe this is not relevant, but I was wondering why in your example
Containers running after trying to run jupyter:
Inspecting the docker networks showed that only the db and master containers are in the determined_default network. I don't know if this is intended.
Agent is in host network mode:
Jupyter container is in bridge mode:
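
To compare network membership across containers, the JSON printed by `docker network inspect <network>` can be summarized with a few lines of Python. This is only a sketch: `containers_in_network` is a hypothetical helper, and `SAMPLE` is a trimmed, illustrative stand-in with made-up IDs, shaped like Docker's output (a list of network objects, each with a `Containers` map).

```python
import json


def containers_in_network(inspect_json: str) -> dict[str, list[str]]:
    """Map each network name to the sorted names of its attached containers."""
    networks = json.loads(inspect_json)
    result = {}
    for net in networks:
        # "Containers" can be missing or null when nothing is attached.
        attached = net.get("Containers") or {}
        result[net["Name"]] = sorted(c["Name"] for c in attached.values())
    return result


# Illustrative sample only; the container IDs "aaa" and "bbb" are placeholders.
SAMPLE = """
[
  {
    "Name": "determined_default",
    "Containers": {
      "aaa": {"Name": "determined_determined-master_1"},
      "bbb": {"Name": "determined_determined-db_1"}
    }
  }
]
"""

print(containers_in_network(SAMPLE))
```

A container that runs in host or default bridge mode will not show up under `determined_default` in this summary.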
|
I was able to repro the issue with as a temporary workaround, I can suggest installing master and agent using linux packages or homebrew which should address that problem by not having master wrapped in docker. |
@jokokojote did you do your last test on macos? or on ubuntu? |
Last test was on macOS. On ubuntu I started it with:
After running |
so the ubuntu setup has the proxy configuration. this often causes problems. you'd need to set up otherwise, master and agent have this config, but the spawned containers don't. |
I ran into almost the same problem. If the master is running on a separate server, would it be possible to access the registered address 172.18.0.1? That address is only reachable from within Docker, not LAN-wide. |
Sorry, nothing comes to mind. If complex bridge networking is causing issues, you can try switching to host mode networking. Setting up local k8s clusters is also much easier nowadays, so that's another path to consider if you don't want to maintain a raw docker setup. |
I hit the same issue with the same startup.
<info> [2024-07-17 09:54:23] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:31] [fe2dc1b5] copying files to container: /run/determined
<info> [2024-07-17 09:54:37] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:43] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:46] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:52] [fe2dc1b5] Resources for JupyterLab (especially-legal-warthog) have started
<warning> [2024-07-17 09:55:00] [fe2dc1b5] Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: detected 0 gpus (nvidia-smi not found)
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: rocm-smi not found
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: detected 0 gpus (nvidia-smi not found)
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: rocm-smi not found
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: Running task container on agent_id=determined-agent-0, hostname=ec3212405320 with visible GPUs []
<info> [2024-07-17 09:55:05] [fe2dc1b5] [26] determined: detected 0 gpu processes (nvidia-smi not found)
<> [2024-07-17 09:55:05] [fe2dc1b5] + test -f /run/determined/dynamic-tcd-startup-hook.sh
<> [2024-07-17 09:55:05] [fe2dc1b5] + test -f startup-hook.sh
<> [2024-07-17 09:55:05] [fe2dc1b5] + set +x
<warning> [2024-07-17 09:55:14] [fe2dc1b5] root:jupyter is still not reachable at ('127.0.0.1', 3181)
<warning> [2024-07-17 09:55:22] [fe2dc1b5] [ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
<info> [2024-07-17 09:55:23] [fe2dc1b5] [ServerApp] Extension package jupyter_server_terminals took 0.5429s to import
<warning> [2024-07-17 09:55:24] [fe2dc1b5] root:jupyter is still not reachable at ('127.0.0.1', 3181)
<info> [2024-07-17 09:55:25] [fe2dc1b5] [ServerApp] Extension package jupyter_server_ydoc took 2.1832s to import
<warning> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_archive | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_server_fileid | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_server_terminals | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_server_ydoc | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyterlab | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] nbclassic | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] Writing Jupyter server cookie secret to /run/determined/jupyter/runtime/jupyter_cookie_secret
<info> [2024-07-17 09:55:34] [fe2dc1b5] [ServerApp] notebook_shim | extension was successfully linked.
<warning> [2024-07-17 09:55:34] [fe2dc1b5] root:jupyter is still not reachable at ('127.0.0.1', 3181)
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] notebook_shim | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_archive | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] Configured File ID manager: ArbitraryFileIdManager
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Configured root dir: /
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Configured database path: /run/determined/jupyter/data/file_id_manager.db
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Successfully connected to database file.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Creating File ID tables and indices with journal_mode = DELETE
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] Attached event listeners.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_server_fileid | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_server_terminals | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_server_ydoc | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.10/site-packages/jupyterlab
<info> [2024-07-17 09:55:35] [fe2dc1b5] [LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyterlab | extension was successfully loaded.
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] _ _ _ _
<> [2024-07-17 09:55:35] [fe2dc1b5] | | | |_ __ __| |__ _| |_ ___
<> [2024-07-17 09:55:35] [fe2dc1b5] | |_| | '_ \/ _` / _` | _/ -_)
<> [2024-07-17 09:55:35] [fe2dc1b5] \___/| .__/\__,_\__,_|\__\___|
<> [2024-07-17 09:55:35] [fe2dc1b5] |_|
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] Please note that updating to Notebook 7 might break some of your extensions.
<> [2024-07-17 09:55:35] [fe2dc1b5]
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] nbclassic | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] Serving notebooks from local directory: /
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] Jupyter Server 2.14.1 is running at:
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] https://localhost:3181/proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab?token=...
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] https://127.0.0.1:3181/proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab?token=...
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
<info> [2024-07-17 09:55:35] || INFO: Service of JupyterLab (especially-legal-warthog) is available
<warning> [2024-07-17 09:56:08] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46408): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:09] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46416): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:11] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46420): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:12] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46434): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:12] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46436): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:13] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 34224): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:37] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 52944): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:38] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 52950): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:43] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 40750): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:45] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 40752): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 40764): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] 404 GET / (@172.17.0.1) 154.75ms referer=None
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 13 ('172.17.0.1', 40772): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 13 ('172.17.0.1', 40784): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 14 ('172.17.0.1', 40790): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<info> [2024-07-17 10:01:55] [fe2dc1b5] [LabApp] 302 GET /proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab (@172.17.0.1) 1.60ms
<info> [2024-07-17 10:01:58] [fe2dc1b5] [LabApp] 302 GET /proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab (@172.17.0.1) 1.21ms |
I may have figured out why this issue happens.
I found the following:
❯ docker network inspect determined_default
[
{
"Name": "determined_default",
"Id": "744f81e72ad1f8955e795dbb07b840bfbf60bc77b82051229adf69eb33bd7dca",
"Created": "2024-07-18T07:23:24.691171616Z",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.27.0.0/16",
"Gateway": "172.27.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"256f0f6b0c8251f9f007843f29dbb27bb616b856d2c1d38988cd8293e9e6bc76": {
"Name": "determined_determined-master_1",
"EndpointID": "724f92566fd74240ccdd44efaef60a7944d0742439983999c0bc0c6c99dac1f6",
"MacAddress": "02:42:ac:1b:00:03",
"IPv4Address": "172.27.0.3/16",
"IPv6Address": ""
},
"912549e04df157dbcd8b1956b49a27c5d28b7920850f97f4d0cc6ea2766c3473": {
"Name": "determined_determined-db_1",
"EndpointID": "3c6f4c2f17749a43526a36fe173260de61b8e0f608670ef7ce087b8710163e49",
"MacAddress": "02:42:ac:1b:00:02",
"IPv4Address": "172.27.0.2/16",
"IPv6Address": ""
}
},
"Options": {},
"Labels": {}
}
]
We can see that 172.27.0.1 is the gateway of the docker network named
# 172.17.0.2 is the IP of the JupyterLab container and 172.21.53.125 is the IP of the host machine
❯ docker run --rm --network determined_default busybox telnet 172.17.0.2 2925
^Ctelnet: can't connect to remote host (172.17.0.2): Connection timed out
❯ docker run --rm busybox telnet 172.17.0.2 2925
Connected to 172.17.0.2
❯ docker run --rm --network determined_default busybox telnet 172.21.53.125 32807
Connected to 172.21.53.125
So, the conclusion is that there is something wrong with the proxy module: it needs to redirect to the right IP and port. Unfortunately, I'm not a Golang programmer. Could anyone help fix this? |
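The telnet results above line up with a simple subnet check. As a small illustration using only addresses quoted in this comment, Python's standard `ipaddress` module shows which of them fall inside the 172.27.0.0/16 subnet from the inspect output. The helper name `in_subnet` is hypothetical.

```python
import ipaddress


def in_subnet(address: str, subnet: str) -> bool:
    """Return True if address belongs to the given CIDR subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


SUBNET = "172.27.0.0/16"  # subnet reported by `docker network inspect` above

for label, addr in [
    ("master", "172.27.0.3"),
    ("JupyterLab container", "172.17.0.2"),
    ("host", "172.21.53.125"),
]:
    print(f"{label:22} {addr:15} in {SUBNET}: {in_subnet(addr, SUBNET)}")
```

Only the master's address is inside that subnet, so the JupyterLab container is not on the same docker network as the master.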
Describe the bug
I am not sure if this is a bug or I missed some basic config step, but I checked the docs multiple times and did not find any information about this:
Jupyter lab and tensorboard are stuck at "Waiting for ..." after the docker container started successfully without any errors shown in the logs.
Tried with 0.26.1, 0.26.0, 0.25.1 and 0.21.2 on MacOS, Ubuntu and Windows.
TensorBoard 0.26.1 logs:
Jupyter 0.26.1 logs:
Jupyter 0.21.2 logs:
Reproduction Steps
1. det deploy local cluster-up --no-gpu
2.a. Open the UI: Tasks -> launch Jupyter
OR
2.b.1 Run an experiment e.g. gan_mnist_pytorch with
det experiment create const.yaml .
2.b.2 Open the UI, open the experiment, open tensorboard
Expected Behavior
The UI for Jupyter lab / tensorboard should open after a short waiting time (or at least a meaningful error message should show up).
Screenshot
Environment
Additional Context
No response