Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Unable to start a stopped multi-server cluster #1452

Open
cjc7373 opened this issue Jun 11, 2024 · 0 comments
Open

[BUG] Unable to start a stopped multi-server cluster #1452

cjc7373 opened this issue Jun 11, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@cjc7373
Copy link
Contributor

cjc7373 commented Jun 11, 2024

What did you do

  • How was the cluster created?

    • k3d cluster create multi-2 --servers 5
  • What did you do afterwards?

    • k3d cluster stop multi-2
    • k3d cluster start multi-2

What did you expect to happen

cluster starts normally

Screenshots or terminal output

the process stucks here:

INFO[0001] Starting node 'k3d-multi-2-server-1'
DEBU[0002] Truncated 2024-06-11 08:02:45.185124838 +0000 UTC to 2024-06-11 08:02:45 +0000 UTC
DEBU[0002] Waiting for node k3d-multi-2-server-1 to get ready (Log: 'k3s is up and running')
TRAC[0002] NodeWaitForLogMessage: Node 'k3d-multi-2-server-1' waiting for log message 'k3s is up and running' since '2024-06-11 08:02:45 +0000 UTC'
docker logs k3d-multi-2-server-1 2>&1 | less
time="2024-06-11T08:01:40Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2024-06-11T08:01:40.922826Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"7093034
7b5d5e687","rtt":"0s","error":"dial tcp 172.20.0.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:40.922907Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"369e2fa
3a81b859c","rtt":"0s","error":"dial tcp 172.20.0.7:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:40.922921Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"70930347b5d
5e687","rtt":"0s","error":"dial tcp 172.20.0.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:40.922926Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ef68f96
2f1358108","rtt":"0s","error":"dial tcp 172.20.0.5:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:40.922931Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"369e2fa3a81
b859c","rtt":"0s","error":"dial tcp 172.20.0.7:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:40.952634Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"ef68f962f13
58108","rtt":"0s","error":"dial tcp 172.20.0.5:2380: connect: no route to host"}
Error from server (ServiceUnavailable): apiserver not ready
time="2024-06-11T08:01:42Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
{"level":"info","ts":"2024-06-11T08:01:43.904177Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 is starting a new election at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.904256Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 became pre-candidate at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.904278Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 received MsgPreVoteResp from b37ae33eccf57701 at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.904309Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 [logterm: 3, index: 27609] sent MsgPreVote request to 369e2fa3a81b859c at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.904322Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 [logterm: 3, index: 27609] sent MsgPreVote request to 59187da497ed4d13 at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.904326Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 [logterm: 3, index: 27609] sent MsgPreVote request to 70930347b5d5e687 at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.90433Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 [logterm: 3, index: 27609] sent MsgPreVote request to ef68f962f1358108 at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.904877Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 received MsgPreVoteResp from 59187da497ed4d13 at term 3"}
{"level":"info","ts":"2024-06-11T08:01:43.9049Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 has received 2 MsgPreVoteResp votes and 0 vote rejections"}
{"level":"info","ts":"2024-06-11T08:01:44.19295Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b37ae33eccf57701 [logterm: 3, index: 27609, vote: 59187da497ed4d13] cast MsgPreVote for 59187da497ed4d13 [logterm: 3, index: 27609] at term 3"}
Error from server (ServiceUnavailable): apiserver not ready
{"level":"warn","ts":"2024-06-11T08:01:45.195281Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"b37ae33eccf57701","local-member-attributes":"{Name:k3d-multi-2-server-1-41c018be ClientURLs:[https://172.20.0.4:2379]}","request-path":"/0/members/b37ae33eccf57701/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
{"level":"warn","ts":"2024-06-11T08:01:45.923723Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"369e2fa3a81b859c","rtt":"0s","error":"dial tcp 172.20.0.7:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:45.923836Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"70930347b5d5e687","rtt":"0s","error":"dial tcp 172.20.0.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:45.923792Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"369e2fa3a81b859c","rtt":"0s","error":"dial tcp 172.20.0.7:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:45.923829Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"70930347b5d5e687","rtt":"0s","error":"dial tcp 172.20.0.6:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:45.923852Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ef68f962f1358108","rtt":"0s","error":"dial tcp 172.20.0.5:2380: connect: no route to host"}
{"level":"warn","ts":"2024-06-11T08:01:45.952734Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"ef68f962f13
58108","rtt":"0s","error":"dial tcp 172.20.0.5:2380: connect: no route to host"}

Additional information

This issue is similar to #452, only to have more servers.

I tried to run the test_multi_master_start_stop.sh e2e test with the change that increases server number from 3 to 5. It also failed.

make e2e E2E_INCLUDE=test_multi_master_start_stop
Running e2e tests
LOG_LEVEL="WARN" E2E_INCLUDE="test_multi_master_start_stop" E2E_EXCLUDE="" E2E_EXTRA="" E2E_RUNNER_START_TIMEOUT=10 E2E_HELPER_IMAGE_TAG="" E2E_KEEP="" E2E_PARALLEL="" E2E_DIND_VERSION="" E2E_K3S_VERSION="" E2E_FAIL_FAST="" tests/dind.sh "5.6.3-30-g5028a65c"
2024-06-11 08:54:13 UTC [DIND] >>> Building container image k3d:5.6.3-30-g5028a65c-dind
sha256:188161529a2cad1d30f050b199c8ff8f2cdd636b4cd08d13eaf49e7f6df24a3a
2024-06-11 08:55:29 UTC [DIND] >>> Starting E2E Test Runner container k3d-e2e-runner-240611165529
true
true
true
true
true
true
/bin/bash: line 1: go: command not found
/bin/bash: line 1: go: command not found
/bin/bash: line 1: golangci-lint: command not found
Building docker image ghcr.io/k3d-io/k3d-proxy:5.6.3-30-g5028a65c
DOCKER_BUILDKIT=1 docker build --quiet --no-cache proxy/ -f proxy/Dockerfile -t ghcr.io/k3d-io/k3d-proxy:5.6.3-30-g5028a65c
Building docker image ghcr.io/k3d-io/k3d-tools:5.6.3-30-g5028a65c
DOCKER_BUILDKIT=1 docker build --quiet --no-cache tools/ -f tools/Dockerfile -t ghcr.io/k3d-io/k3d-tools:5.6.3-30-g5028a65c --build-arg GIT_TAG=v5.6.3-30-g5028a65c
sha256:8451944beeefaedaa50a1bced69ccc4abb22dd2d3cbba42558614429cf02b74f
sha256:2f7ee9fe38e4e45b6069f6a358aaee2719b8a2511c79a47c48cb2a9004f1f774
Start time outside runner: Tue Jun 11 16:56:45 CST 2024
2024-06-11 08:56:45 UTC [Runner] >>> Checking test environment...
arch: aarch64
cgroupdriver: cgroupfs
cgroupversion: "2"
endpoint: /var/run/docker.sock
filesystem: extfs
infoname: 498900b9097b
name: docker
os: Alpine Linux v3.19 (containerized)
ostype: linux
version: 25.0.3
2024-06-11 08:56:45 UTC [Runner] >>> Preparing filesystem and environment...
2024-06-11 08:56:45 UTC [Runner] >>> Start time inside runner: Tue Jun 11 08:56:45 UTC 2024
2024-06-11 08:56:45 UTC [Runner] >>> ######################################
2024-06-11 08:56:45 UTC [Runner] >>> # BASIC TESTS (/src/tests/test_*.sh) #
2024-06-11 08:56:45 UTC [Runner] >>> ######################################
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_basic *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_config_file *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_config_file_from_stdin *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_config_file_migration *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_config_with_overrides *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_full_lifecycle *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_ipam *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_loadbalancer *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_memory_limits *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_multi_master *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_node_edit *****
2024-06-11 08:56:45 UTC [Runner] >>> ***** Skipping test_registry *****
2024-06-11 08:56:45 UTC [Runner] >>> Running 1 tests in 1 batches à max. 4 tests.
2024-06-11 08:56:45 UTC [Runner] >>> Starting test batch #1/1...
2024-06-11 08:56:45 UTC [Runner] >>> ***** Running test_multi_master_start_stop *****
2024-06-11 09:05:13 UTC [Runner] >>> WARN: FAILED LOGS: /root/testoutput/test_multi_master_start_stop.sh.log.failed
2024-06-11 09:05:13 UTC [Runner] >>> FAILED -> /root/testoutput/test_multi_master_start_stop.sh.log.failed
2024-06-11 09:05:13 UTC [Runner] >>> ***************************************************
2024-06-11 09:05:13 UTC [Runner] >>> *** Failed Test test_multi_master_start_stop
2024-06-11 09:05:13 UTC [Runner] >>> ***************************************************
2024-06-11 08:56:45 UTC [Test | multi-server-start-stop | ] >>> Creating cluster multiserverstartstop  ...
kubectl cluster-info
2024-06-11 08:58:38 UTC [Test | multi-server-start-stop | ] >>> Checking that we have access to the cluster...
/root/.config/k3d/kubeconfig-multiserverstartstop.yaml
Kubernetes control plane is running at https://0.0.0.0:38773
CoreDNS is running at https://0.0.0.0:38773/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://0.0.0.0:38773/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
2024-06-11 08:58:38 UTC [Test | multi-server-start-stop | ] >>> cluster multiserverstartstop is reachable
2024-06-11 08:58:38 UTC [Test | multi-server-start-stop | ] >>> Sleeping for 5 seconds to give the cluster enough time to get ready...
2024-06-11 08:58:43 UTC [Test | multi-server-start-stop | ] >>> Checking that we have 3 server nodes online...
/root/.config/k3d/kubeconfig-multiserverstartstop.yaml
2024-06-11 08:58:43 UTC [Test | multi-server-start-stop | ] >>> cluster multiserverstartstop has 5 nodes, as expected
2024-06-11 08:58:43 UTC [Test | multi-server-start-stop | ] >>> Stopping cluster...
2024-06-11 08:59:08 UTC [Test | multi-server-start-stop | ] >>> Waiting for a bit...
2024-06-11 08:59:13 UTC [Test | multi-server-start-stop | ] >>> Restarting cluster (time: 2024-06-11 08:59:13 UTC)...
FATA[2024-06-11T09:05:13Z] Failed to start server k3d-multiserverstartstop-server-1: Node k3d-multiserverstartstop-server-1 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-multiserverstartstop-server-1': stopped returning log lines: context deadline exceeded
2024-06-11 09:05:13 UTC [Test | multi-server-start-stop | ] >>> failed to restart cluster (timeout 360s)
2024-06-11 09:05:13 UTC [Test | multi-server-start-stop | ] >>> FATAL: Test | multi-server-start-stop | : test failed
2024-06-11 09:05:13 UTC [Runner] >>> FINISHED BASIC TESTS
  > Total:	13
  > Run:	1
  > Skipped:	12
  > Passed:	0
  > Failed:	1
2024-06-11 09:05:13 UTC [Runner] >>> WARN: Failed BASIC TESTS: 1
2024-06-11 09:05:13 UTC [DIND] >>> Cleaning up test container k3d-e2e-runner-240611165529
498900b9097bed1974a79b6676f78c59e7ff32dba38d7bbb927ef1f7cd33c0cf
make: *** [e2e] Error 1

I have skimmed the patch in #467 and suspect the cause is not enough nodes to reach quorum. Because server number is 5, quorum is 3. And when starting the second node, etcd can't get enough vote, so the start process hungs.

Which OS & Architecture

MacOS with Docker Desktop

arch: aarch64
cgroupdriver: cgroupfs
cgroupversion: "2"
endpoint: /var/run/docker.sock
filesystem: extfs
infoname: docker-desktop
name: docker
os: Docker Desktop
ostype: linux
version: 26.1.1

Which version of k3d

k3d version v5.6.3
k3s version v1.28.8-k3s1 (default)

Which version of docker

docker version
Client:
 Cloud integration: v1.0.35+desktop.13
 Version:           26.1.1
 API version:       1.45
 Go version:        go1.21.9
 Git commit:        4cf5afa
 Built:             Tue Apr 30 11:44:56 2024
 OS/Arch:           darwin/arm64
 Context:           desktop-linux

Server: Docker Desktop 4.30.0 (149282)
 Engine:
  Version:          26.1.1
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.9
  Git commit:       ac2de55
  Built:            Tue Apr 30 11:48:04 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.31
  GitCommit:        e377cd56a71523140ca6ae87e30244719194a521
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
@cjc7373 cjc7373 added the bug Something isn't working label Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant