CI Failure (couldn't reach admin endpoint) in ControllerSnapshotPolicyTest.test_upgrade_auto_enable #18802

Closed
vbotbuildovich opened this issue Jun 5, 2024 · 4 comments · Fixed by #21274
Labels: auto-triaged, ci-failure

vbotbuildovich commented Jun 5, 2024

https://buildkite.com/redpanda/vtools/builds/14278

Module: rptest.tests.controller_snapshot_test
Class: ControllerSnapshotPolicyTest
Method: test_upgrade_auto_enable
test_id:    ControllerSnapshotPolicyTest.test_upgrade_auto_enable
status:     FAIL
run time:   165.109 seconds

TimeoutError("couldn't reach admin endpoint for ducktape-node-20-manually-casual-sheepdog")
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/controller_snapshot_test.py", line 99, in test_upgrade_auto_enable
    self.redpanda.wait_for_membership(first_start=False)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2544, in wait_for_membership
    wait_until(lambda: {n
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2544, in <lambda>
    wait_until(lambda: {n
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2546, in <setcomp>
    if self.registered(n)} == self._started,
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 4004, in registered
    node_id = self.node_id(node, force_refresh=True)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1471, in node_id
    node_id = wait_until_result(
  File "/home/ubuntu/redpanda/tests/rptest/util.py", line 94, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: couldn't reach admin endpoint for ducktape-node-20-manually-casual-sheepdog

JIRA Link: CORE-3209

vbotbuildovich added the auto-triaged and ci-failure labels on Jun 5, 2024

travisdowns changed the title from "CI Failure (key symptom) in ControllerSnapshotPolicyTest.test_upgrade_auto_enable" to "CI Failure (couldn't reach admin endpoint) in ControllerSnapshotPolicyTest.test_upgrade_auto_enable" on Jun 23, 2024

ztlpn commented Jul 3, 2024

The actual reason is a redpanda crash caused by the following assert failure:

ERROR 2024-06-18 13:03:38,038 [shard 1:main] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-05d2b322fee6f478f-1/redpanda/redpanda/src/v/cluster/rm_stm.cc:1530) 'false' unsupported tx_snapshot_header version 3

cc @bharathv
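
Since the timeout in the title was only a symptom of the crash, one way to surface the real cause is to scan the node log for assert lines. The following is a minimal sketch, not part of rptest: `find_assert_failures` and the regular expression are hypothetical, and the pattern is an assumption derived only from the single log line quoted above.

```python
import re

# Assumed format, inferred from the one assert line quoted in this thread:
#   ... assert - Assert failure: (<file>:<line>) '<condition>' <message>
ASSERT_RE = re.compile(
    r"assert - Assert failure: \((?P<file>[^():]+):(?P<line>\d+)\) "
    r"'(?P<cond>[^']*)' (?P<msg>.*)"
)


def find_assert_failures(log_lines):
    """Return (source file name, line number, message) for each assert line."""
    failures = []
    for line in log_lines:
        match = ASSERT_RE.search(line)
        if match:
            source_file = match["file"].rsplit("/", 1)[-1]
            failures.append((source_file, int(match["line"]), match["msg"].strip()))
    return failures


# The assert line quoted above, used as sample input.
sample_log = [
    "ERROR 2024-06-18 13:03:38,038 [shard 1:main] assert - Assert failure: "
    "(/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-05d2b322fee6f478f-1"
    "/redpanda/redpanda/src/v/cluster/rm_stm.cc:1530) 'false' "
    "unsupported tx_snapshot_header version 3",
]
print(find_assert_failures(sample_log))
```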

bharathv commented Jul 3, 2024

As I mentioned offline, I think the issue is also partly due to a non-linear upgrade path: 23.1.x -> 23.2.x -> 24.2.x, which effectively skips 23.3.x and 24.1.x.

Snapshots are (re)written when a new segment is created, which should naturally happen after an upgrade because of the term change. In a linear upgrade path, 23.3.x and 24.1.x would rewrite the snapshots with version=4, which dev (v24.2.x) can handle. I'm not saying this is foolproof by any means: one could ask what happens if the snapshot rewrite fails at the beginning and we immediately proceed to the next upgrade. That is unlikely to happen in practice, though? Also, I think it is not realistic to maintain support for all historical versions given all the recent code churn.
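
To make the skipped-release argument concrete, here is a minimal sketch that reports which feature releases an upgrade path jumps over. `skipped_releases` is a hypothetical helper, not something in rptest, and the `RELEASES` list is an assumption built only from the versions named in this comment.

```python
# Assumed ordered list of feature releases, taken from the versions
# mentioned in this thread (23.1.x through 24.2.x).
RELEASES = [(23, 1), (23, 2), (23, 3), (24, 1), (24, 2)]


def skipped_releases(path, releases=RELEASES):
    """Return the feature releases that an upgrade path jumps over."""
    skipped = []
    for current, target in zip(path, path[1:]):
        start, end = releases.index(current), releases.index(target)
        # Everything strictly between two consecutive hops was never run,
        # so it never had a chance to rewrite snapshots.
        skipped.extend(releases[start + 1:end])
    return skipped


# The path described above: 23.1.x -> 23.2.x -> 24.2.x
test_path = [(23, 1), (23, 2), (24, 2)]
print(skipped_releases(test_path))  # [(23, 3), (24, 1)]
```

An empty result would mean every intermediate release ran at least once and so had a chance to rewrite snapshots before the next hop.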

mmaslankaprv commented

We agreed to delete the test, as it is no longer relevant.
