
Ignoring raft_topology errors during rolling upgrade #9511

Open
timtimb0t opened this issue Dec 9, 2024 · 4 comments · May be fixed by #9584
Assignees
Labels
on_core_qa tasks that should be solved by Core QA team

Comments

@timtimb0t (Contributor)

Packages

Base Scylla version: 6.2.1-20241106.a3a0ffbcd015 with build-id 94e4419682b4191f7c37e2d6bf02f4fa7988dff3
Target Scylla version (or git commit hash): 6.3.0~dev-20241206.7e2875d6489d with build-id 5227dd2a3fce4d2beb83ec6c17d47ad2e8ba6f5c

Kernel Version: 6.8.0-1017-gcp

Issue description

Errors like the following, which occur during the rolling upgrade test, may be ignored:

2024-12-07 06:44:14.987 <2024-12-07 06:44:12.541>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=afd96633-fe01-4632-a902-62cbc0a645e4: type=RUNTIME_ERROR regex=std::runtime_error line_number=135539 node=rolling-upgrade--ubuntu-focal-db-node-0bb842d4-0-3
2024-12-07T06:44:12.541+00:00 rolling-upgrade--ubuntu-focal-db-node-0bb842d4-0-3      !ERR | scylla[13464]:  [shard  0: gms] raft_topology - drain rpc failed, proceed to fence old writes: std::runtime_error (raft topology: exec_global_command(barrier_and_drain) failed with seastar::rpc::closed_error (connection is closed))

An SCT workaround is needed to handle/ignore such errors.
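Conceptually, the workaround amounts to adding the new error pattern to the test's list of ignorable log events. The following is a minimal, hypothetical sketch of such regex-based filtering; the names (`IGNORE_PATTERNS`, `should_ignore`) are illustrative and not the actual SCT filter API, which the linked PR would extend instead.

```python
import re

# Hypothetical ignore list; an actual fix would extend SCT's own event
# filters rather than use this standalone helper.
IGNORE_PATTERNS = [
    # Benign raft_topology drain RPC failure seen during rolling upgrade (#9511)
    re.compile(r"raft_topology - drain rpc failed, proceed to fence old writes"),
]

def should_ignore(log_line: str) -> bool:
    """Return True if the log line matches a known-benign upgrade-time error."""
    return any(pattern.search(log_line) for pattern in IGNORE_PATTERNS)

upgrade_error = (
    "scylla[13464]:  [shard  0: gms] raft_topology - drain rpc failed, "
    "proceed to fence old writes: std::runtime_error "
    "(raft topology: exec_global_command(barrier_and_drain) failed with "
    "seastar::rpc::closed_error (connection is closed))"
)
print(should_ignore(upgrade_error))        # True
print(should_ignore("unrelated error"))    # False
```

Using `search` rather than `match` lets the pattern hit anywhere in the line, which matters because Scylla log lines carry shard and subsystem prefixes before the message text.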

Impact

No impact; only an SCT enhancement is needed.

How frequently does it reproduce?


Installation details

Cluster size: 4 nodes (n2-highmem-32)

Scylla Nodes used in this run:

  • rolling-upgrade--ubuntu-focal-db-node-0bb842d4-0-4 (34.138.125.36 | 10.142.0.229) (shards: 30)
  • rolling-upgrade--ubuntu-focal-db-node-0bb842d4-0-3 (34.73.88.37 | 10.142.0.223) (shards: 30)
  • rolling-upgrade--ubuntu-focal-db-node-0bb842d4-0-2 (34.75.166.53 | 10.142.0.220) (shards: 30)
  • rolling-upgrade--ubuntu-focal-db-node-0bb842d4-0-1 (34.148.43.129 | 10.142.0.204) (shards: 30)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylladb-6-2-1 (gce: undefined_region)

Test: rolling-upgrade-gce-image-test
Test id: 0bb842d4-dff5-49a9-aa6a-abe328dcd1aa
Test name: scylla-master/rolling-upgrade/rolling-upgrade-gce-image-test
Test method: upgrade_test.UpgradeTest.test_rolling_upgrade
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 0bb842d4-dff5-49a9-aa6a-abe328dcd1aa
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 0bb842d4-dff5-49a9-aa6a-abe328dcd1aa

Logs:

Jenkins job URL
Argus

@timtimb0t timtimb0t added the on_core_qa tasks that should be solved by Core QA team label Dec 11, 2024
@kbr-scylla
Wasn't this addressed by #9352?
cc @enaydanov @aleksbykov

@aleksbykov (Contributor)

This error was not included in the ignore list; a new PR is needed to update it.
@timtimb0t, please add this template and run the job several times to catch other possible error messages.

@roydahan (Contributor)

@timtimb0t are you working on a "fix" for it?

@timtimb0t (Contributor, Author)

@roydahan, yes, the fix itself is ready; I am testing it.
