c-s fails to prepare test table during disrupt_truncate nemesis, but the test continues and starts the disruption #8722

Open
dimakr opened this issue Sep 16, 2024 · 6 comments · May be fixed by #9585
Labels: Bug (Something isn't working right)

@dimakr (Contributor) commented Sep 16, 2024

At the beginning of the disrupt_truncate nemesis, the test keyspace/table is prepared with the following c-s (cassandra-stress) command:

< t:2024-09-14 12:46:50,787 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup n=400000 cl=QUORUM -mode native cql3  user=cassandra password=cassandra -schema keyspace=ks_truncate 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -log interval=5 -transport 'truststore=/etc/scylla/ssl_conf/truststore.jks truststore-password=cassandra' -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.14 -errors skip-unsupported-columns

The command fails with the error:

WARN  [cluster1-nio-worker-5] 2024-09-14 12:46:55,827 RequestHandler.java:303 - Query '[0 bound values] CREATE KEYSPACE IF NOT EXISTS "ks_truncate" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'replication_factor' : '3'} AND durable_writes = true;' generated server side warning(s): Tables in this keyspace will be replicated using Tablets and will not support CDC, LWT and counters features. To use CDC, LWT or counters, drop this keyspace and re-create it without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.
WARN  [cluster1-worker-1] 2024-09-14 12:46:56,840 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
WARN  [cluster1-worker-2] 2024-09-14 12:46:57,282 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
java.lang.RuntimeException: Encountered exception creating schema
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:105)
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:74)
	at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:230)
	at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
	at org.apache.cassandra.stress.Stress.run(Stress.java:143)
	at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException: Datacenter eastus_nemesis_dc doesn't have enough token-owning nodes for replication_factor=3
	at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:38)
	at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:27)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:310)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
	at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:215)
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:94)
        ... 5 more         

Even though the c-s command failed, the nemesis continues and starts the truncate disruption, which fails with:

Command: '/usr/bin/cqlsh --no-color -u cassandra -p \'cassandra\'  --request-timeout=600 --connect-timeout=60 --ssl -e "TRUNCATE ks_truncate.standard1 USING TIMEOUT 600s" 10.0.0.8'
Exit code: 2
Stdout:
Stderr:
Warning: Using a password on the command line interface can be insecure.
Recommendation: use the credentials file to securely provide the password.
<stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table standard1"

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-7 (null | 10.0.0.5) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-6 (null | 10.0.0.7) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-5 (null | 10.0.0.14) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-4 (null | 10.0.0.8) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-3 (null | 10.0.0.7) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-2 (null | 10.0.0.6) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-1 (null | 10.0.0.5) (shards: 14)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-6.2.0-dev-x86_64-2024-09-13T02-56-40 (azure: undefined_region)

Test: longevity-1tb-5days-azure-test
Test id: ce64f53c-084b-4445-8b62-784fa80adf1c
Test name: scylla-master/tier1/longevity-1tb-5days-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor ce64f53c-084b-4445-8b62-784fa80adf1c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ce64f53c-084b-4445-8b62-784fa80adf1c

Logs:

Jenkins job URL
Argus

@roydahan roydahan added the Bug Something isn't working right label Nov 10, 2024
@roydahan (Contributor) commented:

We should have a quick fix to catch this, raise an error, and exit the nemesis.

@fruch (Contributor) commented Nov 12, 2024

  • _prepare_test_table should check the get_results or verify_results output and raise an exception if the stress command failed.
  • replication_factor=3 is used by _prepare_test_table, but the replication should be datacenter-based, with the factor matched to the number of nodes in each datacenter (see the sketch below).
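
A minimal sketch of the datacenter-aware idea, assuming SCT exposes the node-to-datacenter mapping; the cluster.nodes and node.datacenter names below are illustrative placeholders, not the actual SCT API:

```python
# Illustrative sketch only: cluster.nodes and node.datacenter are assumed,
# hypothetical names, not the real SCT attributes; the cap of 3 mirrors the
# current default replication factor.
from collections import Counter

def build_replication_option(cluster, max_rf=3):
    """Build a cassandra-stress -schema replication() option where every
    datacenter gets a factor no larger than its own node count."""
    nodes_per_dc = Counter(node.datacenter for node in cluster.nodes)
    per_dc = ",".join(f"{dc}={min(max_rf, count)}" for dc, count in nodes_per_dc.items())
    return f"replication(strategy=NetworkTopologyStrategy,{per_dc})"
```

For the failing run above this would produce per-DC factors along the lines of eastus=3,eastus_nemesis_dc=1 instead of a blanket replication_factor=3, so keyspace creation would not request 3 replicas from a single-node datacenter (zero-token DCs would still need separate handling, as discussed further below).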

@fruch (Contributor) commented Nov 12, 2024

The 2nd point was addressed in:
dd07de6

@timtimb0t (Contributor) commented:

Reproduced in the following run:

Packages

Scylla version: 6.3.0~dev-20241122.e2e6f4f441be with build-id 2493a7aae1f855d3df502197f757822b6afc1033

Kernel Version: 6.8.0-1019-aws

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-8 (13.60.219.172 | 10.0.1.187) (shards: 2)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-7 (13.40.120.58 | 10.3.2.153) (shards: 2)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-6 (3.8.117.118 | 10.3.0.21) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-5 (52.56.176.71 | 10.3.3.165) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-4 (18.171.61.14 | 10.3.3.77) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-3 (54.246.249.15 | 10.4.3.81) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-2 (34.244.59.175 | 10.4.0.81) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-1 (108.129.92.105 | 10.4.0.78) (shards: 14)

OS / Image: ami-001a2091244fdbdf3 ami-0f2a8365c9e541aa6 ami-0345a6812dbca92fe (aws: undefined_region)

Test: longevity-multi-dc-rack-aware-zero-token-dc-test
Test id: 6d4393cc-c118-450c-a7d9-76fc5fab9e7f
Test name: scylla-master/tier1/longevity-multi-dc-rack-aware-zero-token-dc-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 6d4393cc-c118-450c-a7d9-76fc5fab9e7f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 6d4393cc-c118-450c-a7d9-76fc5fab9e7f

Logs:

Jenkins job URL
Argus

@soyacz (Contributor) commented Nov 25, 2024

Reproduced in the same run as reported above (test id 6d4393cc-c118-450c-a7d9-76fc5fab9e7f).

This nemesis is not supported by zero-token nodes (znodes); a fix is on the way: #9342

@fruch (Contributor) commented Dec 3, 2024

All stress commands running from within nemesis code should check their results and fail early.
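
A minimal sketch of that fail-early check, assuming SCT-style helper names (run_stress_thread, get_stress_results) purely for illustration; the real method names and signatures may differ:

```python
# Illustrative sketch only: run_stress_thread and get_stress_results are
# assumed, SCT-style names, not the exact API used by the nemesis code.
class NemesisStressError(Exception):
    """Raised when a stress command started from nemesis code fails."""

def run_stress_and_verify(tester, stress_cmd):
    """Run a stress command from nemesis code and abort on failure instead of
    letting the disruption continue against a half-prepared cluster."""
    stress_queue = tester.run_stress_thread(stress_cmd=stress_cmd)
    results = tester.get_stress_results(queue=stress_queue)
    if not results:
        # An empty result set means the stress command did not finish
        # successfully, e.g. schema creation failed as in the c-s error above.
        raise NemesisStressError(f"Stress command failed, aborting nemesis: {stress_cmd}")
    return results
```

_prepare_test_table (and any other nemesis helper that spawns stress threads) could then go through such a wrapper, so a failed preparation raises immediately instead of letting disrupt_truncate run against a table that was never created.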

dimakr added a commit to dimakr/scylla-cluster-tests that referenced this issue Dec 18, 2024
Abort the nemesis flow early when a stress command fails, as continuing it is
often invalid due to the cluster being in an unexpected state. For example, if
the _prepare_test_table routine fails, subsequent steps that depend on the test
table (or attempt disruptions on top of it) will also fail.

This change adds a check for the results of stress commands triggered within the nemesis
code, ensuring the nemesis halts early if a stress command is unsuccessful.

Fixes: scylladb#8722
@dimakr dimakr linked a pull request Dec 18, 2024 that will close this issue
dimakr added a commit to dimakr/scylla-cluster-tests that referenced this issue Dec 20, 2024
dimakr added a commit to dimakr/scylla-cluster-tests that referenced this issue Dec 20, 2024