c-s fails to prepare test table during disrupt_truncate nemesis, but the test continues and starts the disruption #8722

Open
dimakr opened this issue Sep 16, 2024 · 6 comments · May be fixed by #9585
Labels: Bug (Something isn't working right)

@dimakr (Contributor) commented Sep 16, 2024

At the beginning of the disrupt_truncate nemesis, the test keyspace/table is prepared with the following c-s (cassandra-stress) command:

< t:2024-09-14 12:46:50,787 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup n=400000 cl=QUORUM -mode native cql3  user=cassandra password=cassandra -schema keyspace=ks_truncate 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -log interval=5 -transport 'truststore=/etc/scylla/ssl_conf/truststore.jks truststore-password=cassandra' -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.14 -errors skip-unsupported-columns

The command fails with the error:

WARN  [cluster1-nio-worker-5] 2024-09-14 12:46:55,827 RequestHandler.java:303 - Query '[0 bound values] CREATE KEYSPACE IF NOT EXISTS "ks_truncate" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'replication_factor' : '3'} AND durable_writes = true;' generated server side warning(s): Tables in this keyspace will be replicated using Tablets and will not support CDC, LWT and counters features. To use CDC, LWT or counters, drop this keyspace and re-create it without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.
WARN  [cluster1-worker-1] 2024-09-14 12:46:56,840 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
WARN  [cluster1-worker-2] 2024-09-14 12:46:57,282 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
java.lang.RuntimeException: Encountered exception creating schema
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:105)
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:74)
	at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:230)
	at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
	at org.apache.cassandra.stress.Stress.run(Stress.java:143)
	at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException: Datacenter eastus_nemesis_dc doesn't have enough token-owning nodes for replication_factor=3
	at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:38)
	at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:27)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:310)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
	at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:215)
	at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:94)
        ... 5 more         

Even though the c-s command failed, the nemesis continues and starts the truncate disruption, which fails with:

Command: '/usr/bin/cqlsh --no-color -u cassandra -p \'cassandra\'  --request-timeout=600 --connect-timeout=60 --ssl -e "TRUNCATE ks_truncate.standard1 USING TIMEOUT 600s" 10.0.0.8'
Exit code: 2
Stdout:
Stderr:
Warning: Using a password on the command line interface can be insecure.
Recommendation: use the credentials file to securely provide the password.
<stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table standard1"

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-7 (null | 10.0.0.5) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-6 (null | 10.0.0.7) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-5 (null | 10.0.0.14) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-4 (null | 10.0.0.8) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-3 (null | 10.0.0.7) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-2 (null | 10.0.0.6) (shards: 14)
  • longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-1 (null | 10.0.0.5) (shards: 14)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-6.2.0-dev-x86_64-2024-09-13T02-56-40 (azure: undefined_region)

Test: longevity-1tb-5days-azure-test
Test id: ce64f53c-084b-4445-8b62-784fa80adf1c
Test name: scylla-master/tier1/longevity-1tb-5days-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor ce64f53c-084b-4445-8b62-784fa80adf1c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ce64f53c-084b-4445-8b62-784fa80adf1c

Logs:

Jenkins job URL
Argus

@roydahan roydahan added the Bug Something isn't working right label Nov 10, 2024
@roydahan (Contributor) commented:

We should have a quick fix to catch this, raise an error, and exit the nemesis.

@fruch (Contributor) commented Nov 12, 2024

  • _prepare_test_table should check the get_results or verify_results output and raise an exception if the stress command failed.
  • replication_factor=3 is used by _prepare_test_table, but the replication should be datacenter-based, with the factor matched to the number of nodes in each datacenter (see the sketch below).
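
A minimal sketch of the datacenter-aware idea, assuming SCT exposes the node-to-datacenter mapping; the cluster.nodes and node.datacenter names below are illustrative placeholders, not the actual SCT API:

```python
# Illustrative sketch only: cluster.nodes and node.datacenter are assumed,
# hypothetical names, not the real SCT attributes; the cap of 3 mirrors the
# current default replication factor.
from collections import Counter

def build_replication_option(cluster, max_rf=3):
    """Build a cassandra-stress -schema replication() option where every
    datacenter gets a factor no larger than its own node count."""
    nodes_per_dc = Counter(node.datacenter for node in cluster.nodes)
    per_dc = ",".join(f"{dc}={min(max_rf, count)}" for dc, count in nodes_per_dc.items())
    return f"replication(strategy=NetworkTopologyStrategy,{per_dc})"
```

For the failing run above this would produce per-DC factors along the lines of eastus=3,eastus_nemesis_dc=1 instead of a blanket replication_factor=3, so keyspace creation would not request 3 replicas from a single-node datacenter (zero-token DCs would still need separate handling, as discussed further below).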

@fruch (Contributor) commented Nov 12, 2024

The 2nd point was addressed in:
dd07de6

@timtimb0t (Contributor) commented:

Reproduced in the following run:

Packages

Scylla version: 6.3.0~dev-20241122.e2e6f4f441be with build-id 2493a7aae1f855d3df502197f757822b6afc1033

Kernel Version: 6.8.0-1019-aws

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-8 (13.60.219.172 | 10.0.1.187) (shards: 2)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-7 (13.40.120.58 | 10.3.2.153) (shards: 2)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-6 (3.8.117.118 | 10.3.0.21) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-5 (52.56.176.71 | 10.3.3.165) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-4 (18.171.61.14 | 10.3.3.77) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-3 (54.246.249.15 | 10.4.3.81) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-2 (34.244.59.175 | 10.4.0.81) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-1 (108.129.92.105 | 10.4.0.78) (shards: 14)

OS / Image: ami-001a2091244fdbdf3 ami-0f2a8365c9e541aa6 ami-0345a6812dbca92fe (aws: undefined_region)

Test: longevity-multi-dc-rack-aware-zero-token-dc-test
Test id: 6d4393cc-c118-450c-a7d9-76fc5fab9e7f
Test name: scylla-master/tier1/longevity-multi-dc-rack-aware-zero-token-dc-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 6d4393cc-c118-450c-a7d9-76fc5fab9e7f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 6d4393cc-c118-450c-a7d9-76fc5fab9e7f

Logs:

Jenkins job URL
Argus

@soyacz (Contributor) commented Nov 25, 2024

Reproduced in the same run as reported above (test id 6d4393cc-c118-450c-a7d9-76fc5fab9e7f).

This nemesis is not supported by zero-token nodes (znodes); a fix is on the way: #9342

@fruch (Contributor) commented Dec 3, 2024

All stress commands running from within nemesis code should check their results and fail early.
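
A minimal sketch of that fail-early check, assuming SCT-style helper names (run_stress_thread, get_stress_results) purely for illustration; the real method names and signatures may differ:

```python
# Illustrative sketch only: run_stress_thread and get_stress_results are
# assumed, SCT-style names, not the exact API used by the nemesis code.
class NemesisStressError(Exception):
    """Raised when a stress command started from nemesis code fails."""

def run_stress_and_verify(tester, stress_cmd):
    """Run a stress command from nemesis code and abort on failure instead of
    letting the disruption continue against a half-prepared cluster."""
    stress_queue = tester.run_stress_thread(stress_cmd=stress_cmd)
    results = tester.get_stress_results(queue=stress_queue)
    if not results:
        # An empty result set means the stress command did not finish
        # successfully, e.g. schema creation failed as in the c-s error above.
        raise NemesisStressError(f"Stress command failed, aborting nemesis: {stress_cmd}")
    return results
```

_prepare_test_table (and any other nemesis helper that spawns stress threads) could then go through such a wrapper, so a failed preparation raises immediately instead of letting disrupt_truncate run against a table that was never created.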

dimakr added a commit to dimakr/scylla-cluster-tests that referenced this issue Dec 18, 2024
Abort the nemesis flow early when a stress command fails, as continuing it is
often invalid due to the cluster being in an unexpected state. For example, if
the _prepare_test_table routine fails, subsequent steps that depend on the test
table (or attempt disruptions on top of it) will also fail.

This change adds a check for the results of stress commands triggered within the nemesis
code, ensuring the nemesis halts early if a stress command is unsuccessful.

Fixes: scylladb#8722
@dimakr dimakr linked a pull request Dec 18, 2024 that will close this issue
dimakr added a commit to dimakr/scylla-cluster-tests that referenced this issue Dec 20, 2024
dimakr added a commit to dimakr/scylla-cluster-tests that referenced this issue Dec 20, 2024