
Improve Manager snapshot preparer helper #9837

Open

wants to merge 4 commits into base: master
Conversation

@mikliapko (Contributor) commented Jan 15, 2025

refs:
scylladb/scylla-manager#4190
scylladb/scylla-manager#4208

The PR addresses two main topics:

  1. Manager snapshot preparer helper extension
  • Dataset parameters (defined in the c-s command) are no longer hardcoded, which makes it possible to vary dataset properties widely (replication, compaction, cl, rf, etc.).
    The manager_restore_benchmark_snapshots.yaml file, which holds the snapshot configuration for the current benchmark tests, is extended to include these parameters. They are later used at the C-S read command generation stage.
    As a result, the snapshot preparation test becomes broadly customizable, with more flexibility to adjust the dataset.
  • Generated backup snapshot details are sent to Argus Results instead of being logged, which improves usability.
  2. Cloud clusters support
  • test_prepare_backup_snapshot is extended so the test can run either against a standard cluster created via SCT or against clusters created via siren-tests (Cloud cluster).

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

The new approach makes the snapshot preparation process more flexible,
allowing the whole scope of c-s write command parameters to be varied;
the full list of them is provided in manager_snapshots_preparer_config.yaml.

Default values of the config YAML parameters can be overridden by
passing the SCT_MGMT_SNAPSHOTS_PREPARER_PARAMS parameter in the Jenkins
job, as the sketch below illustrates.
The snapshot generation process is improved so that dataset parameters
(defined in the c-s command) are no longer hardcoded, which makes it
possible to vary dataset properties widely (replication, compaction,
cl, rf, etc.).
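The merge of defaults and overrides could look roughly like the sketch below; the function name, config path, and override format are illustrative assumptions, not the actual SCT implementation.

```python
# Hypothetical sketch only: load_preparer_params, the config path, and the
# override handling are assumptions; the real SCT merging logic may differ.
import yaml


def load_preparer_params(config_path: str, overrides: dict | None = None) -> dict:
    """Load c-s parameter defaults from the YAML config and apply overrides."""
    with open(config_path) as config_file:
        params = yaml.safe_load(config_file)
    # Values passed via SCT_MGMT_SNAPSHOTS_PREPARER_PARAMS in the Jenkins
    # job take precedence over the YAML defaults.
    params.update(overrides or {})
    return params


# Example: override only the consistency level and replication factor.
params = load_preparer_params(
    "defaults/manager_snapshots_preparer_config.yaml",
    overrides={"cl": "QUORUM", "rf": 5},
)
```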

The manager_restore_benchmark_snapshots.yaml file with the snapshot
config for the current benchmark tests is extended to include these
parameters. They are later used at the C-S read command generation
stage.

Thus, the snapshot preparation test becomes broadly customizable, with
more flexibility to adjust the dataset.
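For illustration, the read command generation stage boils down to a plain str.format over the template; the placeholder names below come from cs_read_cmd_template in the diff further down, while the concrete values are just sample inputs.

```python
# Illustrative sketch: the template string matches the new
# cs_read_cmd_template from the YAML diff below; the values filled in here
# are sample inputs, not the benchmark defaults.
cs_read_cmd_template = (
    "cassandra-stress read cl={cl} n={num_of_rows} "
    "-schema 'keyspace={keyspace_name} "
    "replication(strategy={replication},replication_factor={rf}) "
    "compaction(strategy={compaction})' "
    "-mode cql3 native -rate threads=500 "
    "-col 'size=FIXED({col_size}) n=FIXED({col_n})' "
    "-pop seq={sequence_start}..{sequence_end}"
)

cs_read_cmd = cs_read_cmd_template.format(
    cl="ONE",
    num_of_rows=1_000_000,
    keyspace_name="keyspace1",
    replication="NetworkTopologyStrategy",
    rf=3,
    compaction="SizeTieredCompactionStrategy",
    col_size=1024,
    col_n=1,
    sequence_start=1,
    sequence_end=1_000_000,
)
```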
@mikliapko mikliapko added the backport/none Backport is not required label Jan 15, 2025
@mikliapko mikliapko self-assigned this Jan 15, 2025
@mikliapko mikliapko force-pushed the improve-snapshot-preparer-test branch 2 times, most recently from f587573 to de09454 on January 15, 2025 15:48
@mikliapko mikliapko marked this pull request as ready for review January 15, 2025 16:55
@mikliapko mikliapko force-pushed the improve-snapshot-preparer-test branch from de09454 to b5c0adf on January 17, 2025 14:43
Extend test_prepare_backup_snapshot so the test can run either against
a standard cluster created via SCT or against clusters created via
siren-tests (Cloud cluster).

Changes:
- backup location: for a Cloud cluster it is taken from the default
backup task automatically scheduled by Manager.
- copy the S3 bucket for a Cloud cluster, since the original bucket is
deleted together with the cluster during cluster removal.
To improve the usability of the snapshot preparer tool, generated backup
snapshot details are sent to Argus Results, where they can be retrieved.

This eliminates the need to browse the log file looking for details.
@mikliapko (Contributor Author)

@scylladb/qa-maintainers
Could somebody please take a look?

cl=cs_cmd_params.get("cl").lower(),
col_size=cs_cmd_params.get("col_size"),
col_n=cs_cmd_params.get("col_n"),
scylla_version=re.sub(r"[-.]", "_", self.params.get("scylla_version")),
Contributor

This is the wrong way to get the Scylla version.

When using an AMI, this config option will have a None value,
or it may hold the value latest.

The existing get_version_based_on_conf method should be used instead:
https://github.com/scylladb/scylla-cluster-tests/blame/8f2162aa/sdcm/sct_config.py#L2708

Contributor Author

I'm not sure I got it.

get_version_based_on_conf is called as part of init_and_verify_sct_config, where it sets self.scylla_version, which is what I'm trying to retrieve in this code.

Or do you mean that get_version_based_on_conf should be called once again:

scylla_version = self.params.get_version_based_on_conf()[0]

because during the very first call in init_and_verify_sct_config it might be missing some information?

Contributor

param.get('scylla_version') vs param.scylla_version:

the first reads what is set in the configuration;

the second is the computed one you mentioned.

I agree it's a bit confusing. In a PR where I was trying to refactor sct_config.py to use pydantic, I was changing the name (but that PR is more complicated than I assumed).

Contributor

> I'm not sure I got it.
>
> get_version_based_on_conf is called as part of init_and_verify_sct_config, where it sets self.scylla_version, which is what I'm trying to retrieve in this code.
>
> Or do you mean that get_version_based_on_conf should be called once again:
>
> scylla_version = self.params.get_version_based_on_conf()[0]
>
> because during the very first call in init_and_verify_sct_config it might be missing some information?

Calling param.get('scylla_version') reads the config option value as-is, not the calculated one.

scylla_version = self.params.get_version_based_on_conf()[0] is the calculated one and considers multiple cases.
So I recommend using this one.
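A minimal sketch of the suggested fix, combining the computed version with the sanitization already present in the snippet above (assumed to run inside the test class, where self.params is available):

```python
import re

# Use the computed version, which handles the AMI/None and "latest" cases,
# instead of reading the raw config option.
scylla_version = self.params.get_version_based_on_conf()[0]
scylla_version_label = re.sub(r"[-.]", "_", scylla_version)
```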

@@ -1,110 +1,164 @@
 bucket: "manager-backup-tests-permanent-snapshots-us-east-1"
-cs_read_cmd_template: "cassandra-stress read cl=ONE n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq={sequence_start}..{sequence_end}"
+cs_read_cmd_template: "cassandra-stress read cl={cl} n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy={replication},replication_factor={rf}) compaction(strategy={compaction})' -mode cql3 native -rate threads=500 -col 'size=FIXED({col_size}) n=FIXED({col_n})' -pop seq={sequence_start}..{sequence_end}"
Contributor

Why threads=500?
The number of threads should depend on the loader's CPU cores.
Assuming a common loader has 8 CPU cores, I expect 500 threads to be inefficient due to context switches.

Contributor Author

What approximations/rules can be used to define the optimal number of threads depending on the loader's CPU cores?

Contributor

In the common case (not specific to C-S), it should be like we do it for Scylla: n_cpus - 1 (e.g. 7 threads for 8 CPU cores).
Having a 1-to-1 relationship between CPU core and thread avoids context switches (heavy operations).
Having a 1-to-many relationship between CPU core and threads guarantees CPU context switches.
It is quite possible that some app performs faster with a single thread than with two threads.

So, answering your question is a matter of implementation and measurement; a sketch of the rule of thumb follows below.

Probably @roydahan or @soyacz may advise something here.
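A minimal sketch of that rule of thumb; how the loader's CPU count is obtained is left as an assumption:

```python
def cs_threads_for_loader(loader_cpu_count: int) -> int:
    # Rule of thumb from the discussion above: roughly one thread per
    # core, keeping one core free, instead of a hardcoded threads=500.
    return max(1, loader_cpu_count - 1)


# Example: an 8-core loader would run cassandra-stress with threads=7.
threads = cs_threads_for_loader(8)
rate_option = f"-rate threads={threads}"
```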

LOCALRUNNER.run(f"aws s3 mb s3://{name} --region {region}")

@staticmethod
def create_s3_bucket_gce(name: str, region: str) -> None:
Contributor

s3 is an Amazon-specific service name.
In Google's case, the service is called Google Cloud Storage.
So I think it is better to remove the terminology confusion here.

LOCALRUNNER.run(f"aws s3 sync s3://{source} s3://{destination} --acl {acl}")

@staticmethod
def sync_s3_buckets_gce(source: str, destination: str) -> None:
Contributor

Same here.
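A sketch of what the rename could look like for both GCE variants; the method names are only suggestions, and the import path for LOCALRUNNER is assumed:

```python
from sdcm.remote import LOCALRUNNER  # assumed import path


class StorageBucketHelpers:  # illustrative container for the renamed methods

    @staticmethod
    def create_storage_bucket_gce(name: str, region: str) -> None:
        # GCS buckets are addressed via gs:// and created with gsutil mb;
        # -l sets the bucket location.
        LOCALRUNNER.run(f"gsutil mb -l {region} gs://{name}")

    @staticmethod
    def sync_storage_buckets_gce(source: str, destination: str) -> None:
        # gsutil rsync -r recursively mirrors the source bucket into the
        # destination bucket.
        LOCALRUNNER.run(f"gsutil rsync -r gs://{source} gs://{destination}")
```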

@scylladbbot

@mikliapko new branch branch-2025.1 was added, please add backport label if needed
