
Improve Manager snapshot preparer helper #9837

Open

wants to merge 4 commits into base: master
Conversation

@mikliapko (Contributor) commented Jan 15, 2025

refs:
scylladb/scylla-manager#4190
scylladb/scylla-manager#4208

The PR addresses two main topics:

  1. Manager snapshot preparer helper extension
  • Dataset parameters (defined in the c-s command) are no longer hardcoded, which makes it possible to vary dataset properties widely (replication, compaction, cl, rf, etc.).
    The manager_restore_benchmark_snapshots.yaml file, which holds the snapshot configuration for the current benchmark tests, is extended to include these parameters. They are later used at the C-S read command generation stage.
    As a result, the snapshot preparation test becomes broadly customizable, with more flexibility to adjust the dataset.
  • Generated backup snapshot details are sent to Argus Results instead of being logged, which improves usability.
  2. Cloud clusters support
  • test_prepare_backup_snapshot is extended so the test can run either against a standard cluster created via SCT or against clusters created via siren-tests (Cloud cluster).

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

The new approach makes the snapshot preparation process more flexible,
allowing the whole scope of c-s write command parameters to be varied;
the full list of them is provided in manager_snapshots_preparer_config.yaml.

Default values of the config YAML parameters can be overridden by
passing the SCT_MGMT_SNAPSHOTS_PREPARER_PARAMS parameter in the Jenkins
job, as the sketch below illustrates.
The snapshot generation process is improved so that dataset parameters
(defined in the c-s command) are no longer hardcoded, which makes it
possible to vary dataset properties widely (replication, compaction,
cl, rf, etc.).
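The merge of defaults and overrides could look roughly like the sketch below; the function name, config path, and override format are illustrative assumptions, not the actual SCT implementation.

```python
# Hypothetical sketch only: load_preparer_params, the config path, and the
# override handling are assumptions; the real SCT merging logic may differ.
import yaml


def load_preparer_params(config_path: str, overrides: dict | None = None) -> dict:
    """Load c-s parameter defaults from the YAML config and apply overrides."""
    with open(config_path) as config_file:
        params = yaml.safe_load(config_file)
    # Values passed via SCT_MGMT_SNAPSHOTS_PREPARER_PARAMS in the Jenkins
    # job take precedence over the YAML defaults.
    params.update(overrides or {})
    return params


# Example: override only the consistency level and replication factor.
params = load_preparer_params(
    "defaults/manager_snapshots_preparer_config.yaml",
    overrides={"cl": "QUORUM", "rf": 5},
)
```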

The manager_restore_benchmark_snapshots.yaml file with the snapshot
config for the current benchmark tests is extended to include these
parameters. They are later used at the C-S read command generation
stage.

Thus, the snapshot preparation test becomes broadly customizable, with
more flexibility to adjust the dataset.
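For illustration, the read command generation stage boils down to a plain str.format over the template; the placeholder names below come from cs_read_cmd_template in the diff further down, while the concrete values are just sample inputs.

```python
# Illustrative sketch: the template string matches the new
# cs_read_cmd_template from the YAML diff below; the values filled in here
# are sample inputs, not the benchmark defaults.
cs_read_cmd_template = (
    "cassandra-stress read cl={cl} n={num_of_rows} "
    "-schema 'keyspace={keyspace_name} "
    "replication(strategy={replication},replication_factor={rf}) "
    "compaction(strategy={compaction})' "
    "-mode cql3 native -rate threads=500 "
    "-col 'size=FIXED({col_size}) n=FIXED({col_n})' "
    "-pop seq={sequence_start}..{sequence_end}"
)

cs_read_cmd = cs_read_cmd_template.format(
    cl="ONE",
    num_of_rows=1_000_000,
    keyspace_name="keyspace1",
    replication="NetworkTopologyStrategy",
    rf=3,
    compaction="SizeTieredCompactionStrategy",
    col_size=1024,
    col_n=1,
    sequence_start=1,
    sequence_end=1_000_000,
)
```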
@mikliapko mikliapko added the backport/none Backport is not required label Jan 15, 2025
@mikliapko mikliapko self-assigned this Jan 15, 2025
@mikliapko mikliapko force-pushed the improve-snapshot-preparer-test branch 2 times, most recently from f587573 to de09454 on January 15, 2025 15:48
@mikliapko mikliapko marked this pull request as ready for review January 15, 2025 16:55
@mikliapko mikliapko force-pushed the improve-snapshot-preparer-test branch from de09454 to b5c0adf on January 17, 2025 14:43
Extend test_prepare_backup_snapshot so the test can run either against
a standard cluster created via SCT or against clusters created via
siren-tests (Cloud cluster).

Changes:
- backup location: for a Cloud cluster it is taken from the default
backup task automatically scheduled by Manager.
- copy the S3 bucket for a Cloud cluster, since the original bucket is
deleted together with the cluster during cluster removal.
To improve the usability of the snapshot preparer tool, generated backup
snapshot details are sent to Argus Results, where they can be retrieved.

This eliminates the need to browse the log file looking for details.
@mikliapko (Contributor Author)

@scylladb/qa-maintainers
Could somebody please take a look?

cl=cs_cmd_params.get("cl").lower(),
col_size=cs_cmd_params.get("col_size"),
col_n=cs_cmd_params.get("col_n"),
scylla_version=re.sub(r"[-.]", "_", self.params.get("scylla_version")),
Contributor

This is the wrong way to get the Scylla version.

When using an AMI, this config option will have a None value,
or it may hold the value latest.

The existing get_version_based_on_conf method should be used instead:
https://github.com/scylladb/scylla-cluster-tests/blame/8f2162aa/sdcm/sct_config.py#L2708

Contributor Author

I'm not sure I got it.

get_version_based_on_conf is called as part of init_and_verify_sct_config, where it sets self.scylla_version, which is what I'm trying to retrieve in this code.

Or do you mean that get_version_based_on_conf should be called once again:

scylla_version = self.params.get_version_based_on_conf()[0]

because during the very first call in init_and_verify_sct_config it might be missing some information?

Contributor

param.get('scylla_version') vs param.scylla_version:

the first reads what is set in the configuration;

the second is the computed one you mentioned.

I agree it's a bit confusing. In a PR where I was trying to refactor sct_config.py to use pydantic, I was changing the name (but that PR is more complicated than I assumed).

Contributor

> I'm not sure I got it.
>
> get_version_based_on_conf is called as part of init_and_verify_sct_config, where it sets self.scylla_version, which is what I'm trying to retrieve in this code.
>
> Or do you mean that get_version_based_on_conf should be called once again:
>
> scylla_version = self.params.get_version_based_on_conf()[0]
>
> because during the very first call in init_and_verify_sct_config it might be missing some information?

Calling param.get('scylla_version') reads the config option value as-is, not the calculated one.

scylla_version = self.params.get_version_based_on_conf()[0] is the calculated one and considers multiple cases.
So I recommend using this one.
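A minimal sketch of the suggested fix, combining the computed version with the sanitization already present in the snippet above (assumed to run inside the test class, where self.params is available):

```python
import re

# Use the computed version, which handles the AMI/None and "latest" cases,
# instead of reading the raw config option.
scylla_version = self.params.get_version_based_on_conf()[0]
scylla_version_label = re.sub(r"[-.]", "_", scylla_version)
```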

@@ -1,110 +1,164 @@
 bucket: "manager-backup-tests-permanent-snapshots-us-east-1"
-cs_read_cmd_template: "cassandra-stress read cl=ONE n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq={sequence_start}..{sequence_end}"
+cs_read_cmd_template: "cassandra-stress read cl={cl} n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy={replication},replication_factor={rf}) compaction(strategy={compaction})' -mode cql3 native -rate threads=500 -col 'size=FIXED({col_size}) n=FIXED({col_n})' -pop seq={sequence_start}..{sequence_end}"
Contributor

Why threads=500?
The number of threads should depend on the loader's CPU cores.
Assuming a common loader has 8 CPU cores, I expect 500 threads to be inefficient due to context switches.

Contributor Author

What approximations/rules can be used to define the optimal number of threads depending on the loader's CPU cores?

Contributor

In the common case (not specific to C-S), it should be like we do it for Scylla: n_cpus - 1 (e.g. 7 threads for 8 CPU cores).
Having a 1-to-1 relationship between CPU core and thread avoids context switches (heavy operations).
Having a 1-to-many relationship between CPU core and threads guarantees CPU context switches.
It is quite possible that some app performs faster with a single thread than with two threads.

So, answering your question is a matter of implementation and measurement; a sketch of the rule of thumb follows below.

Probably @roydahan or @soyacz may advise something here.
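A minimal sketch of that rule of thumb; how the loader's CPU count is obtained is left as an assumption:

```python
def cs_threads_for_loader(loader_cpu_count: int) -> int:
    # Rule of thumb from the discussion above: roughly one thread per
    # core, keeping one core free, instead of a hardcoded threads=500.
    return max(1, loader_cpu_count - 1)


# Example: an 8-core loader would run cassandra-stress with threads=7.
threads = cs_threads_for_loader(8)
rate_option = f"-rate threads={threads}"
```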

LOCALRUNNER.run(f"aws s3 mb s3://{name} --region {region}")

@staticmethod
def create_s3_bucket_gce(name: str, region: str) -> None:
Contributor

s3 is an Amazon-specific service name.
In Google's case, the service is called Google Cloud Storage.
So I think it is better to remove the terminology confusion here.

LOCALRUNNER.run(f"aws s3 sync s3://{source} s3://{destination} --acl {acl}")

@staticmethod
def sync_s3_buckets_gce(source: str, destination: str) -> None:
Contributor

Same here.
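A sketch of what the rename could look like for both GCE variants; the method names are only suggestions, and the import path for LOCALRUNNER is assumed:

```python
from sdcm.remote import LOCALRUNNER  # assumed import path


class StorageBucketHelpers:  # illustrative container for the renamed methods

    @staticmethod
    def create_storage_bucket_gce(name: str, region: str) -> None:
        # GCS buckets are addressed via gs:// and created with gsutil mb;
        # -l sets the bucket location.
        LOCALRUNNER.run(f"gsutil mb -l {region} gs://{name}")

    @staticmethod
    def sync_storage_buckets_gce(source: str, destination: str) -> None:
        # gsutil rsync -r recursively mirrors the source bucket into the
        # destination bucket.
        LOCALRUNNER.run(f"gsutil rsync -r gs://{source} gs://{destination}")
```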

@scylladbbot

@mikliapko new branch branch-2025.1 was added, please add backport label if needed
