
test(backup): Measure read/write latency during backup #9605

Open · wants to merge 1 commit into master from backup-baseline-with-readwrite

Conversation

kreuzerkrieg
Contributor

@kreuzerkrieg kreuzerkrieg commented Dec 23, 2024

Read/write latency during backup using rclone

Test requirements

  • 50% reads and 50% writes.
  • Compaction enabled.
  • 75%-85% CPU utilization (see discussion)
  • Expect up to 6-7ms P99 read latencies (see discussion)
  • Expect no more than 10 ms read latency during backup

Argus run
https://argus.scylladb.com/tests/scylla-cluster-tests/b6a76dc0-a6a8-47b8-9dc0-5eff576866e5
Results:

|  | P90 write [ms] | P90 read [ms] | P99 write [ms] | P99 read [ms] | Throughput write [op/s] | Throughput read [op/s] | Duration [HH:MM:SS] | Start time | Overview | QA dashboard |
|---|---|---|---|---|---|---|---|---|---|---|
| Cycle #1 | $${\color{green}1.71}$$ | $${\color{green}1.89}$$ | $${\color{red}1719.66}$$ | $${\color{red}1737.49}$$ | 124,209 | 124,233 | $${\color{green}00:10:14}$$ | 10:14:50 | view | view |
| Cycle #2 | $${\color{green}0.60}$$ | $${\color{green}0.69}$$ | $${\color{green}5.51}$$ | $${\color{green}5.74}$$ | 124,306 | 124,365 | $${\color{green}00:10:14}$$ | 10:36:23 | view | view |
|  | Backup time [HH:MM:SS] |
|---|---|
| Backup | 00:08:51 |
| Backup during read stress | 00:17:28 |

We can safely say that neither reads nor writes are affected by the backup.

@bhalevy / @regevran FYI

@kreuzerkrieg kreuzerkrieg added the backport/none Backport is not required label Dec 23, 2024
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from a3d1d6d to f003207 Compare December 23, 2024 11:52
@kreuzerkrieg kreuzerkrieg marked this pull request as ready for review December 23, 2024 12:33
@mikliapko mikliapko requested review from a team and removed request for rayakurl December 23, 2024 13:58
@mikliapko
Contributor

Added qa-maintainers to Reviewers

Resolved review comments on: mgmt_cli_test.py, configurations/manager/100GB_dataset.yaml
@soyacz
Contributor

soyacz commented Dec 23, 2024

If you have expectations for the metrics (like p99's below 8ms), consider using ValidationRule(fixed_limit=8), and similarly for other metrics. See sdcm.argus_results.ManagerRestoreBanchmarkResult for an example.
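
(For illustration, a minimal sketch of what such a rule could look like. This is not the actual SCT code: it assumes ValidationRule is importable from sdcm.argus_results as suggested above, and that rules are declared per result-table column the way ManagerRestoreBanchmarkResult does; the column names below are illustrative.)

```python
# Minimal sketch; assumptions noted above, not actual SCT code.
from sdcm.argus_results import ValidationRule

# Hypothetical per-column rules: flag the run when a P99 latency
# column exceeds the fixed limit (milliseconds).
validation_rules = {
    "P99 read": ValidationRule(fixed_limit=8),
    "P99 write": ValidationRule(fixed_limit=8),
}
```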

@kreuzerkrieg
Contributor Author

> If you have expectations for the metrics (like p99's below 8ms), consider using ValidationRule(fixed_limit=8), and similarly for other metrics. See sdcm.argus_results.ManagerRestoreBanchmarkResult for an example.

I strongly oppose this type of validation, as it may cause more harm than good. Occasionally, factors like "noisy neighbors" can cause failures, leading people to spend days troubleshooting. Moreover, this validation is specific to a particular AWS instance type; using any other instance type will result in failure.

@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from 70d7c22 to 22c48dc Compare December 29, 2024 14:40
Contributor

@roydahan roydahan left a comment


See my comments about the stress commands. Also, it needs to have a Jenkins pipeline.

Resolved review comments on: configurations/manager/100GB_dataset.yaml, mgmt_cli_test.py
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from bc63818 to 848b5e2 Compare December 30, 2024 17:12
@kreuzerkrieg
Contributor Author

> See my comments about the stress commands. Also, it needs to have a Jenkins pipeline.

What are we going to achieve by adding (cloning) a new pipeline?

@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from b9ef7fb to bb7483e Compare December 30, 2024 17:30
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from e90ffb6 to b5e40e0 Compare December 30, 2024 19:59
"cassandra-stress write cl=ALL n=26214400 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=52428801..78643200",
"cassandra-stress write cl=ALL n=26214400 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=78643201..104857600" ]

stress_cmd: "cassandra-stress mixed cl=QUORUM duration=10m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=100 fixed=17000/s' -col 'size=FIXED(1024) n=FIXED(1)'"
Contributor


What did you base the "17000/s" on?
A typical max-throughput test for a mixed workload using i4i.4xlarge achieves around 500K op/s.
4 loaders running 17K op/s each equals 68K op/s, which is about 13% of the max throughput.
(BTW, to achieve that we use 620 threads and bigger loaders; you can probably start with c7i.2xlarge.)

Lastly, I think that 10m may be too short to get consistent results.
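
(Spelling out that arithmetic in a quick sketch; all figures come from the comment above, not from new measurements:)

```python
# Check of the throughput figures quoted above.
loaders = 4
per_loader_rate = 17_000           # op/s per loader, from the stress_cmd under review
max_throughput = 500_000           # op/s, typical i4i.4xlarge mixed-workload max

total = loaders * per_loader_rate  # 68,000 op/s
print(f"{total:,} op/s = {total / max_throughput:.1%} of max")  # -> 68,000 op/s = 13.6% of max
```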

Contributor Author

@kreuzerkrieg kreuzerkrieg Jan 2, 2025


> What did you base the "17000/s" on? A typical max-throughput test for a mixed workload using i4i.4xlarge achieves around 500K op/s. 4 loaders running 17K op/s each equals 68K op/s, which is about 13% of the max throughput. (BTW, to achieve that we use 620 threads and bigger loaders; you can probably start with c7i.2xlarge.)

OK, then how do I calculate the number of threads and op/s to get to 6-7 ms latency?

> Lastly, I think that 10m may be too short to get consistent results.

How long would be considered OK-ish?

Contributor


> OK, then how do I calculate the number of threads and op/s to get to 6-7 ms latency?

Is your target to measure latency during the backup operation, or to measure the duration of the operation under a specific load?
If you want to measure latency, you don't know the latency :)
We usually measure latency under a workload that is 50% of max throughput.
In your case it would be something like "-rate 'threads=620 fixed=62500/s'".
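
(A quick sketch of where the 62500/s figure comes from, assuming the 500K op/s max-throughput number and the 4 loaders mentioned earlier in the thread:)

```python
# Derivation of the suggested per-loader rate; figures from the discussion above.
max_throughput = 500_000   # op/s, typical mixed-workload max on i4i.4xlarge
target_fraction = 0.5      # latency is usually measured at 50% of max throughput
loaders = 4

per_loader = max_throughput * target_fraction / loaders
print(f"-rate 'threads=620 fixed={per_loader:.0f}/s'")  # -> fixed=62500/s per loader
```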

Contributor


> How long would be considered OK-ish?

How long does the backup operation take?

Contributor Author


> > How long would be considered OK-ish?
>
> How long does the backup operation take?

10 min

Contributor


This should probably be located under performance, not under manager.

Contributor


Which brings me to the question: why didn't you just replicate the same test we have for latency_with_operations and replace the operation with a manager backup nemesis?
If needed, you could just define a new manager backup nemesis.
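
(A hypothetical sketch of such a nemesis, assuming the sdcm.nemesis convention of a Nemesis subclass whose disrupt() method runs a single operation; the class name and the disrupt_mgmt_backup helper are assumptions, not confirmed SCT API:)

```python
# Hypothetical sketch only; not confirmed SCT code.
from sdcm.nemesis import Nemesis

class MgmtBackupNemesis(Nemesis):
    """Runs a Scylla Manager backup as the 'operation' in a
    latency_with_operations-style test (illustrative)."""
    disruptive = False  # assumption: a backup should not be disruptive

    def disrupt(self):
        # Assumed helper: triggers a Manager backup task and waits for it,
        # so latency is sampled while the backup is running.
        self.disrupt_mgmt_backup()
```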

@regevran

regevran commented Jan 5, 2025

Please note that this test is meant to create a baseline: something to compare against when we run the same test based on Scylla core backup capabilities.
I prefer focusing on making sure that this test is OK in what it tests, and that it can be duplicated and changed into a core-Scylla test.

@regevran regevran closed this Jan 5, 2025
@regevran regevran reopened this Jan 5, 2025
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from c9c7c9a to fd413b0 Compare January 5, 2025 13:08
@soyacz
Contributor

soyacz commented Jan 7, 2025

@regevran see, you can apply validation rules that will fail the test if a value exceeds the error threshold (configurable; see the validation-rule examples in other tests). For setting custom rules for latency_decorator_calculator, see the example.

BTW, after rebasing you can remove the word "mixed" from the test name and add workload_name: mixed to the config.

@scylladbbot

@kreuzerkrieg new branch branch-2025.1 was added, please add backport label if needed

Labels
backport/none Backport is not required
7 participants