
test(backup): Measure read/write latency during backup #9605

Open · wants to merge 1 commit into master from backup-baseline-with-readwrite

Conversation

kreuzerkrieg
Contributor

@kreuzerkrieg kreuzerkrieg commented Dec 23, 2024

Read/write latency during backup using rclone

Test requirements

  • 50% reads and 50% writes.
  • Compaction enabled.
  • 75%-85% CPU utilization (see discussion)
  • Expect up to 6-7ms P99 read latencies (see discussion)
  • Expect no more than 10 ms read latency during backup

Argus run
https://argus.scylladb.com/tests/scylla-cluster-tests/b6a76dc0-a6a8-47b8-9dc0-5eff576866e5
Results:

|  | P90 write [ms] | P90 read [ms] | P99 write [ms] | P99 read [ms] | Throughput write [op/s] | Throughput read [op/s] | Duration [HH:MM:SS] | Start time | Overview | QA dashboard |
|---|---|---|---|---|---|---|---|---|---|---|
| Cycle #1 | $${\color{green}1.71}$$ | $${\color{green}1.89}$$ | $${\color{red}1719.66}$$ | $${\color{red}1737.49}$$ | 124,209 | 124,233 | $${\color{green}00:10:14}$$ | 10:14:50 | view | view |
| Cycle #2 | $${\color{green}0.60}$$ | $${\color{green}0.69}$$ | $${\color{green}5.51}$$ | $${\color{green}5.74}$$ | 124,306 | 124,365 | $${\color{green}00:10:14}$$ | 10:36:23 | view | view |
|  | Backup time [HH:MM:SS] |
|---|---|
| Backup | 00:08:51 |
| Backup during read stress | 00:17:28 |

We can safely say that neither reads nor writes are affected by the backup.

@bhalevy / @regevran FYI

@kreuzerkrieg kreuzerkrieg added the backport/none Backport is not required label Dec 23, 2024
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from a3d1d6d to f003207 Compare December 23, 2024 11:52
@kreuzerkrieg kreuzerkrieg marked this pull request as ready for review December 23, 2024 12:33
@mikliapko mikliapko requested review from a team and removed request for rayakurl December 23, 2024 13:58
@mikliapko
Contributor

Added qa-maintainers to Reviewers

Resolved review comments on: mgmt_cli_test.py, configurations/manager/100GB_dataset.yaml
@soyacz
Contributor

soyacz commented Dec 23, 2024

If you have expectations for the metrics (like p99's below 8ms), consider using ValidationRule(fixed_limit=8), and similarly for other metrics. See sdcm.argus_results.ManagerRestoreBanchmarkResult for an example.
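
(For illustration, a minimal sketch of what such a rule could look like. This is not the actual SCT code: it assumes ValidationRule is importable from sdcm.argus_results as suggested above, and that rules are declared per result-table column the way ManagerRestoreBanchmarkResult does; the column names below are illustrative.)

```python
# Minimal sketch; assumptions noted above, not actual SCT code.
from sdcm.argus_results import ValidationRule

# Hypothetical per-column rules: flag the run when a P99 latency
# column exceeds the fixed limit (milliseconds).
validation_rules = {
    "P99 read": ValidationRule(fixed_limit=8),
    "P99 write": ValidationRule(fixed_limit=8),
}
```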

@kreuzerkrieg
Contributor Author

> If you have expectations for the metrics (like p99's below 8ms), consider using ValidationRule(fixed_limit=8), and similarly for other metrics. See sdcm.argus_results.ManagerRestoreBanchmarkResult for an example.

I strongly oppose this type of validation, as it may cause more harm than good. Occasionally, factors like "noisy neighbors" can cause failures, leading people to spend days troubleshooting. Moreover, this validation is specific to a particular AWS instance type; using any other instance type will result in failure.

@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from 70d7c22 to 22c48dc Compare December 29, 2024 14:40
Contributor

@roydahan roydahan left a comment


See my comments about the stress commands. Also, it needs to have a Jenkins pipeline.

Resolved review comments on: configurations/manager/100GB_dataset.yaml, mgmt_cli_test.py
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from bc63818 to 848b5e2 Compare December 30, 2024 17:12
@kreuzerkrieg
Contributor Author

> See my comments about the stress commands. Also, it needs to have a Jenkins pipeline.

What are we going to achieve by adding (cloning) a new pipeline?

@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from b9ef7fb to bb7483e Compare December 30, 2024 17:30
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from e90ffb6 to b5e40e0 Compare December 30, 2024 19:59
"cassandra-stress write cl=ALL n=26214400 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=52428801..78643200",
"cassandra-stress write cl=ALL n=26214400 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=78643201..104857600" ]

stress_cmd: "cassandra-stress mixed cl=QUORUM duration=10m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=100 fixed=17000/s' -col 'size=FIXED(1024) n=FIXED(1)'"
Contributor


What did you base the "17000/s" on?
A typical max-throughput test for a mixed workload using i4i.4xlarge achieves around 500K op/s.
4 loaders running 17K op/s each equals 68K op/s, which is about 13% of the max throughput.
(BTW, to achieve that we use 620 threads and bigger loaders; you can probably start with c7i.2xlarge.)

Lastly, I think that 10m may be too short to get consistent results.
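
(Spelling out that arithmetic in a quick sketch; all figures come from the comment above, not from new measurements:)

```python
# Check of the throughput figures quoted above.
loaders = 4
per_loader_rate = 17_000           # op/s per loader, from the stress_cmd under review
max_throughput = 500_000           # op/s, typical i4i.4xlarge mixed-workload max

total = loaders * per_loader_rate  # 68,000 op/s
print(f"{total:,} op/s = {total / max_throughput:.1%} of max")  # -> 68,000 op/s = 13.6% of max
```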

Contributor Author

@kreuzerkrieg kreuzerkrieg Jan 2, 2025


> What did you base the "17000/s" on? A typical max-throughput test for a mixed workload using i4i.4xlarge achieves around 500K op/s. 4 loaders running 17K op/s each equals 68K op/s, which is about 13% of the max throughput. (BTW, to achieve that we use 620 threads and bigger loaders; you can probably start with c7i.2xlarge.)

OK, then how do I calculate the number of threads and op/s to get to 6-7 ms latency?

> Lastly, I think that 10m may be too short to get consistent results.

How long would be considered OK-ish?

Contributor


> OK, then how do I calculate the number of threads and op/s to get to 6-7 ms latency?

Is your target to measure latency during the backup operation, or to measure the duration of the operation under a specific load?
If you want to measure latency, you don't know the latency :)
We usually measure latency under a workload that is 50% of max throughput.
In your case it would be something like "-rate 'threads=620 fixed=62500/s'".
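
(A quick sketch of where the 62500/s figure comes from, assuming the 500K op/s max-throughput number and the 4 loaders mentioned earlier in the thread:)

```python
# Derivation of the suggested per-loader rate; figures from the discussion above.
max_throughput = 500_000   # op/s, typical mixed-workload max on i4i.4xlarge
target_fraction = 0.5      # latency is usually measured at 50% of max throughput
loaders = 4

per_loader = max_throughput * target_fraction / loaders
print(f"-rate 'threads=620 fixed={per_loader:.0f}/s'")  # -> fixed=62500/s per loader
```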

Contributor


> How long would be considered OK-ish?

How long does the backup operation take?

Contributor Author


> > How long would be considered OK-ish?
>
> How long does the backup operation take?

10 min

Contributor


This should probably be located under performance, not under manager.

Contributor


Which brings me to the question: why didn't you just replicate the same test we have for latency_with_operations and replace the operation with a manager backup nemesis?
If needed, you could just define a new manager backup nemesis.
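
(A hypothetical sketch of such a nemesis, assuming the sdcm.nemesis convention of a Nemesis subclass whose disrupt() method runs a single operation; the class name and the disrupt_mgmt_backup helper are assumptions, not confirmed SCT API:)

```python
# Hypothetical sketch only; not confirmed SCT code.
from sdcm.nemesis import Nemesis

class MgmtBackupNemesis(Nemesis):
    """Runs a Scylla Manager backup as the 'operation' in a
    latency_with_operations-style test (illustrative)."""
    disruptive = False  # assumption: a backup should not be disruptive

    def disrupt(self):
        # Assumed helper: triggers a Manager backup task and waits for it,
        # so latency is sampled while the backup is running.
        self.disrupt_mgmt_backup()
```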

@regevran

regevran commented Jan 5, 2025

Please note that this test is meant to create a baseline: something to compare against when we run the same test based on Scylla core backup capabilities.
I prefer focusing on making sure that this test is OK in what it tests, and that it can be duplicated and changed into a core-Scylla test.

@regevran regevran closed this Jan 5, 2025
@regevran regevran reopened this Jan 5, 2025
@kreuzerkrieg kreuzerkrieg force-pushed the backup-baseline-with-readwrite branch from c9c7c9a to fd413b0 Compare January 5, 2025 13:08
@soyacz
Contributor

soyacz commented Jan 7, 2025

@regevran see, you can apply validation rules that will fail the test if a value exceeds the error threshold (configurable; see the validation-rule examples in other tests). For setting custom rules for latency_decorator_calculator, see the example.

BTW, after rebasing you can remove the word "mixed" from the test name and add workload_name: mixed to the config.

@scylladbbot

@kreuzerkrieg new branch branch-2025.1 was added, please add backport label if needed

Labels
backport/none Backport is not required
7 participants