test(backup): Measure read/write latency during backup #9605
base: master
Conversation
Added qa-maintainers to Reviewers
If you have expectations for the metrics (like p99s below 8 ms), consider using
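For illustration only, a minimal sketch of what such a latency gate could look like; the helper name, the summary-dict key, and the use of the 8 ms bound are assumptions for the example, not SCT's actual API:

```python
# Hypothetical post-run latency gate; the function name and result key
# are illustrative assumptions, not SCT's real interface.
P99_LIMIT_MS = 8.0  # example bound taken from the comment above

def validate_read_p99(stress_summary: dict) -> None:
    """Fail the run if the measured p99 read latency exceeds the limit."""
    p99_ms = float(stress_summary["read_p99_ms"])  # assumed result key
    if p99_ms > P99_LIMIT_MS:
        raise AssertionError(
            f"p99 read latency {p99_ms:.2f} ms exceeds {P99_LIMIT_MS} ms limit")
```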
I strongly oppose this type of validation, as it may cause more harm than good. Occasionally, factors like "noisy neighbors" can cause failures, leading people to spend days troubleshooting. Moreover, this validation is specific to a particular AWS instance type; using any other instance type will result in failure.
See my comments about the stress commands.
It also needs to have a Jenkins pipeline.
What are we going to achieve by adding (cloning) a new pipeline?
"cassandra-stress write cl=ALL n=26214400 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=52428801..78643200", | ||
"cassandra-stress write cl=ALL n=26214400 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=500 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=78643201..104857600" ] | ||
|
||
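As an aside, the `-pop seq` ranges above hand each loader a disjoint slice of one population. A small sketch of how such ranges can be derived, using the per-loader key count visible in the commands and the 4-loader setup mentioned later in this thread:

```python
# Derive disjoint -pop seq ranges so each of the 4 loaders writes its own
# slice of the population; the numbers mirror the write commands above.
KEYS_PER_LOADER = 26_214_400  # the n= value in each write command
LOADERS = 4                   # implied by the 104,857,600-key ceiling

for i in range(LOADERS):
    start = i * KEYS_PER_LOADER + 1
    end = (i + 1) * KEYS_PER_LOADER
    print(f"-pop seq={start}..{end}")

# With -col 'size=FIXED(1024)', the full population is roughly
# 104,857,600 keys * 1 KiB = 100 GiB of raw data before replication.
```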
stress_cmd: "cassandra-stress mixed cl=QUORUM duration=10m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=100 fixed=17000/s' -col 'size=FIXED(1024) n=FIXED(1)'"
What did you base the "17000/s" on?
A typical max-throughput test for a mixed workload using i4i.4xlarge achieves around 500K op/s.
Four loaders running 17K op/s each add up to 68K op/s, which is about 13% of the max throughput.
(BTW, to achieve that we use 620 threads and bigger loaders - you can probably start with c7i.2xlarge.)
Lastly, I think that 10m may be too short to get consistent results.
> What did you base the "17000/s" on? A typical max-throughput test for a mixed workload using i4i.4xlarge achieves around 500K op/s. Four loaders running 17K op/s each add up to 68K op/s, which is about 13% of the max throughput. (BTW, to achieve that we use 620 threads and bigger loaders - you can probably start with c7i.2xlarge.)

OK, then how do I calculate the number of threads and op/s to get to 6-7 ms latency?

> Lastly, I think that 10m may be too short to get consistent results.

How long would be considered OK-ish?
> OK, then how do I calculate the number of threads and op/s to get to 6-7 ms latency?

Is your target to measure latency during the backup operation, or to measure the duration of the operation under a specific load?
If you want to measure latency, you don't know the latency in advance :)
We usually measure latency under a workload that is 50% of max throughput.
In your case it would be something like "-rate 'threads=620 fixed=62500/s'".
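Writing out the arithmetic behind both comments (the 500K op/s max-throughput figure and the 4-loader count are taken from the thread above):

```python
# Quick check of the rates discussed in this thread.
MAX_THROUGHPUT = 500_000  # op/s, typical mixed max on i4i.4xlarge (per comment)
LOADERS = 4

current_total = LOADERS * 17_000              # 68,000 op/s
print(f"current: {current_total / MAX_THROUGHPUT:.1%} of max")  # ~13.6%

target_total = MAX_THROUGHPUT // 2            # 50% of max = 250,000 op/s
per_loader = target_total // LOADERS          # 62,500 op/s
print(f"suggested: -rate 'threads=620 fixed={per_loader}/s' per loader")
```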
> How long would be considered OK-ish?

How long does the backup operation take?
> How long would be considered OK-ish?
> How long does the backup operation take?

10 min.
This should probably be located under performance and not under manager.
Which brings me to the question: why didn't you just replicate the same test we have for latency_with_operations and replace the operation with a manager backup nemesis?
If needed, you could just define a new manager backup nemesis.
Please note that this test is meant to create a baseline, something to compare against when we run the same test based on Scylla core backup capabilities.
* 50% reads and 50% writes.
* Compaction enabled.
* 75%-85% CPU utilization.
* Expect up to 6-7 ms P99 read latencies.
* Expect no more than 10 ms read latency during backup.
@regevran see: you can apply validation rules that will fail the test if a value exceeds the error threshold (configurable; see validation-rule examples in other tests). For setting custom rules for latency_decorator_calculator, see example . BTW, after rebase you can remove
@kreuzerkrieg new branch
RW latency during backup using rclone

Test requirements

Argus run:
https://argus.scylladb.com/tests/scylla-cluster-tests/b6a76dc0-a6a8-47b8-9dc0-5eff576866e5

Results:
We can safely say that neither read nor write latency is affected by the backup.

@bhalevy / @regevran FYI