`longevity-large-partition-200k-pks-4days-gce-test` test load should be revised/reworked to make the test stable #9640

dimakr · 2025-01-02T13:01:39Z

The longevity-large-partition-200k-pks-4days-gce-test test failed in most of its 2024 runs: only 1 of 50 runs succeeded on scylla-master, and only 3 of 18 on enterprise-2024.1.
The symptoms are always the same - scylla_bench queries time out, resulting in a ConsistencyError event. Example error:

2024-12-02 07:49:07.131 <2024-12-02 07:49:06.000>: (ScyllaBenchLogEvent Severity.ERROR) period_type=one-time event_id=bee0ee67-379a-4125-90bd-1f915d973abd during_nemesis=RepairStreamingErr: type=ConsistencyError regex=received only line_number=43130 node=Node longevity-large-partitions-200k-pks-loader-node-85f2bab2-0-5 [35.227.73.24 | 10.142.1.62]
2024/12/02 07:49:06 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ?  LIMIT 10 " values=[1391 37146] consistency=QUORUM] || ERROR: 10 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
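For reference, a minimal sketch of how the "received only 1 responses from 2" part of that message can be read. The regex and helper names are hypothetical, not part of SCT or scylla-bench. The replication factor of 3 is an assumption (the issue does not state it); it is consistent with 2 required responses at CL=QUORUM, though RF=2 would give the same number.

```python
import re


def quorum(replication_factor: int) -> int:
    """Replicas that must respond for CL=QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


# Hypothetical pattern matching the timeout text in the scylla-bench log line above.
TIMEOUT_RE = re.compile(
    r"received only (?P<received>\d+) responses from (?P<required>\d+) CL=(?P<cl>\w+)"
)


def parse_timeout(line: str):
    """Return received/required response counts and the CL, or None if no match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "received": int(match["received"]),
        "required": int(match["required"]),
        "cl": match["cl"],
    }


line = (
    "ERROR: 10 attempts applied: Operation timed out for scylla_bench.test "
    "- received only 1 responses from 2 CL=QUORUM."
)
info = parse_timeout(line)
print(info)

# Assumption: RF=3, so QUORUM needs 2 replicas; only 1 answered in time.
print("missing replicas:", quorum(3) - info["received"])
```

In other words, one replica short of quorum answered within the timeout on each of the 10 attempts.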

This issue was originally reported against ScyllaDB, with a focus on enterprise: https://github.com/scylladb/scylla-enterprise/issues/4667. The initial assumption was that a regression between versions 2024.1.3 and 2024.1.4 caused performance degradation and the subsequent s-b timeout failures.
However, after a few investigation cycles in that ticket, it was determined that the issue also affects earlier versions of 2024.1.

We need to revise and adjust the test configuration to improve its stability. Possible actions include:

- revise and fine-tune the load in the test configuration
- increase the retry count in the s-b command
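To get a feel for what a higher retry count would buy, here is a rough sketch of the worst-case time a single query can spend retrying before it is reported as failed. Everything in it is an assumption for illustration: the doubling backoff policy, the 5 s per-request timeout, and the 80 ms to 1 s sleep bounds are not taken from the test configuration. Only the figure of 10 attempts comes from the log above.

```python
def retry_budget_seconds(attempts, request_timeout, min_interval, max_interval):
    """Worst-case seconds one query can spend before failing for good.

    Assumed model: every attempt runs into the full request timeout, and the
    client sleeps between attempts, doubling the sleep from `min_interval`
    up to `max_interval`.
    """
    total = 0.0
    interval = min_interval
    for attempt in range(attempts):
        total += request_timeout
        if attempt < attempts - 1:  # no sleep after the final attempt
            total += interval
            interval = min(interval * 2, max_interval)
    return total


# Illustrative values only: 5 s timeout, 80 ms to 1 s backoff.
for attempts in (10, 20, 40):
    budget = retry_budget_seconds(attempts, 5.0, 0.08, 1.0)
    print(f"{attempts} attempts -> up to {budget:.1f} s per query")
```

Under this model the budget grows roughly linearly with the retry count, so more retries mostly help if the replicas recover within that window (for example, once a nemesis finishes). If the load itself is too heavy, the load has to be tuned instead.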

A few takeaways to start from, based on the latest updates in https://github.com/scylladb/scylla-enterprise/issues/4667:

https://github.com/scylladb/scylla-enterprise/issues/4667#issuecomment-2371939780

so it uses a MV and a large partition - this is something we do not recommend to our users and something that has known issues like https://github.com/scylladb/scylladb/issues/8873

https://github.com/scylladb/scylla-enterprise/issues/4667#issuecomment-2376307145

1. Check whether the regression is absent in 2024.1.0; a significant write regression was introduced in 2024.1.1.

2. If we still blame MV, and there is documentation somewhere stating that MV isn't recommended with large-partition tables, we can use SkipPerIssue.


fruch · 2025-01-02T16:17:29Z

@roydahan

This is a case you refactored. Any direction on what we should do with this one? Which fine-tuning do you suggest?

fruch · 2025-01-02T16:18:11Z

@dimakr the status on master doesn't say a lot, because if one looks at it, most of the failures are not like what you are describing.

roydahan · 2025-01-02T20:00:13Z

I already commented on what I think we should do for 2024.1, and it's in the description above.

github-actions bot assigned dimakr Jan 2, 2025

dimakr removed their assignment Jan 2, 2025
