longevity-large-partition-200k-pks-4days-gce-test test load should be revised/reworked to make the test stable #9640

Open
dimakr opened this issue Jan 2, 2025 · 3 comments


dimakr commented Jan 2, 2025

The longevity-large-partition-200k-pks-4days-gce-test test failed in most of its 2024 runs: only 1 of 50 runs succeeded in scylla-master and only 3 of 18 in enterprise-2024.1.
The symptoms are always the same: scylla_bench queries time out, resulting in a ConsistencyError event. Example error:

2024-12-02 07:49:07.131 <2024-12-02 07:49:06.000>: (ScyllaBenchLogEvent Severity.ERROR) period_type=one-time event_id=bee0ee67-379a-4125-90bd-1f915d973abd during_nemesis=RepairStreamingErr: type=ConsistencyError regex=received only line_number=43130 node=Node longevity-large-partitions-200k-pks-loader-node-85f2bab2-0-5 [35.227.73.24 | 10.142.1.62]
2024/12/02 07:49:06 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ?  LIMIT 10 " values=[1391 37146] consistency=QUORUM] || ERROR: 10 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
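
To compare failed runs, it may help to pull the response counts out of such lines. Below is a minimal sketch, not part of SCT: the `parse_timeout` helper and its regex are hypothetical and only assume the message tail has the shape shown in the example above.

```python
import re

# Matches the tail of the s-b timeout message quoted above.
# The exact wording is taken from the example line; other s-b versions may differ.
TIMEOUT_RE = re.compile(
    r"received only (?P<received>\d+) responses from (?P<required>\d+) CL=(?P<cl>\w+)"
)


def parse_timeout(line):
    """Return (received, required, consistency level), or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return int(match["received"]), int(match["required"]), match["cl"]


sample = (
    '2024/12/02 07:49:06 [query statement="SELECT pk, ck, v FROM scylla_bench.test '
    'WHERE pk = ? AND ck >= ?  LIMIT 10 " values=[1391 37146] consistency=QUORUM] || '
    "ERROR: 10 attempts applied: Operation timed out for scylla_bench.test - "
    "received only 1 responses from 2 CL=QUORUM."
)
print(parse_timeout(sample))
```

For the sample line this reports that 1 of the 2 replicas required for QUORUM answered before the timeout.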

This issue was originally reported for ScyllaDB, with focus on enterprise - https://github.com/scylladb/scylla-enterprise/issues/4667. Initial assumptions suggested that a regression occurred between versions 2024.1.3 and 2024.1.4, causing performance degradation and subsequent s-b timeout failures.
However, after a few investigation cycles in that ticket, it was determined that the issue also affects earlier versions of 2024.1.

We need to revise and adjust the test configuration to improve its stability. Possible actions include:

  • revise and fine-tune the load in the test configuration
  • increase the retry count in the s-b command
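
As a rough illustration of the second action, here is a sketch of a helper that sets retry flags on an s-b command line. It assumes the s-b version used by the test accepts `-retry-number` and `-retry-interval` (the "10 attempts applied" in the error suggests the retry count is currently 10). The `with_retry_settings` helper, the base command and the chosen values are hypothetical, not taken from the test configuration.

```python
def with_retry_settings(sb_cmd, retry_number, retry_interval):
    """Return sb_cmd with its retry flags replaced by the given values.

    Assumption: flags are written in the -name=value form.
    """
    retry_prefixes = ("-retry-number=", "-retry-interval=")
    # Drop any retry flags already present so the new values are the only ones.
    tokens = [tok for tok in sb_cmd.split() if not tok.startswith(retry_prefixes)]
    tokens.append(f"-retry-number={retry_number}")
    tokens.append(f"-retry-interval={retry_interval}")
    return " ".join(tokens)


# Hypothetical base command, not the one from the test yaml.
base_cmd = "scylla-bench -workload=uniform -mode=read -timeout=180s -retry-number=10"
print(with_retry_settings(base_cmd, retry_number=30, retry_interval="1s"))
```

Raising the retry count only hides timeouts that recover within the retry window, so it would likely need to go together with revising the load itself.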

A few takeaways to start from, based on the latest updates in https://github.com/scylladb/scylla-enterprise/issues/4667:

so it uses an MV and a large partition - this is something we do not recommend to our users and something that has known issues like https://github.com/scylladb/scylladb/issues/8873
1. Check whether the regression is absent in 2024.1.0; a significant write regression was introduced in 2024.1.1.
2. If we still blame MV and there is documentation somewhere stating that MV isn't recommended with large-partition tables, we can use SkipPerIssue.

dimakr removed their assignment Jan 2, 2025


fruch commented Jan 2, 2025

@roydahan

this is a case you refactored. Any directions on what we should do with this one? Which fine-tuning do you suggest?


fruch commented Jan 2, 2025

@dimakr the status on master doesn't say a lot, because if one looks at it, most of the failures are not like what you are describing


roydahan commented Jan 2, 2025

I commented what I think we should do on 2024.1, and it's in the description above.
