
self-test disk test enhancements #20590

Merged: 3 commits into redpanda-data:dev on Jul 19, 2024

Conversation

@travisdowns (Member) commented on Jun 27, 2024:

rpk: add additional disk self tests

Add 16K block size disk tests, a common block size written by Redpanda,
at varying IO depths: 1, 8 and 32 times the shard count (the
multiplication by the shard count happens in Redpanda and is
inevitable).

This will help better assess the performance of block storage that is
a bit outside the usual, in particular how it responds to io depth
changes.

Additionally, add a 4K test which is the same as the existing one but
with dsync off. This is critical for assessing the impact of fdatasync
on the storage layer: locally, on my consumer SSD this makes a
257x difference (!!) in throughput, though the effect is much more muted,
perhaps close to zero, on other SSD types.

On the redpanda side, when we complete a self-test the API returns
info about the run, including an info field which currently says "write run"
(for a disk test). Enhance this to include information about whether
dsync was enabled and the total io depth (which is the client-specified
parallelism value times the number of shards).
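
For illustration only, here is a rough sketch of what one of the added 16K entries might look like in rpk. Only the Name and Type fields appear verbatim in the review diff below; BlockSize, Parallelism, DSync, and SkipRead are assumed field names inferred from the discussion, not the actual diff:

    adminapi.DiskcheckParameters{
        Name: "16KB sequential r/w, high io depth",
        // The fields below are assumptions for illustration, not the actual diff.
        BlockSize:   16 << 10, // 16 KiB, a common block size written by Redpanda
        Parallelism: 64,       // requested io depth
        DSync:       true,     // issue fdatasync after each write
        SkipRead:    false,    // most of the new write-focused entries skip the read pass
        Type:        adminapi.DiskcheckTagIdentifier,
    },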

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Add more cases to the rpk disk self-test to better probe write performance at various IO depths, and at 16K block sizes. Return more information about the specifics of the test in the output.

@StephanDollberg (Member) left a comment:

This will slow down the test quite a bit but I guess that's not really a problem

        Type: adminapi.DiskcheckTagIdentifier,
    },
    adminapi.DiskcheckParameters{
        Name: "16KB sequential r/w, high io depth",
A reviewer (Member) commented:

Did you intentionally not add something like 4k @ 256 iodepth?

@travisdowns (Member, Author) replied on Jun 28, 2024:

Are you asking more about "why not 4K" or "why not 256 iodepth"?

In any case it was intentional but open to ideas here. One thing to note is that the parallelism factor here is then multiplied by the shard count, so on modest 8-shard nodes we are already at a very high 512 io depth for parallelism=64, which IME is larger than what you need to get max throughput even on large local SSD configurations (though of course this may not be the case on some other storage configurations, especially high-throughput, longer-latency network-attached storage).

I don't actually like this multiplication because it (a) adds a confounding factor when comparing results against different clusters which may have different shard counts (but at least now we see the effective iodepth in the output) and (b) it means you can't run an iodepth=1 test except on a cluster with 1-shard nodes.

About 4K vs 16K, my goal was to add a 16K test to see the difference between 4K and 16K, i.e., how much performance varies in the range of block sizes Redpanda is already writing with default settings. Then I also wanted to add a "series" of varying-iodepth tests, which I sort of arbitrarily chose to be the 16K one. I didn't want to do both, to keep the number of tests down, and I think maybe I favored 16K over 4K in part because 4K already had parallelism=2, and I wanted 1 and didn't want to change the existing 4K test, to keep some continuity with old results.

That said, very open to changing it. What is your view on the ideal series of tests to run?

A reviewer (Member) replied:

> One thing to note is that the parallelism factor here is then multiplied by the shard count

Wait but right now this all happens on shard zero only. Are you saying we still multiply it by the shard count?

> That said, very open to changing it. What is your view on the ideal series of tests to run?

I don't feel strongly. I'm just really coming from the classic 4K test, and I guess it matches the minimum amount we write.

I guess the 512KiB test is actually the least relevant one for RP as we never write sizes bigger than 16KiB (only when fetching from TS).

@travisdowns (Member, Author) replied on Jun 29, 2024:

> Wait but right now this all happens on shard zero only. Are you saying we still multiply it by the shard count?

No, I was simply mistaken. I thought this ran on all shards, but as you say it seems to run on only one shard. I was thrown off especially by this comment and also this code and comment. Perhaps vestigial?

So I will adjust the numbers to hit higher io depths, and maybe add 1 more test.

> Just really coming from the classic 4k test and I guess it matches the min amount we write.

I'll change it to 4K.

> I guess the 512KiB test is actually the least relevant one for RP as we never write sizes bigger than 16KiB (only when fetching from TS).

It's definitely the least useful for evaluating RP performance at the default settings. As a test to understand more about the disk, especially disks with characteristics different from the most common ones we run on, I think it's fine because it is a "max throughput" test, and if it gets a much higher number than the other tests with small blocks then we've learned something.

r-vasquez previously approved these changes Jun 27, 2024
@travisdowns (Member, Author) commented on Jun 28, 2024:

> This will slow down the test quite a bit but I guess that's not really a problem

If we want more data points and we want to keep the same duration per test, I don't really see an alternative to that. However, we could always reduce the default per-test duration if the overall current duration (2 minutes, at the default per-test duration) is a "sweet spot" or something like that.

Note that most of these newly added tests have skipRead=true, so they take half the time of the existing tests, and the time expansion is actually half of what you'd guess by looking at it. The increase is 4 tests -> 8 tests, so 2 minutes to 4 minutes at the default duration.

travisdowns reopened this on Jun 28, 2024
@travisdowns (Member, Author):
Stupid "close with comment" button sitting there looking so pressable.

@travisdowns (Member, Author):
Updated in push: 0c1753b

  • Removed the io_depth() method and stopped assuming the parallelism was multiplied by the shard count.
  • Changed the "io depth" series from 16K to 4K blocks; except for the iodepth=1 test, only the write test is done. Kept one 16K r/w test at 64 io depth. The no-dsync test is at 4K, 64 io depth.
  • Removed ", dsync" from the description of the 512K r/w test since it doesn't make sense for the "read" part.
  • Fixed test names that said r/w when the tests were actually write-only.
  • Aligned --help output with these changes.

Example output after this change:

NODE ID: 0 | STATUS: IDLE
=========================
NAME        512KB sequential r/w
INFO        write run (iodepth: 4, dsync: true)
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1009ms
IOPS        425 req/sec
THROUGHPUT  212.5MiB/sec
LATENCY     P50     P90      P99      P999     MAX
            9215us  11775us  14847us  21503us  21503us

NAME        512KB sequential r/w
INFO        read run
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1000ms
IOPS        10147 req/sec
THROUGHPUT  4.955GiB/sec
LATENCY     P50    P90    P99    P999    MAX
            247us  639us  799us  1087us  1215us

NAME        4KB sequential r/w, low io depth
INFO        write run (iodepth: 1, dsync: true)
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1002ms
IOPS        414 req/sec
THROUGHPUT  1.617MiB/sec
LATENCY     P50     P90     P99     P999    MAX
            2431us  2559us  2687us  5887us  5887us

NAME        4KB sequential r/w, low io depth
INFO        read run
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1000ms
IOPS        621714 req/sec
THROUGHPUT  2.372GiB/sec
LATENCY     P50   P90   P99   P999  MAX
            1us   1us   2us   23us  543us

NAME        4KB sequential write, medium io depth
INFO        write run (iodepth: 8, dsync: true)
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1014ms
IOPS        523 req/sec
THROUGHPUT  2.043MiB/sec
LATENCY     P50      P90      P99      P999     MAX
            15871us  16383us  20479us  20479us  21503us

NAME        4KB sequential write, high io depth
INFO        write run (iodepth: 64, dsync: true)
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1115ms
IOPS        607 req/sec
THROUGHPUT  2.371MiB/sec
LATENCY     P50       P90       P99       P999      MAX
            118783us  126975us  139263us  139263us  180223us

NAME      4KB sequential write, very high io depth
TYPE      disk
TEST ID   931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS  0
DURATION  0ms
ERROR     IO Queue depth (parallelism) out of range, min is 1, max 256

NAME        4KB sequential write, no dsync
INFO        write run (iodepth: 64, dsync: false)
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1000ms
IOPS        366771 req/sec
THROUGHPUT  1.399GiB/sec
LATENCY     P50    P90    P99    P999   MAX
            167us  231us  303us  735us  1151us

NAME        16KB sequential r/w, high io depth
INFO        write run (iodepth: 64, dsync: false)
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1000ms
IOPS        195040 req/sec
THROUGHPUT  2.976GiB/sec
LATENCY     P50    P90    P99    P999   MAX
            319us  367us  431us  479us  543us

NAME        16KB sequential r/w, high io depth
INFO        read run
TYPE        disk
TEST ID     931e192d-2133-4304-b093-3586d18b0c56
TIMEOUTS    0
DURATION    1000ms
IOPS        197272 req/sec
THROUGHPUT  3.01GiB/sec
LATENCY     P50    P90    P99    P999   MAX
            335us  367us  463us  639us  1023us

The help output:

Starts one or more benchmark tests on one or more nodes
of the cluster. Available tests to run:

* Disk tests:
  * Throughput test: 512 KB messages, sequential read/write
    * Uses larger request message sizes and a deeper I/O queue depth to write/read more bytes in a shorter amount of time, at the cost of IOPS/latency.
  * Latency and io depth tests: 4 KB messages, sequential read/write, varying io depth
    * Uses small IO sizes and varying levels of parallelism to determine the relationship between io depth and IOPS
        * Includes one test without using dsync (fdatasync) on each write to establish the cost of dsync
  * 16 KB test
    * One high io depth test at 16 KB to reflect performance at Redpanda's default chunk size

@travisdowns (Member, Author):
/dt

1 similar comment
@travisdowns (Member, Author):
/dt

@travisdowns (Member, Author):
/ci-repeat 1

StephanDollberg previously approved these changes Jul 2, 2024
dotnwat previously approved these changes Jul 2, 2024
r-vasquez previously approved these changes Jul 2, 2024
kbatuigas previously approved these changes Jul 3, 2024
@kbatuigas left a comment:

Minor copy edits for consistency with public docs

3 resolved review comments on src/go/rpk/pkg/cli/cluster/selftest/start.go (outdated)
travisdowns dismissed stale reviews from kbatuigas and r-vasquez via 5944298 on July 4, 2024 at 04:02
@travisdowns (Member, Author):
/ci-repeat 1

@vbotbuildovich (Collaborator) commented on Jul 5, 2024:

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812c-f3ce-4e04-840f-426fdcd3fac9:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812c-f3cf-4a1d-97e8-aec9c71db760:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812c-f3d1-4683-8a5f-77831d2deecd:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812c-f3cc-455c-b0d2-212dcdab44f1:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812e-d95c-486b-9d4d-89b31bda8c5b:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812e-d95e-4eb2-b6b1-7dd3881feba2:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812e-d957-4a96-b015-473add4dc93b:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51128#0190812e-d959-46e9-b67c-0e9571b17b33:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51441#0190a8a4-3b07-4103-86be-2e71180e4479:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51441#0190a8a4-3b05-4dc7-9d4b-4878bd5eb84b:
pandatriage cache was not found

skipped ducktape retry in https://buildkite.com/redpanda/redpanda/builds/51441#0190a8bc-745d-4058-9d93-e84bc9906f5d:
pandatriage cache was not found

@travisdowns (Member, Author):
OK these remaining errors seem legit, looking.

@travisdowns (Member, Author):
365233c is a pure rebase.

f01195a changes the max iodepth in the new tests from 512 to 256, as RP has a hardcoded limit of 256 in the self-test code. I also considered increasing this limit from 256 to 512 on the RP side, but then we'd have issues running self-test in cases where the RPK version was newer than the Redpanda version, which is a supported and, I think, fairly common scenario, so I decided to change RPK instead.

This should fix the test failures.
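
For reference, a minimal Go sketch of the constraint in play; the constant and helper names are assumptions for illustration only, since the actual fix simply lowered rpk's preset parallelism values from 512 to 256:

    package selftest // placement is illustrative only

    import "fmt"

    // Redpanda's self-test rejects io queue depths outside [1, 256]:
    //   "IO Queue depth (parallelism) out of range, min is 1, max 256"
    const maxDiskSelfTestParallelism = 256 // assumed constant name

    // validateParallelism sketches the check rpk's presets must satisfy to
    // work against current (and older) Redpanda versions.
    func validateParallelism(p uint) error {
        if p < 1 || p > maxDiskSelfTestParallelism {
            return fmt.Errorf("io queue depth (parallelism) %d out of range, min is 1, max %d",
                p, maxDiskSelfTestParallelism)
        }
        return nil
    }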

@vbotbuildovich (Collaborator) commented on Jul 11, 2024:

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a306-08a6-4d35-b300-695b00ac2af8:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=False.remote_write=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a306-08a8-4562-a22d-e76610db099a:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=False.remote_write=True"
"rptest.tests.self_test_test.SelfTestTest.test_self_test_node_crash"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a306-08aa-43eb-8b38-c266ab6f395f:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=True.remote_write=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a306-08ac-4d5b-81cd-1ebe01054b59:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=True.remote_write=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a307-bacc-47d3-ab43-fa4842e939fd:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=False.remote_write=True"
"rptest.tests.self_test_test.SelfTestTest.test_self_test_node_crash"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a307-baca-43b9-b035-d624e18befca:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=False.remote_write=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a307-bac8-4f3f-b1d7-0a027f3a1d46:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=True.remote_write=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/51367#0190a307-bace-4ce7-865c-f95725cb07e4:

"rptest.tests.self_test_test.SelfTestTest.test_self_test.remote_read=True.remote_write=False"

@travisdowns (Member, Author):
Hopefully this last push fixes all the failures. All the tests were passing for me locally, but it turned out that was just because of https://github.com/redpanda-data/vtools/pull/2950 not rebuilding my RPK.

@travisdowns (Member, Author):
All spurious failures, retrying.

@travisdowns (Member, Author):
/ci-repeat 1

1 similar comment
@travisdowns (Member, Author):
/ci-repeat 1

@travisdowns (Member, Author):
Spurious GH download failure in last run.

@travisdowns (Member, Author):
/ci-repeat 1

@travisdowns (Member, Author):
Last failure was a merge conflict, fixed. Hopefully this CI run is the one.

Add 16K block size disk tests, a common block size written by Redpanda,
at varying IO depths: 1, 8 and 32 times the shard count (the
multiplication by the shard count happens in Redpanda and is
inevitable).

This will help better assess the performance of block storage that is
a bit outside the usual, in particular how it responds to io depth
changes.

Additionally, add a 4K test which is the same as the existing one but
with dsync off. This is critical for assessing the impact of fdatasync
on the storage layer: locally, this makes a 257x difference in
throughput, though the effect is much more muted, perhaps close to zero,
on other SSD types.

Slightly rename the tests to remove extraneous info.

Issue redpanda-data/core-internal#1266.
Set the name to unspecified, which is a more accurate reflection of the
situation when the caller doesn't set a name.

Fix a comment which said 1G but was 10G.
When we complete a self test the API returns info about the run
including an info field which says "write run" currently (for a disk
test). Enhance this to include information about whether dsync
was enabled and the total io depth.

Issue redpanda-data/core-internal#1266.
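
The resulting INFO string, as seen in the example output earlier in this thread, has the form "write run (iodepth: N, dsync: true|false)". A hedged Go sketch of that format follows; the real string is built on the Redpanda side, so the function name and placement here are purely illustrative:

    package main

    import "fmt"

    // formatDiskRunInfo is illustrative only: Redpanda, not rpk, builds this
    // string, but the format matches the INFO column shown above.
    func formatDiskRunInfo(ioDepth int, dsync bool) string {
        return fmt.Sprintf("write run (iodepth: %d, dsync: %t)", ioDepth, dsync)
    }

    func main() {
        fmt.Println(formatDiskRunInfo(64, false)) // write run (iodepth: 64, dsync: false)
    }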
@travisdowns (Member, Author):
bd8b94d is to fix yet another merge conflict (what's up with my luck on this change?).

travisdowns merged commit 080ac33 into redpanda-data:dev on Jul 19, 2024
27 checks passed
6 participants