[K8S] nodetool_enospc nemesis (IOChaos experiment) causes target Scylla service to exit with code 1 and be restarted #6327

Open
1 of 3 tasks
vponomaryov opened this issue Jul 4, 2023 · 10 comments
Labels
Bug Something isn't working right scylla-operator

@vponomaryov
Contributor

vponomaryov commented Jul 4, 2023

Prerequisites

  • Are you rebased to master ?
  • Is it reproducible ?
  • Did you perform a cursory search if this issue isn't opened ?

Versions

  • SCT: master
  • scylla: 2022.2.9-20230618.843304f9f734
  • Operator Image: scylladb/scylla-operator:1.9.0-rc.2

Logs

  • test_id: 11a8f5f4-ac7b-405c-bc23-f6c527afc4b9
  • job log: Argus, CI job

Description

The SCT logs show the following:

< t:2023-07-03 15:12:25,418 f:nemesis.py      l:4403 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Start disruption at `2023-07-03 15:12:25.418387`
...
< t:2023-07-03 15:12:26,716 f:chaos_mesh.py   l:152  c:sdcm.utils.k8s.chaos_mesh p:DEBUG > Starting a IOChaos experiment io-fault-sct-cluster-us-east1-b-us-east1-0-03-15.12.26
...
< t:2023-07-03 15:12:26,719 f:__init__.py     l:344  c:sdcm.utils.k8s       p:DEBUG > Processing '/tmp/tmpafbdmvwy.yaml' file.
...
< t:2023-07-03 15:12:26,737 f:local_cmd_runner.py l:62   c:LocalCmdRunner       p:DEBUG > Running command "kubectl --cache-dir=/home/ubuntu/sct-results/20230703-113016-202915/.kube/http-cache apply  -f /tmp/tmputycfueg"...
< t:2023-07-03 15:12:27,276 f:base.py         l:222  c:LocalCmdRunner       p:DEBUG > iochaos.chaos-mesh.org/io-fault-sct-cluster-us-east1-b-us-east1-0-03-15.12.26 created
< t:2023-07-03 15:12:27,290 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230703-113016-202915/.kube/http-cache apply  -f /tmp/tmputycfueg" finished with status 0
< t:2023-07-03 15:12:27,290 f:chaos_mesh.py   l:158  c:sdcm.utils.k8s.chaos_mesh p:INFO  > 'IOChaos' experiment 'io-fault-sct-cluster-us-east1-b-us-east1-0-03-15.12.26' has started
< t:2023-07-03 15:12:27,290 f:chaos_mesh.py   l:190  c:sdcm.utils.k8s.chaos_mesh p:DEBUG > waiting until 'io-fault-sct-cluster-us-east1-b-us-east1-0-03-15.12.26' experiment ends...
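
For readers unfamiliar with Chaos Mesh, the sketch below shows roughly what an IOChaos manifest that injects ENOSPC could look like. It is not the manifest SCT generated (the contents of the temporary YAML file are not shown in this issue). The function name, the `scylla` namespace, the pod name, and the duration are assumptions made for illustration; the field names follow the Chaos Mesh IOChaos spec.

```python
import json


def build_enospc_iochaos(name, namespace, pod_name, volume_path, duration="300s"):
    """Return an IOChaos manifest (as a dict) that fails I/O calls with ENOSPC.

    Hypothetical helper: the values SCT really uses are not shown in the issue.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "IOChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "fault",  # return an error instead of performing the I/O
            "errno": 28,  # ENOSPC on Linux
            "mode": "one",
            "selector": {"pods": {namespace: [pod_name]}},
            "volumePath": volume_path,
            "path": f"{volume_path}/**/*",
            "percent": 100,
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Namespace, pod name, and duration below are illustrative assumptions.
    manifest = build_enospc_iochaos(
        name="io-fault-example",
        namespace="scylla",
        pod_name="sct-cluster-us-east1-b-us-east1-0",
        volume_path="/var/lib/scylla",
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The output can be saved to a file and passed to `kubectl apply -f`, in the same way as the temporary file in the log above.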

And the logs of the target node:

INFO  2023-07-03 15:11:50,583 [shard 3] compaction - [Compact keyspace1.standard1 e7521260-19b3-11ee-8e68-8ada78469dbb] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-1e4aade0199b11eeac13ea102ed4893c/me-1536-big-Data.db:level=0]. 677MB to 361MB (~53% of original) in 13997ms = 25MB/s. ~2670848 total partitions merged to 1426333.
ERROR 2023-07-03 15:12:28,271 [shard 1] seastar - io_submit: Operation not supported
ERROR 2023-07-03 15:12:28,271 [shard 4] seastar - io_submit: Operation not supported
2023-07-03 15:12:28,409 INFO exited: scylla (exit status 1; not expected)
2023-07-03 15:12:29,412 INFO spawned: 'scylla' with pid 9552
Scylla version 2022.2.9-0.20230618.843304f9f734 with build-id a34753ee38bccbaf461e04ae0e63e17afe45e048 starting ...

We see that the Scylla service exits with code 1 in reaction to the IOChaos experiment.

The nemesis is marked as passed, but the test run stops after this error.
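
A minimal sketch of how the two symptoms above could be picked out of a node log automatically. The function name and the returned dictionary layout are made up for this example; the regular expressions are written against the exact log lines quoted in this issue and may need adjusting for other log formats.

```python
import re

# Patterns derived from the log lines quoted in this issue.
IO_SUBMIT_ERROR = re.compile(r"seastar - io_submit: (?P<reason>.+)$")
UNEXPECTED_EXIT = re.compile(
    r"exited: scylla \(exit status (?P<code>\d+); not expected\)"
)


def summarize_node_log(lines):
    """Collect io_submit failures and unexpected Scylla exits from log lines."""
    summary = {"io_submit_errors": [], "unexpected_exit_codes": []}
    for line in lines:
        match = IO_SUBMIT_ERROR.search(line)
        if match:
            summary["io_submit_errors"].append(match.group("reason").strip())
            continue
        match = UNEXPECTED_EXIT.search(line)
        if match:
            summary["unexpected_exit_codes"].append(int(match.group("code")))
    return summary


if __name__ == "__main__":
    sample = [
        "ERROR 2023-07-03 15:12:28,271 [shard 1] seastar - io_submit: Operation not supported",
        "2023-07-03 15:12:28,409 INFO exited: scylla (exit status 1; not expected)",
    ]
    print(summarize_node_log(sample))
```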

Screenshot from 2023-07-04 13-21-27

Steps to Reproduce

  1. Deploy scylla-operator on GKE
  2. Deploy 3-member ScyllaCluster
  3. Run the disrupt_enospc nemesis
  4. See error

Expected behavior: Scylla service must continue to run

Actual behavior: Scylla service immediately exits with code 1.

@vponomaryov
Contributor Author

@soyacz have you seen this problem?

@vponomaryov changed the title from "nodetool_enospc nemesis (IOChaos experiment) causes target Scylla service to exit with code 1 and be restarted" to "[K8S] nodetool_enospc nemesis (IOChaos experiment) causes target Scylla service to exit with code 1 and be restarted" on Jul 4, 2023
@vponomaryov added the scylla-operator and Bug (Something isn't working right) labels on Jul 4, 2023
@fruch
Contributor

fruch commented Jul 4, 2023

> @soyacz have you seen this problem?

Sounds like the expected flow: if Scylla dies on ENOSPC, we should wait for it to come back up.
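
The "wait for it to be back up" idea can be sketched as a generic polling helper. This is not SCT's own wait utility; `wait_for` and the `node_is_up` predicate mentioned below are hypothetical names used only for illustration.

```python
import time


def wait_for(predicate, timeout, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns a truthy value or timeout seconds pass.

    Returns True if the predicate succeeded, False if the deadline was reached.
    The clock and sleep parameters exist so the helper can be tested without
    real delays.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    attempts = []

    def node_is_up():
        # Stand-in for a real health check (for example, a CQL port probe).
        attempts.append(1)
        return len(attempts) >= 3

    print(wait_for(node_is_up, timeout=5, interval=0.01))
```

A nemesis built this way would treat the restart as part of the expected flow and fail only if the node does not recover within the timeout.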

@soyacz
Contributor

soyacz commented Jul 4, 2023

seastar - io_submit: Operation not supported - I haven't seen this before; maybe we were lucky until now (or I missed it). Maybe the mock used by Chaos Mesh does not support this operation and fails this way.

Nevertheless, Scylla restarts, hits the ENOSPC errors, and must later restart again.

Does the test stop because it ran out of time, or because of some other error?

@vponomaryov
Contributor Author

vponomaryov commented Jul 4, 2023

> Does the test stop because it ran out of time, or because of some other error?

The tearDown, triggered by the end of the stress run, starts right after the error described here:

< t:2023-07-03 15:14:42,506 f:tester.py       l:2821 c:LongevityTest        p:INFO  > TearDown is starting...

@soyacz
Contributor

soyacz commented Jul 18, 2023

@vponomaryov maybe this error happens due to the new way of provisioning disks? Maybe we can now fill the node's data directory completely instead of playing with the OS API?

@vponomaryov
Contributor Author

> @vponomaryov maybe this error happens due to the new way of provisioning disks? Maybe we can now fill the node's data directory completely instead of playing with the OS API?

It may be related to the new storage setup; if so, it is probably related to the XFS configuration.

@vponomaryov
Contributor Author

While retesting fixes for other bugs, I hit this one every time.
So I plan to open a PR that skips this nemesis for now.

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Aug 17, 2023
The 'nodetool_enospc' nemesis fails all the time on K8S [1].
So, skip it until the bug gets fixed.

[1] scylladb#6327
vponomaryov added a commit that referenced this issue Aug 17, 2023
The 'nodetool_enospc' nemesis fails all the time on K8S [1].
So, skip it until the bug gets fixed.

[1] #6327
@soyacz
Contributor

soyacz commented Aug 31, 2023

I think we can no longer use Chaos Mesh for this. It looks like io_submit() was not used previously (IIUC Scylla detects whether it is feasible to use it), and the filesystem mock in use (https://docs.rs/fuser/0.7.0/fuser/trait.Filesystem.html) does not support it (and it doesn't look like it will).
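
Since the mock is FUSE-based, one way to check whether the Scylla data directory is currently served through it is to look at the mount table. The sketch below parses text in the `/proc/mounts` format and reports the filesystem type that covers a path. The helper names and the sample mount lines (device names, the `fuse` entry) are assumptions for illustration, not output captured from the affected cluster.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that covers path, or None.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins,
            # because a later mount shadows an earlier one at the same path.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_fuse(fs_type):
    """True for plain 'fuse' and for subtypes such as 'fuse.something'."""
    return fs_type == "fuse" or fs_type.startswith("fuse.")


if __name__ == "__main__":
    # Hypothetical mount table, for illustration only.
    sample = (
        "/dev/root / ext4 rw 0 0\n"
        "/dev/md0 /var/lib/scylla xfs rw,noatime 0 0\n"
        "toda /var/lib/scylla fuse rw,nosuid,nodev 0 0\n"
    )
    mount_point, fs_type = filesystem_type("/var/lib/scylla/data", sample)
    print(mount_point, fs_type, is_fuse(fs_type))
```

On a live node the text would come from reading `/proc/mounts` inside the Scylla container while the experiment is active.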

We need to rewrite this nemesis to use a different means of filling up the disk.
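
One possible direction, sketched under assumptions: fill the data volume with a real filler file so that Scylla sees a genuine ENOSPC from the filesystem rather than an injected one. This is not an existing SCT implementation; the function names, the `keep_free` parameter, and the filler file name are hypothetical, and on Kubernetes the code would have to run inside the Scylla pod or on the node that hosts the volume.

```python
import errno
import os
import shutil
import tempfile


def bytes_to_fill(free_bytes, keep_free=0):
    """Number of bytes to write so that only keep_free bytes remain free."""
    return max(free_bytes - keep_free, 0)


def write_filler(path, nbytes, chunk_size=1024 * 1024):
    """Write up to nbytes of zeros to path and return the bytes written.

    Stops early, without raising, when the filesystem reports ENOSPC.
    """
    chunk = b"\0" * chunk_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while written < nbytes:
            step = min(chunk_size, nbytes - written)
            try:
                written += os.write(fd, chunk[:step])
            except OSError as exc:
                if exc.errno == errno.ENOSPC:
                    break  # the disk is full, which is the goal
                raise
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    # Harmless demo: write 4 KiB into a temporary directory.
    with tempfile.TemporaryDirectory() as workdir:
        free = shutil.disk_usage(workdir).free
        print("free bytes:", free, "would write:", bytes_to_fill(free, keep_free=free - 4096))
        filler = os.path.join(workdir, "enospc-filler.bin")
        print("written:", write_filler(filler, 4096, chunk_size=1024))
```

Removing the filler file afterwards restores the free space, which gives the nemesis a simple way to end the disruption.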

@fruch
Contributor

fruch commented Aug 31, 2023

Let's file a bug on https://github.com/cberner/fuser with the details and why we need io_submit support,
and maybe with Chaos Mesh as well, if there is a place to file bugs.

Meanwhile, let's disable this nemesis.

We'll then consider other options based on the response from the community.

@soyacz
Contributor

soyacz commented Aug 31, 2023

> Let's file a bug on https://github.com/cberner/fuser with the details and why we need io_submit support, and maybe with Chaos Mesh as well, if there is a place to file bugs.
>
> Meanwhile, let's disable this nemesis.
>
> We'll then consider other options based on the response from the community.

Issues created:
cberner/fuser#261
chaos-mesh/chaos-mesh#4173

In the meantime, @vponomaryov has already disabled this nemesis.
