Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crucible agent behaves poorly when out of space #1588

Open
leftwo opened this issue Dec 18, 2024 · 2 comments
Open

crucible agent behaves poorly when out of space #1588

leftwo opened this issue Dec 18, 2024 · 2 comments

Comments

@leftwo
Copy link
Contributor

leftwo commented Dec 18, 2024

On rack2, we managed to run a few datasets out of space.

As a side effect, crucible-agent could not make progress, and even worse started dumping core over and over:

Here is a subest of the cores from a single minute:

total 757291379
-rw-r--r--   1 root     root     26917789 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.7252.1733796979
-rw-r--r--   1 root     root     26917789 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.8461.1733797036
-rw-r--r--   1 root     root     26921885 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.7424.1733796987
-rw-r--r--   1 root     root     26921885 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.8553.1733797040
-rw-r--r--   1 root     root     26917789 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.8098.1733797019
-rw-r--r--   1 root     root     26917789 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.8291.1733797028
-rw-r--r--   1 root     root     26921885 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.8422.1733797034
-rw-r--r--   1 root     root     26913693 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.8192.1733797023
-rw-r--r--   1 root     root     26913693 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.7309.1733796982
-rw-r--r--   1 root     root     26934173 Dec 10 02:19 core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.7288.1733796981

The backtrace is not very helpful:

alan@atrium:62426615$ mdb core.oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508.crucible-agent.7252.1733796979
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $G
C++ symbol demangling enabled
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0xe47159()
0xe47149()
rust_panic+0xd()
std::panicking::rust_panic_with_hook::hfbc2313a4516d465+0x8bf()
std::panicking::begin_panic_handler::{{closure}}::h8b784919ae9f75f5+0x95()
0xe2d309()
0xe2fc94()
0xe74692()
0x7bcf0f()
std::sys::backtrace::__rust_begin_short_backtrace::hf8d8c5b430cce8f4+0xd6()
core::ops::function::FnOnce::call_once{{vtable.shim}}::he92ab9824b080f15+0x99()
std::sys::pal::unix::thread::Thread::new::thread_start::hce81063711d644a3+0x1b()
libc.so.1`_thrp_setup+0x77(fffff5ffeef50240)
libc.so.1`_lwp_start()

stacks gives me a potential clue:

> ::stacks
THREAD           STATE    SOBJ                COUNT
2                PARKED   CV                    127
                 libc.so.1`cond_wait_queue+0x55
                 libc.so.1`__cond_wait+0xba
                 libc.so.1`cond_wait+0x2e
                 libc.so.1`pthread_cond_wait+0x15
                 parking_lot::condvar::Condvar::wait_until_internal::h9ac5d0fe0434d0b0+0x29b
                 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h64564560d7db38b9+0x250
                 tokio::runtime::scheduler::multi_thread::worker::Context::run::h0a1eb4b34914818a+0xb92
                 tokio::runtime::context::runtime::enter_runtime::hcf316515c21b7913+0x198
                 tokio::runtime::scheduler::multi_thread::worker::run::h0b7403368692f0ef+0x7c
                 tokio::runtime::task::core::Core<T,S>::poll::h318ac036334c2877+0x73
                 tokio::runtime::task::harness::Harness<T,S>::poll::h497c0b5063ffffd2+0xad
                 tokio::runtime::blocking::pool::Inner::run::h0021e3bd3e092882+0xe4
                 std::sys::backtrace::__rust_begin_short_backtrace::h9add8406ddf976c1+0x72
                 core::ops::function::FnOnce::call_once{{vtable.shim}}::h65aea51f2b98b367+0x7e
                 std::sys::pal::unix::thread::Thread::new::thread_start::hce81063711d644a3+0x1b
                 libc.so.1`_thrp_setup+0x77
                 libc.so.1`_lwp_start

5                UNPARKED <NONE>                  1
                 mio::poll::Poll::poll::hc8e472941e64088a+0x75
                 tokio::runtime::io::driver::Driver::turn::h3b096e81cd961d8a +0x18f
                 tokio::runtime::time::Driver::park_internal::h8fb83c4242ddb3f8 +0x478
                 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h64564560d7db38b9+0x147
                 tokio::runtime::scheduler::multi_thread::worker::Context::run::h0a1eb4b34914818a+0xb92
                 tokio::runtime::context::runtime::enter_runtime::hcf316515c21b7913+0x198
                 tokio::runtime::scheduler::multi_thread::worker::run::h0b7403368692f0ef+0x7c
                 tokio::runtime::task::core::Core<T,S>::poll::h318ac036334c2877+0x73
                 tokio::runtime::task::harness::Harness<T,S>::poll::h497c0b5063ffffd2+0xad
                 tokio::runtime::blocking::pool::Inner::run::h0021e3bd3e092882+0xe4
                 std::sys::backtrace::__rust_begin_short_backtrace::h9add8406ddf976c1+0x72
                 core::ops::function::FnOnce::call_once{{vtable.shim}}::h65aea51f2b98b367+0x7e
                 std::sys::pal::unix::thread::Thread::new::thread_start::hce81063711d644a3+0x1b
                 libc.so.1`_thrp_setup+0x77
                 libc.so.1`_lwp_start

1                UNPARKED <NONE>                  1
                 std::sys::pal::unix::process::process_inner::<impl std::sys::pal::unix::process::process_common::Command>::spawn::h881e8de4aa30dfc8+0x32f
                 std::process::Command::output::h96fe6c5d56dce4e6+0x2e
                 crucible_agent::ZFSDataset::new::hf55a022b3d997133+0xb2
                 tokio::runtime::park::CachedParkThread::block_on::h3cda8980b25cac95+0x6dd
                 tokio::runtime::context::runtime::enter_runtime::he2aa2680eca7fb2e+0x140
                 tokio::runtime::runtime::Runtime::block_on_inner::h560b8125ddf9328f+0x5c
                 crucible_agent::main::hb387919dfb5882f8+0x99
                 std::sys::backtrace::__rust_begin_short_backtrace::h63eeaef591368af9+6
                 std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::haaaeab14c02f7e8c +0x11
                 std::rt::lang_start_internal::haaa1b077e6cd45ae+0x1fc
                 main+0x2c
                 _start_crt+0x87
                 _start+0x18

82               UNPARKED <NONE>                  1
                 libc.so.1`raise+0x22
                 libc.so.1`abort+0x58
                 0xe47159
                 0xe47149
                 rust_panic+0xd
                 std::panicking::rust_panic_with_hook::hfbc2313a4516d465+0x8bf
                 std::panicking::begin_panic_handler::{{closure}}::h8b784919ae9f75f5+0x95
                 0xe2d309
                 0xe2fc94
                 0xe74692
                 0x7bcf0f
                 std::sys::backtrace::__rust_begin_short_backtrace::hf8d8c5b430cce8f4+0xd6
                 core::ops::function::FnOnce::call_once{{vtable.shim}}::he92ab9824b080f15+0x99
                 std::sys::pal::unix::thread::Thread::new::thread_start::hce81063711d644a3+0x1b
                 libc.so.1`_thrp_setup+0x77
                 libc.so.1`_lwp_start

Maybe we are off in Command spawn trying to create a dataset and... well.. I'm not sure what other than the attempt to create a dataset is going to fail.

The agent log from the time knows something bad has happened, but no panic messages are logged:
we have many failures like this:

2024-12-10 01:38:06.019Z ERRO crucible-agent/24992 (worker) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: Dataset 2cac30ef-1d1e-40ff-b7cc-979c01286442 creation failed: zfs create failed! out: err:cannot create 'oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible/regions/2cac30ef-1d1e-40ff-b7cc-979c01286442': out of space
2024-12-10 01:38:06.019Z INFO crucible-agent/24992 (datafile) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: region 2cac30ef-1d1e-40ff-b7cc-979c01286442 state: Requested -> Failed
2024-12-10 01:38:06.755Z INFO crucible-agent/24992 (dropshot) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: request completed

Then, we see that the service has restarted, this is right around the time on the core file. But, no panic messages in the log.

[ Dec 10 02:15:46 Stopping because all processes in service exited. ]
[ Dec 10 02:15:46 Executing start method ("/opt/oxide/crucible/bin/crucible-agent run -D /opt/oxide/crucible/bin/crucible-downstairs --dataset oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible -l [fd00:1122:3344:109::4]:32345 -P 19000 -p downstairs -s snapshot"). ]
note: configured to log to "/dev/stdout"
2024-12-10 02:15:46.746Z INFO crucible-agent/6570 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: dataset: "oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible"
2024-12-10 02:15:46.755Z INFO crucible-agent/6570 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listen IP: [fd00:1122:3344:109::4]:32345
2024-12-10 02:15:46.755Z INFO crucible-agent/6570 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: SMF instance name downstairs_prefix: "downstairs"
2024-12-10 02:15:46.782Z INFO crucible-agent/6570 (datafile) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: Using conf_path:"/data/crucible.json"
2024-12-10 02:15:46.872Z INFO crucible-agent/6570 (dropshot) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listening
    local_addr = [fd00:1122:3344:109::4]:32345
[ Dec 10 02:17:33 Stopping because all processes in service exited. ]
[ Dec 10 02:17:33 Executing start method ("/opt/oxide/crucible/bin/crucible-agent run -D /opt/oxide/crucible/bin/crucible-downstairs --dataset oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible -l [fd00:1122:3344:109::4]:32345 -P 19000 -p downstairs -s snapshot"). ]
note: configured to log to "/dev/stdout"
2024-12-10 02:17:33.706Z INFO crucible-agent/8871 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: dataset: "oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible"
2024-12-10 02:17:33.713Z INFO crucible-agent/8871 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listen IP: [fd00:1122:3344:109::4]:32345
2024-12-10 02:17:33.713Z INFO crucible-agent/8871 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: SMF instance name downstairs_prefix: "downstairs"
2024-12-10 02:17:33.741Z INFO crucible-agent/8871 (datafile) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: Using conf_path:"/data/crucible.json"
2024-12-10 02:17:33.828Z INFO crucible-agent/8871 (dropshot) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listening
    local_addr = [fd00:1122:3344:109::4]:32345
[ Dec 10 02:17:56 Stopping because all processes in service exited. ]
[ Dec 10 02:17:56 Executing start method ("/opt/oxide/crucible/bin/crucible-agent run -D /opt/oxide/crucible/bin/crucible-downstairs --dataset oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible -l [fd00:1122:3344:109::4]:32345 -P 19000 -p downstairs -s snapshot"). ]
note: configured to log to "/dev/stdout"
2024-12-10 02:17:56.292Z INFO crucible-agent/9148 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: dataset: "oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible"
2024-12-10 02:17:56.300Z INFO crucible-agent/9148 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listen IP: [fd00:1122:3344:109::4]:32345
2024-12-10 02:17:56.300Z INFO crucible-agent/9148 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: SMF instance name downstairs_prefix: "downstairs"
2024-12-10 02:17:56.324Z INFO crucible-agent/9148 (datafile) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: Using conf_path:"/data/crucible.json"
2024-12-10 02:17:56.406Z INFO crucible-agent/9148 (dropshot) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listening
    local_addr = [fd00:1122:3344:109::4]:32345
[ Dec 10 02:17:58 Stopping because all processes in service exited. ]
[ Dec 10 02:17:59 Executing start method ("/opt/oxide/crucible/bin/crucible-agent run -D /opt/oxide/crucible/bin/crucible-downstairs --dataset oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible -l [fd00:1122:3344:109::4]:32345 -P 19000 -p downstairs -s snapshot"). ]
note: configured to log to "/dev/stdout"
2024-12-10 02:17:59.020Z INFO crucible-agent/9157 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: dataset: "oxp_ae56280b-17ce-4266-8573-e1da9db6c6bb/crucible"
2024-12-10 02:17:59.027Z INFO crucible-agent/9157 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listen IP: [fd00:1122:3344:109::4]:32345
2024-12-10 02:17:59.027Z INFO crucible-agent/9157 on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: SMF instance name downstairs_prefix: "downstairs"
2024-12-10 02:17:59.051Z INFO crucible-agent/9157 (datafile) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: Using conf_path:"/data/crucible.json"
2024-12-10 02:17:59.138Z INFO crucible-agent/9157 (dropshot) on oxz_crucible_90b53c3d-42fa-4ca9-bbfc-96fff245b508: listening
    local_addr = [fd00:1122:3344:109::4]:32345
note: configured to log to "/dev/stdout"
@leftwo
Copy link
Contributor Author

leftwo commented Dec 18, 2024

The logs from above (and a bunch of other logs and core files) are all here:
/staff/core/crucible-agent-mess/BRM42220017/62426615

@jmpesp
Copy link
Contributor

jmpesp commented Dec 19, 2024

It might be exiting here:

crucible/agent/src/main.rs

Lines 308 to 318 in bd3cc54

// Apply any outstanding actions.
//
// Note: ? here means that the failure of apply_smf will cause the
// binary to terminate.
apply_smf(
&log,
&df,
regions_dataset.path().unwrap(),
&downstairs_prefix,
&snapshot_prefix,
)?;

that ? would exit without a panic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants