
100% cpu usage #519

Open
dpc opened this issue Feb 24, 2025 · 7 comments

Comments

dpc commented Feb 24, 2025

Describe the bug

I have caught my long-running daemon project consuming 100% of CPU time (one core) multiple times already; the problem seems to happen randomly after a while. Attaching gdb suggests it's in iroh's discovery, in the cordyceps dependency.

#0  drop<futures_buffered::arc_slice::ArcSlotInner> ()
    at /home/dpc/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cordyceps-0.3.2/src/mpsc_queue.rs:827
#1  drop_in_place<cordyceps::mpsc_queue::MpscQueue<futures_buffered::arc_slice::ArcSlotInner>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:523
#2  drop_in_place<futures_buffered::arc_slice::ArcSliceInnerMeta> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:523
#3  drop_inner () at src/arc_slice.rs:353
#4  0x00005570405cc8b9 in drop_in_place<alloc::boxed::Box<(dyn futures_core::stream::Stream<Item=core::result::Result<iroh::discovery::DiscoveryItem, anyhow::Error>> + core::marker::Send), alloc::alloc::Global>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:523
#5  drop_in_place<core::pin::Pin<alloc::boxed::Box<(dyn futures_core::stream::Stream<Item=core::result::Result<iroh::discovery::DiscoveryItem, anyhow::Error>> + core::marker::Send), alloc::alloc::Global>>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:523
#6  0x00005570405e281c in {async_fn#0} () at src/discovery.rs:426
#7  0x00005570405d39fb in {async_block#0} () at src/discovery.rs:339
#8  poll<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}> ()
    at /home/dpc/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tracing-0.1.41/src/instrument.rs:321
#9  0x0000557040695db8 in {closure#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>> ()
    at /home/dpc/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.0/src/runtime/task/core.rs:331
#10 with_mut<tokio::runtime::task::core::Stage<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>>, core::task::poll::Poll<()>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>> ()
    at /home/dpc/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.0/src/loom/std/unsafe_cell.rs:16
#11 poll<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>> ()
    at /home/dpc/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.0/src/runtime/task/core.rs:320
#12 0x000055704057f2e2 in {closure#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>> ()
    at /home/dpc/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.43.0/src/runtime/task/harness.rs:532
#13 call_once<core::task::poll::Poll<()>, tokio::runtime::task::harness::poll_future::{closure_env#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272
#14 do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>, core::task::poll::Poll<()>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/std/src/panicking.rs:587
#15 try<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/std/src/panicking.rs:550
#16 catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>, core::task::poll::Poll<()>> ()
    at /nix/store/gafflna2cvyiq1gymrkxgyqx1lcm94ma-rust-mixed/lib/rustlib/src/rust/library/std/src/panic.rs:359
#17 poll_future<tracing::instrument::Instrumented<iroh::discovery::{impl#4}::maybe_start_after_delay::{async_block_env#0}>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>> ()
> cargo tree -i -p cordyceps
cordyceps v0.3.2
└── futures-buffered v0.2.9
    ├── n0-future v0.1.2
    │   ├── iroh v0.32.1
    │   │   ├── rostra v0.1.0 (/home/dpc/lab/rostra/crates/rostra)

It's unclear to me which dependency in the chain is at fault (probably cordyceps?), but I'm dutifully reporting it.

Platform(s)
Linux x86_64, NixOS 24.11

hawkw (Owner) commented Feb 24, 2025

When a thread in this program enters this state, does it remain stuck here forever, or does it eventually get unstuck?

hawkw (Owner) commented Feb 24, 2025

It might also be worth investigating the topmost frame in this stack (the Drop impl for futures_buffered::arc_slice::ArcSlotInner). When the cordyceps::MpscQueue is dropped, it must drop any messages currently in the queue. It's possible that the MpscQueue Drop impl itself is spinning forever, but it also seems possible that there's an item in the queue whose own Drop implementation is spinning forever.
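
For illustration, a minimal, hypothetical sketch of the general drain-on-drop pattern (my own simplification, not the actual cordyceps code): the queue's Drop walks the list and drops each remaining message, so it can hang either because the traversal never reaches the end of the list or because a single message's Drop never returns.

// Simplified sketch (hypothetical, not the real cordyceps implementation):
// a queue that owns its nodes and drains them in Drop.
struct Node<T> {
    value: T,
    next: Option<Box<Node<T>>>,
}

struct Queue<T> {
    head: Option<Box<Node<T>>>,
}

impl<T> Drop for Queue<T> {
    fn drop(&mut self) {
        // Walk the list, dropping one message at a time.
        let mut cur = self.head.take();
        while let Some(mut node) = cur {
            cur = node.next.take();
            // `node.value` is dropped here. If that Drop impl never returns,
            // or if an intrusive (raw-pointer) variant of this list contained
            // a cycle so `cur` never reached the end, this loop would spin forever.
        }
    }
}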

hawkw (Owner) commented Feb 24, 2025

Honestly, it might be a good idea to open an issue on futures-buffered. I haven't looked all that closely, but if I had to guess, I would suspect that this behavior is related to something in futures-buffered (which depends on cordyceps), rather than iroh, which depends on futures-buffered. To be clear, I don't have a particular hypothesis for what's going on here that suggests it's futures-buffered's fault, or anything like that. But, I imagine the authors of that crate might have some insight to offer here.

dpc (Author) commented Feb 24, 2025

Sure thing: conradludgate/futures-buffered#10

conradludgate commented

I have not confirmed it, but it's a possibility that I was queueing the entry twice in the same queue. I had put some effort into ensuring exclusive access, but it might not have been sufficient.

hawkw (Owner) commented Feb 24, 2025

@conradludgate Queueing the same entry twice does seem like a likely explanation — we could probably get stuck if the linked list for the MPSC queue contains an entry that's linked back to itself. Let me know if you're able to confirm that was the problem, and I'll close this issue. If not, I'm happy to help with further investigation.
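
As a rough illustration of that failure mode, here is a hypothetical, deliberately non-atomic sketch of a Vyukov-style intrusive push (not the actual futures-buffered or cordyceps code): enqueueing the same node twice makes its next pointer refer to itself, so any loop that follows next until it hits null never terminates.

use std::ptr;

struct Node {
    next: *mut Node,
}

struct Queue {
    // Most recently pushed node; producers swap themselves in here.
    head: *mut Node,
}

impl Queue {
    fn push(&mut self, node: *mut Node) {
        unsafe { (*node).next = ptr::null_mut() };
        // In a real MPSC queue this swap is atomic; modeled non-atomically here.
        let prev = std::mem::replace(&mut self.head, node);
        unsafe { (*prev).next = node };
    }
}

fn main() {
    let stub = Box::into_raw(Box::new(Node { next: ptr::null_mut() }));
    let n = Box::into_raw(Box::new(Node { next: ptr::null_mut() }));
    let mut q = Queue { head: stub };

    q.push(n);
    q.push(n); // double enqueue: `prev` is `n` itself, so `n.next = n`

    // The node now links back to itself; a consumer (including Drop) that
    // walks `next` pointers until it reaches null would never terminate.
    assert_eq!(unsafe { (*n).next }, n);

    // Clean up the raw allocations so the sketch itself doesn't leak.
    unsafe {
        drop(Box::from_raw(stub));
        drop(Box::from_raw(n));
    }
}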

conradludgate commented

Seems we have resolved the issue. It was indeed related to double queuing.
