Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a tracing warning when a thread blocks steps #162

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Benjscho
Copy link
Collaborator

@Benjscho Benjscho commented Jan 5, 2024

Add a warning to the sim when a given host or client blocks progress in a simulation run. This works by spawning a background thread for each run that periodically checks the steps taken by the simulation. If the number of steps is the same between checks then the thread adds the tracing info.

Few questions here:

  • Is 10s the right amount of time? Should this be configurable?
  • Is tracing warning the right way of displaying this info? Any print warnings are swallowed when running cargo test without --nocapture

Fixes #160

@Benjscho
Copy link
Collaborator Author

Benjscho commented Jan 5, 2024

@LucioFranco would you be able to review?

Copy link
Member

@LucioFranco LucioFranco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good I have just two small things we should fix/think about

src/sim.rs Outdated

if self.elapsed > self.config.duration && !is_finished {
return Err(format!(
"Ran for duration: {:?} steps: {} without completing",
self.config.duration, self.steps,
self.config.duration, self.steps.load(Relaxed),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid this load because fetch_add above returns the previous value so you can just +1 that and get the current value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, will adjust that!

src/sim.rs Outdated
Ok(_) | Err(TryRecvError::Disconnected) => break,
_ => {}
}
std::thread::sleep(Duration::from_secs(10));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to add a limit or make this exponentially backoff so that the noise of it is reduced in say a ci scenario where it may timeout and swamp up the logs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, we could log it only once per step with the step it's stuck on and then skip the log after that. I'd think something's likely broken if a single step is taking 10s of real time work

Add a warning to the sim when a given host or client blocks progress in
a simulation run. This works by spawning a background thread for each
run that periodically checks the steps taken by the simulation. If the
number of steps is the same between checks then the thread adds the
tracing info.
Copy link
Contributor

@mcches mcches left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you a consider a strategy where this is fallible? If X duration of real time elapses and the sim doesn't progress, fail the test. This saves operators from cancelling run away builds.

src/sim.rs Outdated
loop {
let prev = steps.load(std::sync::atomic::Ordering::Relaxed);
// Exit if main thread has.
match rx.try_recv() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this behave when you call run() N times in a row? It looks like you could spawn a ton of threads that don't clean up for 10s.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I'll adjust it so it uses recv_timeout - that way if sim exits early the background thread will be closed too and it'll clean up straight away

src/sim.rs Outdated
loop {
let is_finished = self.step()?;

if is_finished {
let _ = tx.send(());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the drop handle this for you?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

@mcches
Copy link
Contributor

mcches commented Jan 10, 2024

Did you a consider a strategy where this is fallible? If X duration of real time elapses and the sim doesn't progress, fail the test. This saves operators from cancelling run away builds.

I'll step back on this, and instead we can expect CI systems running these to set the timeout to be less oppinionated. For example, nextest is pretty great for configuring this.

@Benjscho
Copy link
Collaborator Author

@mcches or @LucioFranco would you mind giving this another look when you have a min? I've addressed all prev comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add warning for blocking tasks that block the sim
3 participants