Auto Rescheduling Default Interval/Window Incorrect #893

Open
blevans33 opened this issue Jan 18, 2023 · 5 comments

@blevans33

It seems to me that the following two values should be equal in the default config file:
auto_rescheduling_interval
auto_rescheduling_window
Otherwise, if the window is larger than the interval, nothing in theory stops a particular check from being rescheduled to the back of the window over and over. For example, if the interval is 30 and the window is 45, only the checks rescheduled into the next 30 seconds are guaranteed to be checked in the upcoming interval, whereas the checks in the final 15 seconds of the window can be rescheduled AGAIN!

Is there anything wrong with setting interval=30, window=30?
More info: https://support.nagios.com/forum/viewtopic.php?f=7&t=65475
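
The concern can be made concrete with a toy model. This sketch is not Nagios's rescheduling code; it only assumes that a rescheduling pass runs every `interval` seconds and may move any check whose run time falls in the half-open range from the pass start to the pass start plus `window`. The function names are hypothetical and exist only for illustration.

```python
def overlap_seconds(interval: float, window: float) -> float:
    """Seconds at the tail of one pass's window that the next pass also covers."""
    return max(0.0, window - interval)


def can_be_moved_again(run_time: float, pass_start: float,
                       interval: float, window: float) -> bool:
    """Return True if a check placed at run_time by the pass at pass_start
    is still pending when the next pass starts (assumed toy model)."""
    if not (pass_start <= run_time < pass_start + window):
        raise ValueError("run_time must lie inside the current pass's window")
    next_pass_start = pass_start + interval
    # A check scheduled at or after the next pass has not run yet,
    # so the next pass is free to move it again.
    return run_time >= next_pass_start


if __name__ == "__main__":
    # interval=30, window=45: the last 15 seconds are exposed to the next pass.
    print(overlap_seconds(30, 45))
    print(can_be_moved_again(40, pass_start=0, interval=30, window=45))
    # interval=30, window=30: every check in the window runs before the next pass.
    print(overlap_seconds(30, 30))
    print(can_be_moved_again(29.9, pass_start=0, interval=30, window=30))
```

Under these assumptions, equal values leave no overlap, which is the property the question above is asking about. Whether the real scheduler behaves this way is exactly what the issue leaves open.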

blevans33 changed the title from "Auto Rescheduling Default Windows Incorrect" to "Auto Rescheduling Default Interval/Window Incorrect" on Jan 18, 2023
blevans33 referenced this issue Jan 18, 2023
adjust_check_scheduling() is intended to smooth out load by more evenly
scheduling checks, but wasn't updated to work with the new heap based
scheduling queue.

These changes closely replicate the original implementation while using
the new data structures reasonably efficiently, and providing
sub-second resolution when calculating new event run times.

The rescheduling algorithm makes some assumptions about per-check
overhead that may be overly pessimistic, and possibly not needed to
generate a smooth schedule. When rescheduled, the next run of an event
may be earlier or later than dictated by its check interval, but will
run at its regular check interval when no schedule adjustment is
needed.
blevans33 referenced this issue Jan 18, 2023
Previously we were looking at timed_event.run_time which has second
precision. This would cause rescheduling to be run only when events
occurred in the same second.

By looking at squeue_event.when we get the actual run times used by the
event scheduling priority queue with microsecond precision.
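
The effect of timestamp precision described in this commit message can be illustrated with a small sketch. It does not reproduce the `timed_event` or `squeue_event` structures; the event times and helper name below are hypothetical, and times are kept as integer microseconds to avoid floating-point noise.

```python
US_PER_SECOND = 1_000_000


def gaps(times: list[int]) -> list[int]:
    """Differences between consecutive values in a sorted list."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Three hypothetical event run times, in microseconds: 10.2 s, 10.7 s, 11.1 s.
event_times_us = [10_200_000, 10_700_000, 11_100_000]

# Whole-second view: the first two events collapse into the same second.
event_times_s = [t // US_PER_SECOND for t in event_times_us]

if __name__ == "__main__":
    print(event_times_s)         # whole-second timestamps
    print(gaps(event_times_s))   # spacing as seen with second precision
    print(gaps(event_times_us))  # spacing as seen with microsecond precision
```

With whole seconds, the first two events look simultaneous and the last two look a full second apart, although the real spacing is 0.5 s and 0.4 s. Any logic that compares run times sees a different picture depending on which field it reads.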
@sawolf
Contributor

sawolf commented Jan 25, 2023

Thanks for reaching out. Unfortunately, the answer right now is that I'm not sure - the code you were referencing is from >=4 maintainers ago and I haven't gotten deep into check scheduling recently.

In practice I've run some pretty large environments where this didn't seem to happen - even if nothing handles this case explicitly, there might be some implicit stuff in the auto-rescheduling where we "get lucky" and don't continuously procrastinate on checks. I have some vague ideas about why it might be fine but I'd rather read the code and give you a real answer instead of telling you some nonsense.

If I may ask, what prompted you to dig into this? Did you see a check (or checks) in your environment that get continuously rescheduled?

@blevans33
Author

blevans33 commented Jan 27, 2023

Thanks for following up, Sebastian (@sawolf)! I got the idea that this could be a problem from the following page, which notes potential problems with the default values. Conceptually, though, the new values it recommends don't seem to fully solve the problem (see the section "The check is failing to be scheduled or executed"): https://nagios.force.com/support/s/article/Last-Check-Time-Not-Updating-4f7efc76
I have not seen checks get continuously rescheduled first-hand, and I did not know where to start in trying to reproduce or debug this potential problem, partly because I couldn't understand the code well enough.
BTW, I have been holding off on using the auto_rescheduling feature in our configuration because I don't have high confidence that it has been well vetted (the config file still marks it as 'experimental', and it still uses the old 'buggy' default values). Can you say anything about its safety: do people use it, are there bugs to work out, is it worth it, etc.?

@blevans33
Author

Hi @sawolf, have you had a chance to look into this further?
Just keeping it fresh!

@sawolf
Contributor

sawolf commented Mar 20, 2023

Hi @blevans33 - no, I haven't had a chance to get into this yet.

@ranjithkodumbu

Hi @sawolf, this is a serious issue; I have observed it as well. Please look into it as soon as possible.
