Service notifications despite parent host being down #873

Open
djerveren opened this issue Aug 1, 2022 · 11 comments
@djerveren

Hello all.

Recently we noticed some weirdness regarding parent/child relationships and notifications: during a VPN connection drop, two services sent out notifications even though the host's parent was down. The parent/child configuration of the hosts is as follows:

Nagios -> prod-gw -> prod-mssql-1

Where obviously the Nagios machine is the parent of prod-gw, and prod-gw is the parent of prod-mssql-1.
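
For reference, a chain like this is normally expressed with the parents directive in the host definitions. A minimal sketch, with the addresses taken from the log excerpts below and an assumed generic-host template (all other directives omitted):

	define host {
	    use          generic-host    ; assumed template
	    host_name    prod-gw
	    address      10.128.0.1
	    ; no parents directive: prod-gw is reachable directly from the Nagios server
	}

	define host {
	    use          generic-host    ; assumed template
	    host_name    prod-mssql-1
	    address      10.128.0.6
	    parents      prod-gw         ; prod-gw sits between Nagios and this host
	}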

Here it all began. No notifications at this point, since the host went into SOFT DOWN before the HARD CRITICAL of the second service:

[2022-07-20 21:12:14] SERVICE ALERT: prod-mssql-1;SQL Backups ProductionLog;CRITICAL;SOFT;1;CHECK_NRPE STATE CRITICAL: Socket timeout after 30 seconds.
[2022-07-20 21:12:24] HOST ALERT: prod-mssql-1;DOWN;SOFT;1;CRITICAL - 10.128.0.6: rta nan, lost 100%
[2022-07-20 21:13:52] SERVICE ALERT: prod-mssql-1;SQL Hangfire Processing Jobs older than 30min;CRITICAL;HARD;1;(Service check timed out after 60.01 seconds)

Nagios picks up that prod-gw is in SOFT DOWN and sets prod-mssql-1 to UNREACHABLE:

[2022-07-20 21:14:53] HOST ALERT: prod-gw;DOWN;SOFT;1;CRITICAL - 10.128.0.1: rta nan, lost 100%
[2022-07-20 21:15:03] HOST ALERT: prod-mssql-1;UNREACHABLE;SOFT;4;CRITICAL - 10.128.0.6: rta nan, lost 100%

A few minutes later, the prod-gw host finally reaches HARD DOWN and a notification is correctly sent out:

[2022-07-20 21:19:26] HOST NOTIFICATION: slack-as;prod-gw;DOWN;notify-host-by-slack;CRITICAL - 10.128.0.1: rta nan, lost 100%
[2022-07-20 21:19:26] HOST ALERT: prod-gw;DOWN;HARD;5;CRITICAL - 10.128.0.1: rta nan, lost 100%

So far, everything has behaved as expected, no undesired notifications.

But after a while, the VPN connection is restored, and Nagios happens to mark prod-mssql-1 as UP before anything else:

[2022-07-20 21:33:34] HOST ALERT: prod-mssql-1;UP;HARD;1;OK - 10.128.0.6 rta 9.971ms lost 0%

This results in notifications for the two services at the top being sent out, activating our on-call staff for no reason:

[2022-07-20 21:33:40] SERVICE NOTIFICATION: slack-as;prod-mssql-1;SQL Backups ProductionLog;CRITICAL;notify-service-by-slack;CHECK_NRPE STATE CRITICAL: Socket timeout after 30 seconds.
[2022-07-20 21:33:40] SERVICE NOTIFICATION: on-call;prod-mssql-1;SQL Backups ProductionLog;CRITICAL;notify-service-by-zenduty;CHECK_NRPE STATE CRITICAL: Socket timeout after 30 seconds.
[2022-07-20 21:33:50] SERVICE NOTIFICATION: slack-as;prod-mssql-1;SQL Hangfire Processing Jobs older than 30min;CRITICAL;notify-service-by-slack;(Service check timed out after 60.01 seconds)
[2022-07-20 21:33:50] SERVICE NOTIFICATION: on-call;prod-mssql-1;SQL Hangfire Processing Jobs older than 30min;CRITICAL;notify-service-by-zenduty;(Service check timed out after 60.01 seconds)

Finally, prod-gw is checked and considered UP again, and a recovery notification is correctly sent:

[2022-07-20 21:34:25] HOST NOTIFICATION: slack-as;prod-gw;UP;notify-host-by-slack;OK - 10.128.0.1 rta 5.076ms lost 0%
[2022-07-20 21:34:25] HOST ALERT: prod-gw;UP;HARD;1;OK - 10.128.0.1 rta 5.076ms lost 0%

Shouldn't those service notifications be suppressed regardless of the state of the host prod-mssql-1? From a parent/child perspective, prod-mssql-1 was still UNREACHABLE (or should at least be treated as such from a notification perspective), since prod-gw was still considered DOWN. It was simply a race condition that caused these notifications, and in turn an unnecessary activation of on-call resources.

$ rpm -qa |grep nagios
nagiosxi-nagioscore-5.8.9-1.el8.x86_64
nagiosxi-nsca-5.8.9-1.el8.x86_64
nagiosxi-shellinabox-5.8.9-1.el8.x86_64
nagiosxi-nrds-5.8.9-1.el8.x86_64
nagiosxi-nagvis-5.8.9-1.el8.x86_64
nagiosxi-ndoutils-5.8.9-1.el8.x86_64
nagiosxi-nxti-5.8.9-1.el8.x86_64
nagiosxi-wkhtmltox-5.8.9-1.el8.x86_64
nagiosxi-5.8.9-1.el8.x86_64
nagiosxi-nagiosplugins-5.8.9-1.el8.x86_64
nagiosxi-wmic-5.8.9-1.el8.x86_64
nagiosxi-mrtg-5.8.9-1.el8.x86_64
nagiosxi-nrpe-5.8.9-1.el8.x86_64
nagiosxi-pnp-5.8.9-1.el8.x86_64
@ne-bbahn

ne-bbahn commented Jul 1, 2024

It seems someone else is experiencing similar issues where their child hosts are still giving off service notifications even though the parent is down/unreachable.
https://support.nagios.com/forum/viewtopic.php?p=357558

@ne-bbahn ne-bbahn added the Bug label Jul 1, 2024
@tonoitp

tonoitp commented Jul 1, 2024

It seems someone else is experiencing similar issues where their child hosts are still giving off service notifications even though the parent is down/unreachable. https://support.nagios.com/forum/viewtopic.php?p=357558

This is on:

  • Debian GNU/Linux 12 (bookworm)
  • Nagios® Core™, Version 4.5.3, June 11, 2024 built from source

@tonoitp

tonoitp commented Jul 2, 2024

Well ..... it's working as expected now. The child hosts show up as unreachable and their services are no longer critical.
I will try to reproduce it later, but I THINK the sequence was:

  • Added all hosts (two from two subnets; production and lab)
  • Turned off the lab network
  • Defined Parent/child relations & restarted nagios
    -> lab shows unreachable with critical services
  • Several Nagios restarts/server reboots; no change
  • Turned on lab network
    -> Everything shows okay
  • Turned off the lab network
    -> lab devices show as unreachable, and their services are no longer reported under "services problems". If you go to the host, the services are shown as OK (or just the last state, not sure)

So it does work as expected for me (after all :D )

@djerveren
Author

djerveren commented Jul 3, 2024

@tonoitp If you wish to re-create this, I believe the hosts/services on the lab network must be detected as UP/OK in a specific order.

Let's say that you have the following relationship, assuming gw1 and host1 are on the lab network: Nagios -> gw1 -> host1.

When performing the recovery on the lab network, you must make sure that host1 and its services are detected by Nagios as UP/OK before gw1. At least that was the scenario that triggered the false notifications in my case.

I suppose you can create this scenario either by increasing check_interval of gw1, or simply manually forcing a check of host and services on host1 in order for Nagios to pick up its change in status before it detects that gw1 is up.
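
As an illustrative sketch only: a forced check can be requested through the external command file. The SCHEDULE_FORCED_* commands are standard Nagios Core external commands, but the command_file path below is just the common default and must be adjusted to the local setup:

	# force host1 and all of its services to be checked right away,
	# so Nagios sees them recover before it re-checks gw1
	CMDFILE=/usr/local/nagios/var/rw/nagios.cmd    # adjust to your command_file setting
	NOW=$(date +%s)
	printf "[%s] SCHEDULE_FORCED_HOST_CHECK;host1;%s\n" "$NOW" "$NOW" > "$CMDFILE"
	printf "[%s] SCHEDULE_FORCED_HOST_SVC_CHECKS;host1;%s\n" "$NOW" "$NOW" > "$CMDFILE"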

@tonoitp

tonoitp commented Jul 10, 2024

Well, I re-used a test server I had, and found it's still a bit off.

  • Stopped the nagios process
  • Deleted all status files in the var folder
  • Copied cfg files with the parent/child config
  • Started the LAB devices
  • 5 min. later, started nagios
  • 10 min. later, everything shows OK/green
  • 15 min. later, shut down the lab devices

Five hosts show as unreachable, which is correct. But ...
Four of those five unreachables are connected identically, but only two of them show up with critical services (one with the ping service, the other on an application port).
The fifth device shows 3 of its 9 services as critical. After starting/stopping the lab, which services go critical and how many of them varies, but so far it has always been the same hosts that end up with a service as critical.

@tonoitp

tonoitp commented Jul 10, 2024

When performing the recovery on the lab network, you must make sure that host1 and its services are detected by Nagios as UP/OK before gw1. At least that was the scenario that triggered the false notifications in my case.

@djerveren Thanks for the tip. But if that is a requirement, shouldn't Nagios take care of running detection in the right order? It's made aware of the dependencies.
Also, given the test above, the fifth device shows only 3 of its 9 services as critical, so it does not feel like that's the case.

@djerveren
Author

djerveren commented Jul 11, 2024

As far as I know, after a HARD DOWN/CRITICAL, Nagios just keeps running checks based on the check_interval. Considering that in my case the child host was marked as UP at 21:33:34 and the parent host at 21:34:25 - almost a minute later - I think that shows triggered checks are not happening in the UP direction. The reason for this is likely that there is no max_check_attempts for the UP/OK states - they are always immediately HARD - so how would Nagios determine "an appropriate time" to delay notifications while it checks the parent/child chain to determine whether the parents are UP or not?
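
For context, these are the host directives involved; the values below are purely illustrative (the log above only suggests max_check_attempts 5 for prod-gw):

	define host {
	    use                  generic-host    ; assumed template
	    host_name            prod-gw
	    address              10.128.0.1
	    max_check_attempts   5    ; retries before a non-UP state goes HARD; UP has no SOFT phase
	    check_interval       5    ; regular check interval, in time units (minutes by default)
	    retry_interval       1    ; re-check interval while in a SOFT state
	}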

Regardless, if the parent (prod-gw) is in a DOWN state, Nagios shouldn't send notifications for CRITICAL services on the child host (prod-mssql-1), period.

But in my case it did. So I went here to report it.

@tsadpbb
Contributor

tsadpbb commented Jul 15, 2024

I see this check in check_service_notification_viability:

	/* if all parents are bad (usually just one), we shouldn't notify */
	if(svc->parents) {
		int bad_parents = 0, total_parents = 0;
		servicesmember *sm;
		for(sm = svc->parents; sm; sm = sm->next) {
			/* @todo: tweak this so it handles hard states and whatnot */
			if(sm->service_ptr->current_state == STATE_OK)
				bad_parents += !!sm->service_ptr->current_state;
			total_parents++;
			}
		if(bad_parents == total_parents) {
			log_debug_info(DEBUGL_NOTIFICATIONS, 1, "This service has no good parents, so notification will be blocked.\n");
			return ERROR;
			}
		}

But I do not see anything equivalent for the parent_hosts value.
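
As a rough sketch only (this is not current Nagios Core code, and it assumes the usual host_ptr/parent_hosts/hostsmember structures), an equivalent guard on the host's parents might look something like this:

	/* hypothetical sketch: block service notifications while every parent
	   of the service's host is non-UP; like the service-parents check above,
	   this only looks one level up the chain */
	if(svc->host_ptr && svc->host_ptr->parent_hosts) {
		int bad_parents = 0, total_parents = 0;
		hostsmember *hm;
		for(hm = svc->host_ptr->parent_hosts; hm; hm = hm->next) {
			if(hm->host_ptr->current_state != HOST_UP)
				bad_parents++;
			total_parents++;
			}
		if(bad_parents == total_parents) {
			log_debug_info(DEBUGL_NOTIFICATIONS, 1, "All parents of this service's host are DOWN/UNREACHABLE, so notification will be blocked.\n");
			return ERROR;
			}
		}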

I'm curious whether a few of the above issues could be fixed with host and service dependencies.

Regardless, I'm curious as to why a similar check was never added. It seems that some people are aware of this, as this appears at the bottom of the documentation:

By default, Nagios Core will notify contacts about both DOWN and UNREACHABLE host states. As an admin/tech, you might not want to get notifications about hosts that are UNREACHABLE. You know your network structure, and if Nagios Core notifies you that your router/firewall is down, you know that everything behind it is unreachable.
If you want to spare yourself from a flood of UNREACHABLE notifications during network outages, you can exclude the unreachable (u) option from the notification_options directive in your host definitions and/or the host_notification_options directive in your contact definitions.
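
A minimal sketch of that workaround, reusing the host and contact names from this report (templates and all other directives are assumed/omitted):

	define host {
	    use                     generic-host    ; assumed template
	    host_name               prod-mssql-1
	    notification_options    d,r             ; 'u' left out: no UNREACHABLE host notifications
	}

	define contact {
	    use                          generic-contact    ; assumed template
	    contact_name                 on-call
	    host_notification_options    d,r                ; likewise without 'u'
	}

Note that this only changes host notifications; it would not by itself block the service notifications described above.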

@tsadpbb
Contributor

tsadpbb commented Jul 15, 2024

In summary, I think the service notifications are being sent because there is never any check that propagates upward to check the parents of the parents and so on. It just checks whether its host is up or whether its service parents are up (amongst other things).

Basically what @djerveren said

@tonoitp If you wish to re-create this, I believe the hosts/services on the lab network must be detected as UP/OK in a specific order.

Let's say that you have the following relationship, assuming gw1 and host1 are on the lab network: Nagios -> gw1 -> host1.

When performing the recovery on the lab network, you must make sure that host1 and its services are detected by Nagios as UP/OK before gw1. At least that was the scenario that triggered the false notifications in my case.

I suppose you can create this scenario either by increasing check_interval of gw1, or simply manually forcing a check of host and services on host1 in order for Nagios to pick up its change in status before it detects that gw1 is up.

@djerveren
Author

djerveren commented Jul 16, 2024

In summary, I think the service notifications are being sent because there is never any check that propagates upward to check the parents of the parents and so on. It just checks if it's host is up or if it's service parents are up (amongst other things)

I kind of agree, but since it doesn't perform any propagated upward checks, it should then remember that prod-gw was still HARD DOWN (at least as far as Nagios was aware), which means that prod-mssql-1 should still be considered UNREACHABLE, and it should simply suppress the service notifications triggered by the recovery based on that fact alone.

I remember when working at op5 many years ago, our devs made Merlin suppress recovery notifications for hosts/services it hadn't sent out problem notifications for, which would've helped in this case.

@tsadpbb
Contributor

tsadpbb commented Oct 2, 2024

From what I can gather, a possible solution appears in two parts.

  1. Making sure notifications don't alert when any of the ancestors are down. I worry a little bit about missed recovery notifications here.
  2. I see there is a configuration option called host_down_disable_service_checks; maybe a similar mechanism would be desirable to prevent a host whose parent is down from being checked before the parent (a minimal example of that option is sketched below).
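
For reference, that option is set in the main configuration file; a minimal sketch (the path is the stock default and depends on the installation):

	# /usr/local/nagios/etc/nagios.cfg
	# when enabled, service checks are skipped while the service's host is not UP
	host_down_disable_service_checks=1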
