The bug was uncovered during a code review: a FAILURE notification can be
missed when two or more instances are killed at the same time. (Note that,
given the race condition between the node agent restarting a killed instance
and the failure notification, only a test that kills the node agent first and
then kills the instances can be assured of seeing a FAILURE_NOTIFICATION for
each server instance killed; a node agent can restart a server instance before
Shoal reports it as FAILED.)
HealthMonitor.InDoubtPeerDetector.processCacheUpdate() iterates over all
instances in the cluster, checking whether any are in doubt. If an instance is
detected to be in doubt, HealthMonitor.InDoubtPeerDetector.determineInDoubtPeers()
notifies the FailureVerifier thread to process the current cache, looking for
in-doubt peers and verifying which instances should have a FAILURE_NOTIFICATION
sent:
synchronized (verifierLock) {
    verifierLock.notify();
    LOG.log(Level.FINER, "Done Notifying FailureVerifier for " + entry.adv.getName());
}
The notification signal from the InDoubtPeerDetector thread to the
FailureVerifier thread is the weak link in this bug. When multiple failures
happen at once, the code as currently written acts on the first instance
failure immediately. Instead, the InDoubtPeerDetector should iterate over all
instances and, if one or more instances are in doubt, notify the
FailureVerifier thread to run over all instances in the cluster cache.
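One way to make that handshake robust is to pair the notify with a pending
flag guarded by the verifier lock, and to have the verifier re-scan the whole
live cache on every wakeup. The sketch below is illustrative only — names such
as failuresPending, signalInDoubt, and startVerifier are hypothetical and not
Shoal's actual API:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the detector/verifier handshake: the detector marks
// work as pending under the lock before notifying, and the verifier drains
// the *live* cache after every wakeup, so a notify() arriving while the
// verifier is busy is not lost.
public class FailureSignalSketch {
    final Object verifierLock = new Object();
    boolean failuresPending = false;                  // survives a missed notify()
    final List<String> inDoubtCache = new CopyOnWriteArrayList<>();
    final List<String> reported = new CopyOnWriteArrayList<>();

    // Detector side: record the in-doubt instance, mark work pending, notify.
    void signalInDoubt(String instance) {
        inDoubtCache.add(instance);
        synchronized (verifierLock) {
            failuresPending = true;
            verifierLock.notify();
        }
    }

    // Verifier side: each wakeup scans ALL in-doubt instances, not a snapshot.
    Thread startVerifier(CountDownLatch done, int expected) {
        Thread t = new Thread(() -> {
            while (reported.size() < expected) {
                synchronized (verifierLock) {
                    while (!failuresPending) {
                        try { verifierLock.wait(); }
                        catch (InterruptedException e) { return; }
                    }
                    failuresPending = false;
                }
                for (String inst : inDoubtCache) {
                    if (!reported.contains(inst)) reported.add(inst);
                }
            }
            done.countDown();
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        FailureSignalSketch s = new FailureSignalSketch();
        CountDownLatch done = new CountDownLatch(1);
        s.startVerifier(done, 2);
        s.signalInDoubt("instance1");   // two near-simultaneous failures
        s.signalInDoubt("instance2");
        done.await();
        System.out.println("reported=" + s.reported);
    }
}
```

With this pattern, a second failure that arrives while the verifier is
mid-scan either lands in the cache before the scan (and is reported in the
same pass) or re-raises failuresPending (and is reported on the next wakeup).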
The likely failure mode is that the InDoubtPeerDetector runs twice: the first
run notifies the FailureVerifier to run on the instance cache, and the
FailureVerifier detects the first killed instance. The second run can then
notify the FailureVerifier while it is still verifying the first failure
(against a snapshotted cache). That second notify to an already-running
FailureVerifier thread has no effect, so the FAILURE_NOTIFICATION for the
second killed server instance is not sent until much later, when the next
failure occurs or the client is shut down.
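The underlying Java monitor semantics make this plausible: Object.notify()
only wakes a thread that is currently in wait() on that monitor; a notify()
delivered while the target thread is busy verifying is discarded, not latched.
A minimal self-contained illustration (not Shoal code):

```java
// Demonstrates that notify() is not remembered: a notify() issued while no
// thread is waiting does not satisfy a later wait(), which blocks until its
// timeout expires.
public class LostNotifyDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();

        // Notify while no thread is waiting: the signal is simply dropped.
        synchronized (lock) {
            lock.notify();
        }

        // A later wait() does NOT see the earlier notify(); it waits out the timeout.
        long start = System.nanoTime();
        synchronized (lock) {
            lock.wait(200);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("signal was lost: " + (elapsedMs >= 150));
    }
}
```

This is exactly why the second kill's FAILURE_NOTIFICATION stalls until the
next event that happens to wake the FailureVerifier again.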
Environment
Operating System: All
Platform: All
Affected Versions
[current]