Event severity and alerting #125

Open
@neilstuartcraig

Hi all.
We're using NEL with the Reporting API across our two main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences; we hope it's useful and constructive.

Whilst NEL is clearly incredibly powerful and extremely useful, the biggest issue we have with NEL is that we don’t really know how to make use of the data it generates. From my conversations within the BBC and also some with external folks, this seems to be a common sentiment.

Ideally, we’d be able to slot NEL events into one of (probably) two buckets:

  1. Unrecoverable, critical events which we’d alert on
  2. Recoverable/informational events which would be available for improvement and triage work

The reason we can't do this today is, in my opinion, that there is currently no discriminator on NEL events to state whether they’re:

  1. Recovered-from (or not)
  2. error or info severity

Recovered-from (or not)
By way of illustration, consider the dns event class and, for example, dns.unreachable, which is described as “DNS server is unreachable”. Typically, multiple DNS nameservers are listed in NS records, so does this event mean that, say, 1 of 4 nameservers was unreachable, or that all 4 were unreachable? If the former, that’s interesting but almost certainly didn’t impact the user much, so whilst a website operator would want to have the information available, it’s not something to jump on and fix immediately. The latter would be much more serious and would need to fire an alert so it can be fixed ASAP.
The same sort of issue is true of most of the dns, tcp and some of the http event classes.

error or info severity
As per the above, the severity of some NEL events depends on whether or not they’re recoverable. Others are either always high-severity (e.g. tls events) or always low-severity (e.g. unknown and abandoned, since they’re non-deterministic).
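
To make that split concrete, here is a rough sketch of a per-class policy lookup. The `CLASS_POLICY` table, the policy labels and the `policyFor` helper are all hypothetical names used for illustration; the table only encodes the examples mentioned above and is not part of NEL. The http class is deliberately left out, since only some of its events are affected.

```javascript
// Hypothetical policy table: encodes only the examples given in this issue.
// None of these labels exist in NEL itself.
const CLASS_POLICY = new Map([
  ["tls", "always-error"],         // e.g. tls events: always high-severity
  ["unknown", "always-info"],      // non-deterministic: always low-severity
  ["abandoned", "always-info"],    // non-deterministic: always low-severity
  ["dns", "depends-on-recovery"],  // severity depends on whether the client recovered
  ["tcp", "depends-on-recovery"],
]);

// The class is taken to be the first dot-separated segment of the type,
// e.g. "dns.unreachable" -> "dns", "abandoned" -> "abandoned".
function typeClass(type) {
  return type.split(".")[0];
}

// Anything not listed (including http) is reported as "unclassified"
// rather than guessed at.
function policyFor(type) {
  return CLASS_POLICY.get(typeClass(type)) ?? "unclassified";
}
```

With a table like this, only the "depends-on-recovery" classes would need extra information from the user agent; the others could be classified from the event type alone.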

Both of these issues could be addressed by adding a severity property to the NEL event, which would depend on whether the event was recovered from and on the event type itself, as follows:

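    // Proposed logic: default to `info`, and escalate to `error` when the event
    // was not recovered from or its class is always high-severity (e.g. tls).
    let severity = `info`;
    if (event.isRecoveredFrom === false || event.class.alwaysHighSeverity === true) {
        severity = `error`;
    }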
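
As a slightly fuller sketch of how a collector might use this, the snippet below wraps the same rule in a function and routes events into the two buckets described earlier. It assumes the proposed `isRecoveredFrom` and `class.alwaysHighSeverity` properties exist on each event, which NEL reports do not provide today; `severityOf`, `routeEvents` and the sample events are made up for illustration.

```javascript
// Hypothetical event shape: `isRecoveredFrom` and `class.alwaysHighSeverity`
// are the properties proposed in this issue, not fields NEL sends today.
function severityOf(event) {
  const unrecovered = event.isRecoveredFrom === false;
  const alwaysHigh = event.class.alwaysHighSeverity === true;
  return unrecovered || alwaysHigh ? "error" : "info";
}

// Split events into the two buckets: alert immediately vs. keep for triage.
function routeEvents(events) {
  const buckets = { alert: [], triage: [] };
  for (const event of events) {
    const bucket = severityOf(event) === "error" ? "alert" : "triage";
    buckets[bucket].push(event);
  }
  return buckets;
}

// Illustrative events only, not real report data.
const sampleEvents = [
  { type: "dns.unreachable", isRecoveredFrom: true, class: { alwaysHighSeverity: false } },
  { type: "dns.unreachable", isRecoveredFrom: false, class: { alwaysHighSeverity: false } },
  { type: "tls.cert.name_invalid", isRecoveredFrom: false, class: { alwaysHighSeverity: true } },
  { type: "abandoned", isRecoveredFrom: true, class: { alwaysHighSeverity: false } },
];

const routed = routeEvents(sampleEvents);
console.log(routed.alert.map((e) => e.type), routed.triage.map((e) => e.type));
```

Here the recovered dns.unreachable and the abandoned event land in triage, while the unrecovered dns.unreachable and the tls event would trigger an alert.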

+@chrisn
