Event severity and alerting

Hi all.
We're using NEL with the Reporting API across our two main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences; we hope it's useful and constructive.

Whilst NEL is clearly incredibly powerful and extremely useful, the biggest issue we have with it is that we don’t really know how to make use of the data it generates. From my conversations within the BBC, and some with external folks, this seems to be a common sentiment.

Ideally, we’d be able to slot NEL events into one of (probably) two buckets:

1. Unrecoverable, critical events which we’d alert on
2. Recoverable/informational events which would be available for improvement and triage work

We can't do this today because, in my opinion, there is currently no discriminator on NEL events to state whether they’re:

1. Recovered-from (or not)
2. `error` or `info` severity 

**Recovered-from (or not)** 
By way of illustration, consider the `dns` event class and, for example, `dns.unreachable`, which is described as “DNS server is unreachable”. Typically, multiple DNS nameservers are listed in NS records, so does this event mean that, say, 1 of 4 nameservers was unreachable, or that all 4 were unreachable? If the former, that’s interesting but almost certainly didn’t impact the user too much, so whilst a website operator would want to have the information available, it’s not something to jump on and fix immediately. The latter would be much more serious and would need to fire an alert so that it gets fixed ASAP.
The same sort of issue applies to most events in the `dns` and `tcp` classes and some in the `http` class.
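
To make the gap concrete, here is a rough sketch of what a `dns.unreachable` report looks like to a collector. The field names follow the NEL report format as I understand it, the values are made up for illustration, and `RECOVERY_HINTS` and `recoveryHints` are hypothetical names of my own, not anything NEL defines.

```javascript
// Illustrative report only: field names follow the NEL report format,
// values are invented for this example.
const report = {
  age: 0,
  type: 'network-error',
  url: 'https://www.example.com/',
  body: {
    sampling_fraction: 1.0,
    referrer: '',
    server_ip: '',
    protocol: 'http/1.1',
    method: 'GET',
    request_headers: {},
    response_headers: {},
    status_code: 0,
    elapsed_time: 143,
    phase: 'dns',
    type: 'dns.unreachable',
  },
};

// Hypothetical field names a consumer might hope to find; none are defined by NEL.
const RECOVERY_HINTS = ['recovered', 'isRecoveredFrom', 'severity'];

// Returns whichever of the hoped-for fields are actually present on the report body.
function recoveryHints(body) {
  return RECOVERY_HINTS.filter((key) => key in body);
}

console.log(recoveryHints(report.body)); // prints []
```

Nothing in the body says whether the user agent went on to resolve the name via another nameserver, so a collector has to treat "1 of 4 unreachable" and "all 4 unreachable" identically.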

**`error` or `info` severity**
As per the above, the severity of some NEL events depends on whether or not they’re recoverable. Others are either always high-severity (e.g. `tls` events) or always low-severity (e.g. `unknown` and `abandoned`, since they’re non-deterministic).

Both of these issues could be addressed by adding a `severity` property to the NEL event, which would depend on whether the event was recovered from and on the event type itself, as follows:

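```
    // Sketch only: `isRecoveredFrom` and `class.alwaysHighSeverity` are
    // proposed properties and do not exist on NEL events today.
    let severity = 'info';
    if (event.isRecoveredFrom === false || event.class.alwaysHighSeverity === true) {
        severity = 'error';
    }
```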
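
For what it's worth, here is a slightly fuller sketch of how a collector might use such a discriminator to sort reports into the two buckets described earlier. It assumes a hypothetical `recovered` boolean on the report body (NEL has no such field today), and the function names and class/type lists are mine, taken from the examples above rather than from any spec.

```javascript
// Event classes treated as always high-severity (from the examples above).
const ALWAYS_HIGH_CLASSES = new Set(['tls']);
// Event types treated as always low-severity (from the examples above).
const ALWAYS_LOW_TYPES = new Set(['unknown', 'abandoned']);

// Returns 'error' or 'info' for a NEL report body.
// `body.recovered` is a HYPOTHETICAL field standing in for the proposed discriminator.
function classifySeverity(body) {
  const type = body.type;
  if (ALWAYS_LOW_TYPES.has(type)) return 'info';
  const eventClass = type.split('.')[0];
  if (ALWAYS_HIGH_CLASSES.has(eventClass)) return 'error';
  // Only an explicit `recovered: false` escalates; a missing flag stays 'info'.
  return body.recovered === false ? 'error' : 'info';
}

// Splits reports into the two buckets: alert on errors, keep the rest for triage.
function route(reports) {
  const buckets = { alert: [], triage: [] };
  for (const report of reports) {
    const severity = classifySeverity(report.body);
    buckets[severity === 'error' ? 'alert' : 'triage'].push(report);
  }
  return buckets;
}

const buckets = route([
  { body: { type: 'dns.unreachable', recovered: true } },
  { body: { type: 'dns.unreachable', recovered: false } },
  { body: { type: 'tls.cert.name_invalid', recovered: true } },
  { body: { type: 'abandoned', recovered: false } },
]);
console.log(buckets.alert.length, buckets.triage.length); // prints 2 2
```

One design choice worth debating: the sketch treats a missing flag as `info`, which keeps alerting quiet for user agents that don't send it, at the cost of possibly under-reporting real outages.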

+@chrisn 
