
Event severity and alerting #125

Open
neilstuartcraig opened this issue Apr 23, 2020 · 5 comments

Comments

@neilstuartcraig
Contributor

Hi all.
We're using NEL with the Reporting API across our two main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences. We hope it's useful and constructive.

Whilst NEL is clearly incredibly powerful and extremely useful, the biggest issue we have with NEL is that we don’t really know how to make use of the data it generates. From my conversations within the BBC and also some with external folks, this seems to be a common sentiment.

Ideally, we’d be able to slot NEL events into one of (probably) two buckets:

  1. Unrecoverable, critical events which we’d alert on
  2. Recoverable/informational events which would be available for improvement and triage work

The reason we can't do this today is, in my opinion, that there is currently no discriminator on NEL events to state whether they’re:

  1. Recovered-from (or not)
  2. error or info severity

Recovered-from (or not)
By way of illustration, consider the dns event class and, for example, dns.unreachable, which is described as “DNS server is unreachable”. Typically, multiple DNS nameservers are listed in NS records, so does this event mean that, say, 1 of 4 nameservers was unreachable, or does it mean that all 4 were unreachable? If the former, that’s interesting but almost certainly didn’t impact the user too much, so whilst a website operator would want to have the information available, it’s not something to jump on and fix immediately. The latter would be much more serious and would need to fire an alert to be fixed ASAP.
The same sort of issue is true of most of the dns, tcp and some of the http event classes.

error or info severity
As per the above, the severity of some NEL events depends on whether or not they’re recoverable, while others are either always high-severity (e.g. tls events) or always low-severity (e.g. unknown and abandoned, since they’re non-deterministic).

Both of these issues could be addressed by adding a severity property to the NEL event, which would depend on whether or not the event was recovered from and on the event type itself, as follows:

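    // isRecoveredFrom and alwaysHighSeverity are proposed properties, not part of NEL today.
    // Default to informational; escalate only for unrecovered or always-severe events.
    let severity = 'info';
    if (event.isRecoveredFrom === false || event.class.alwaysHighSeverity === true) {
        severity = 'error';
    }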

+@chrisn

@aaronpeters

Adding my 2 cents in what turned out to be a lengthy comment.

I have not yet used NEL in production and not at the scale of the BBC, but I do have experience with collecting CDN perf data at large scale with RUM and building a platform that makes sense of this data in real-time for the purpose of switching from one CDN to another.
The challenge there was to determine from the data if a degradation in performance of a CDN was severe enough to justify a switch to another CDN. Switch not too soon, but also not too late.
We ended up with tracking Fail Ratio and Response Time in multiple rolling windows (5min, 10min, 60min, 24hrs) at the country/ASN level and applied logic to that. Tuning the logic was an ongoing effort.
The internet is flaky and RUM is noisy, so in order to have high confidence in the derived insights (= correctly classify the severity) you need a sufficient number of measurements from a sufficient number of different clients. The latter is very important.
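The rolling-window approach described above could be sketched roughly as follows. This is a minimal illustration, not the platform referred to in this comment: the `RollingWindow` class, the `isDegraded` helper and all threshold values are hypothetical, and measurements are assumed to arrive in timestamp order.

```javascript
// Hypothetical sketch: one rolling window of measurements for a single
// country/ASN key. Names and thresholds are illustrative assumptions.
class RollingWindow {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.events = []; // { ts, failed, clientId }, oldest first
  }

  // Record one measurement; callers are assumed to add in timestamp order.
  add(ts, failed, clientId) {
    this.events.push({ ts, failed, clientId });
  }

  // Summarise the measurements that fall inside the window ending at `now`.
  stats(now) {
    const cutoff = now - this.windowMs;
    // Drop measurements that have aged out of the window.
    while (this.events.length > 0 && this.events[0].ts <= cutoff) {
      this.events.shift();
    }
    const total = this.events.length;
    const failures = this.events.filter((e) => e.failed).length;
    const clients = new Set(this.events.map((e) => e.clientId)).size;
    const failRatio = total === 0 ? 0 : failures / total;
    return { total, failures, clients, failRatio };
  }
}

// Only flag a degradation when there are enough samples from enough
// distinct clients; the default thresholds are placeholders to be tuned.
function isDegraded(stats, { minSamples = 50, minClients = 20, maxFailRatio = 0.1 } = {}) {
  if (stats.total < minSamples || stats.clients < minClients) return false;
  return stats.failRatio > maxFailRatio;
}
```

In practice one such window per duration (5min, 10min, 60min, 24hrs) would probably be kept for each country/ASN key, for example in a `Map`, with the decision logic combining their results.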

Enough of the background, but wanted to let you know where I'm coming from and I believe it is relevant here.

Recovered-from (or not)
A single browser instance (= browser of one user) would need to track 'state' and have logic to determine when the issue has been resolved, per origin and for each NEL event class (tcp.refused, dns.unreachable, http.error, etc).
How would this work?

Imagine ... the user visits www.bbc.co.uk and the `http.error` event is logged for one of the subresources.
Assuming this event does not fall in the always high-severity bucket (more on this at bottom of this comment), what should the browser do?

Do retries in the background until isRecoveredFrom can be set to true?
Assuming this is doable: when should the browser send what report?
The browser can't send a report with severity=error immediately after the event first fired, so should it first send a report with severity=info and later send another report with severity=error after a few retries have failed too?

Without immediately initiating retries in the background, in many cases it will not be possible to 'upgrade' the severity from info to error because the user does not revisit the same site for a long time and with too much time between two fetches it's not possible to make a statement like 'the fetch failed again, so this must be serious!'

error or info severity

As per the above, the severity of some NEL events depends on whether or not they’re recoverable, while others are either always high-severity (e.g. tls events) or always low-severity (e.g. unknown and abandoned, since they’re non-deterministic).

I agree certain events can always be classified as severe while others are not always severe.
Is it helpful to have the spec make statements about the severity of events and consequently have browsers add the severity property to the NEL reports?
I'm not sure.
Imho, most of the predefined network error types should be classified as always severe:

  • all DNS resolution errors except for dns.address_changed
  • all the secure connection establishment errors (tcp. and tls.)
  • all transmission of request and response errors except for http.error, abandoned and unknown

However, for some of these it's fair to say 'it depends'.
E.g. for http.error the severity depends on the response status code and on the URL.
The browser receiving a 404 response for an image subresource is not as severe as that 500 for the page document.
Maybe something can be done browser-side (e.g. 'navigation' fetches failing are always severe).
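A collector-side version of this classification might look like the sketch below. It assumes the report body carries the `type` and `status_code` fields defined by the NEL spec. The `classifySeverity` function, the rule that 5xx is severe while other statuses are informational, and the choice to treat any unlisted type as severe are all illustrative assumptions, not something the spec prescribes.

```javascript
// Types this comment treats as "it depends" or not always severe.
const NOT_ALWAYS_SEVERE = new Set([
  "dns.address_changed",
  "http.error",
  "abandoned",
  "unknown",
]);

// Hypothetical helper: map a NEL report body to "error" or "info".
function classifySeverity(body) {
  const type = body.type;
  // "ok" is the type NEL uses for successful requests.
  if (type === "ok") return "info";
  if (type === "http.error") {
    // Assumption: server errors (5xx) are severe, other statuses are not.
    return body.status_code >= 500 ? "error" : "info";
  }
  if (NOT_ALWAYS_SEVERE.has(type)) return "info";
  // Everything else (dns.*, tcp.*, tls.*, remaining http.*) is treated as severe.
  return "error";
}
```

The report body does not say whether the failed fetch was a navigation or a subresource, so a refinement along the lines of "navigation fetches failing are always severe" would need extra information, for example matching the report URL against known document paths.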

In summary:

  • recoveredFrom is hard to implement in the browser in a way that will be helpful to the website operator: as a website operator, you need a sufficient amount of error reports from a variety of users to be confident a (big) issue started happening and when it was resolved.
  • adding a severity property to NEL reports ... maybe useful, but not a must-have imho because it's not hard to put most of the current predefined network error types in the 'always severe' bucket and the list is pretty static so not hard to maintain.

@neilstuartcraig
Contributor Author

Thanks again @aaronpeters, I'll try to respond as best I can with my views:

A single browser instance (= browser of one user) would need to track 'state' and have logic to determine when the issue has been resolved, per origin and for each NEL event class (tcp.refused, dns.unreachable, http.error, etc).
How would this work?

The browser knows when it's run out of e.g. TCP retries as it presents e.g. "connection failed" to the user, same with TLS etc. So in this sense, it is tracking the relevant state - unless I am misunderstanding.

Imagine ... the user visits www.bbc.co.uk and the `http.error` event is logged for one of the subresources.
Assuming this event does not fall in the always high-severity bucket (more on this at bottom of this comment), what should the browser do?

I am not sure what you're asking here, sorry if I am being a bit dim. Some clarity which may help: I am specifically suggesting that not all event types can be both recoverable and non-recoverable; some will only be one of the two.

Do retries in the background until isRecoveredFrom can be set to true?
Assuming this is doable: when should the browser send what report?
The browser can't send a report with severity=error immediately after the event first fired, so should it first send a report with severity=info and later send another report with severity=error after a few retries have failed too?

I am not suggesting changing any behaviour, basically just firing "recoverable" events for each attempt, then "non-recovered" for the ultimate hard fail, when the browser gives up. Does that help clarify?

I agree certain events can always be classified as severe while others are not always severe.
Is it helpful to have the spec make statements about the severity of events and consequently have browsers add the severity property to the NEL reports?
I'm not sure.
Imho, most of the predefined network error types should be classified as always severe:

  • all DNS resolution errors except for dns.address_changed
  • all the secure connection establishment errors (tcp. and tls.)
  • all transmission of request and response errors except for http.error, abandoned and unknown

Yes, I agree - with an addendum: currently, the NEL spec does not define whether e.g. dns.failed should be reported on for every lookup attempt or whether it should only be reported on if all lookups for a record failed. This is the crux of the issue, I should probably have included this in the original message.

Hopefully this helps to explain my thinking a little better? That last paragraph above this one is the most important, I think.

IMO, NEL is essentially a logging system which doesn't have a severity flag. This is what (again, IMO) makes it hard to use in terms of urgent vs informational reports.

Cheers!

@dcreager
Member

the NEL spec does not define whether e.g. dns.failed should be reported on for every lookup attempt or whether it should only be reported on if all lookups for a record failed. This is the crux of the issue, I should probably have included this in the original message.

We might need to clarify the text, but we've tried to define this: we say that there should be zero or one report (depending on sampling rates) for each "network request", and clarify that if the user agent defines its behavior using the Fetch standard, this corresponds to a single execution of the "HTTP-network-fetch" algorithm.

That means that dns.failed implies that the user agent was not able to get any valid answer for its DNS lookup, regardless of how many times it issues an actual DNS request to the local resolver. And in the Happy Eyeballs case (try both IPv4 and IPv6 at the same time), that still counts as a single HTTP-network-fetch, and so you'd get a success report if either the IPv4 or IPv6 connection succeeded.

Each step in a redirect chain, though, turns into a separate HTTP-network-fetch, and so you would get separate reports for a 302 redirect and the following (hopefully) 200 response. (And both would be classified as successes, and subject to the success sampling rate.)
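Because each report stands for one sampled network request, a collector can scale reports back up to an estimate of real request counts. The sketch below assumes each report body includes the `type` and `sampling_fraction` fields from the NEL spec; the `estimateRequestCounts` function itself is a hypothetical illustration.

```javascript
// Hypothetical helper: estimate how many requests each NEL type represents,
// weighting every report by the inverse of its sampling fraction.
function estimateRequestCounts(reportBodies) {
  const estimates = new Map();
  for (const body of reportBodies) {
    const fraction = body.sampling_fraction;
    // Skip malformed reports rather than dividing by zero.
    if (!(fraction > 0 && fraction <= 1)) continue;
    const previous = estimates.get(body.type) ?? 0;
    estimates.set(body.type, previous + 1 / fraction);
  }
  return estimates;
}
```

With a low success sampling rate and a high failure sampling rate, comparing raw report counts would overstate the failure ratio, which is why the weighting matters when deriving alert thresholds.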

@aaronpeters

Thanks @dcreager, these clarifications are helpful and confirm my interpretation of what, for example, dns.failed means.

@neilstuartcraig
Contributor Author

Yeah, thanks @dcreager - that definitely matches what I was hoping for (basically only logging unrecovered-from events, i.e. errors in the SRE sense). I do think there's some room for further clarification in the spec, as I, some colleagues and some folks familiar with standards read the spec and were unsure.
I'd be happy to propose some wording if that would help as a starting point.


3 participants