Event severity and alerting #125
Adding my 2 cents in what turned out to be a lengthy comment. I have not yet used NEL in production, and not at the scale of the BBC, but I do have experience collecting CDN performance data at large scale with RUM and building a platform that makes sense of this data in real time for the purpose of switching from one CDN to another. Enough of the background, but I wanted to let you know where I'm coming from, and I believe it is relevant here.

Recovered-from (or not)

Imagine ... the user visits www.bbc.co.uk and the `http.error` event is logged for one of the subresources. Do retries in the background until

Without immediately initiating retries in the background, in many cases it will not be possible to 'upgrade' the severity from
I agree certain events can always be classified as severe while others are not always severe.
However, for some of these it's fair to say 'it depends'. In summary:
Thanks again @aaronpeters, I'll try to respond as best I can with my views:
The browser knows when it's run out of e.g. TCP retries as it presents e.g. "connection failed" to the user, same with TLS etc. So in this sense, it is tracking the relevant state - unless I am misunderstanding.
I am not sure what you're asking here, sorry if I am being a bit dim. Some clarity which may help: I am specifically suggesting that not all event types are possibly both recoverable and non-recoverable; some will only be one of the two.
I am not suggesting changing any behaviour, basically just firing "recoverable" events for each attempt, then "non-recovered" for the ultimate hard fail, when the browser gives up. Does that help clarify?
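To make the idea concrete, here is a minimal sketch of that event sequence. It is not how any browser implements retries; `fetch_with_events`, the `attempt` callable and the event labels are hypothetical names chosen to match the wording above.

```python
from typing import Callable, List, Tuple


def fetch_with_events(attempt: Callable[[], bool],
                      max_attempts: int = 3) -> List[Tuple[str, int]]:
    """Run `attempt` up to `max_attempts` times and collect the events fired.

    `attempt` is a hypothetical stand-in for one network attempt and
    returns True on success. Each failed attempt fires a "recoverable"
    event; once every attempt has failed, a final "non-recovered" event
    marks the hard failure.
    """
    events: List[Tuple[str, int]] = []
    for n in range(1, max_attempts + 1):
        if attempt():
            return events  # success: no hard-failure event is fired
        events.append(("recoverable", n))
    events.append(("non-recovered", max_attempts))
    return events


# Usage: one failed attempt followed by a successful retry.
outcomes = iter([False, True])
print(fetch_with_events(lambda: next(outcomes)))
```

Under this model only the final "non-recovered" event would need to trigger an alert; the per-attempt events are informational.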
Yes, I agree - with an addendum: currently, the NEL spec does not define whether e.g.

Hopefully this helps to explain my thinking a little better? That last paragraph above this one is the most important, I think. IMO, NEL is essentially a logging system which doesn't have a severity flag. This is what (again, IMO) makes it hard to use in terms of urgent vs informational reports.

Cheers!
We might need to clarify the text, but we've tried to define this: we talk about how there should be zero or one report (depending on sampling rates) for each "network request", and clarify that if the user agent defines its behavior using the Fetch standard, this corresponds to a single execution of the "HTTP-network-fetch" algorithm. That means that

Each step in a redirect chain, though, turns into a separate HTTP-network-fetch, and so you would get separate reports for a 302 redirect and the following (hopefully) 200 response. (And both would be classified as successes, and subject to the success sampling rate.)
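As a rough illustration of the "zero or one report per network request" rule, the sketch below treats each hop of a redirect chain as its own request and applies the success sampling rate to each one independently. `reports_for_chain` is a hypothetical helper, the report dictionaries carry only a subset of the fields a real NEL report has, and it assumes every hop succeeded.

```python
import random
from typing import Dict, List


def reports_for_chain(status_codes: List[int],
                      success_fraction: float,
                      rng: random.Random) -> List[Dict[str, object]]:
    """Return the reports generated for a chain of successful responses.

    Each status code stands for one network request (one hop), so each
    hop yields zero or one report depending on the sampling decision.
    """
    reports: List[Dict[str, object]] = []
    for status in status_codes:
        # Independent sampling decision per hop.
        if rng.random() < success_fraction:
            reports.append({"type": "ok", "status_code": status})
    return reports


# Usage: a redirect followed by the final response, with full sampling.
rng = random.Random(0)
print(reports_for_chain([302, 200], success_fraction=1.0, rng=rng))
```

With a success fraction below 1.0, either hop may be dropped on its own, so a collector should not assume that it sees every step of a chain.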
Thanks @dcreager, these clarifications are helpful and confirm my interpretation of what for example
Yeah, thanks @dcreager - that definitely matches what I was hoping for (basically only logging unrecovered-from events, i.e. errors in the SRE sense). I do think there's some room for further clarification in the spec, as I, some colleagues and some folks familiar with standards who read the spec were unsure.
Hi all.
We're using NEL with the Reporting API across our 2 main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences; we hope it's useful and constructive.
Whilst NEL is clearly incredibly powerful and extremely useful, the biggest issue we have with NEL is that we don’t really know how to make use of the data it generates. From my conversations within the BBC and also some with external folks, this seems to be a common sentiment.
Ideally, we’d be able to slot NEL events into one of (probably) two buckets:
The reasons for these are, in my opinion, that there is currently no discriminator on NEL events to state whether they’re:
- `error` or `info` severity
- Recovered-from (or not)

By way of illustration, consider the `dns` event class and, for example, `dns.unreachable`, which is described as “DNS server is unreachable”. Typically, multiple DNS nameservers are listed in NS records, so does this event mean that, say, 1 of 4 nameservers was unreachable, or does it mean that all 4 were unreachable? If the former, that’s interesting but almost certainly didn’t impact the user too much, so whilst a website operator would want to have the information available, it’s not something to jump on and fix immediately, whereas the latter would be much more serious and would need to fire an alert to be fixed ASAP.

The same sort of issue is true of most of the `dns`, `tcp` and some of the `http` event classes.

`error` or `info` severity

As per the above, the severity of some NEL events depends on whether or not they’re recoverable; some others are either always high-severity (e.g. `tls` events) or always low-severity (e.g. `unknown` and `abandoned`, since they’re non-deterministic).

Both of these issues could be addressed by adding a `severity` property to the NEL event, which would depend on whether or not the event was recovered from and also on the event type itself, as follows:

+@chrisn
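One hypothetical way such a `severity` value could be derived is sketched below. The rules only follow the examples given in the text (`tls` always high, `unknown` and `abandoned` always low, everything else dependent on recovery); the function name, the `recovered` flag and the choice to treat an unknown recovery state as an error are all assumptions, and NEL does not report recovery state today.

```python
from typing import Optional

# Assumed rules, taken from the examples in the text rather than any spec.
ALWAYS_ERROR_PREFIXES = ("tls.",)
ALWAYS_INFO_TYPES = {"unknown", "abandoned"}


def classify_severity(event_type: str, recovered: Optional[bool]) -> str:
    """Return "error" or "info" for a NEL event type.

    `recovered` is a hypothetical extra signal; None means the recovery
    state is not known.
    """
    if event_type in ALWAYS_INFO_TYPES:
        return "info"
    if event_type.startswith(ALWAYS_ERROR_PREFIXES):
        return "error"
    # Recoverable classes (dns.*, tcp.*, http.*): only an unrecovered
    # failure counts as an error. An unknown state is treated as an error.
    return "info" if recovered else "error"


# Usage: the same event type maps to different severities.
print(classify_severity("dns.unreachable", recovered=True))
print(classify_severity("dns.unreachable", recovered=False))
```

This simplifies the proposal by treating every `http` type as recoverable, whereas the text says only some of them are.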