Skip to content

Commit

Permalink
Add Honeycomb outage report (Oct 2 2023)
Browse files Browse the repository at this point in the history
Signed-off-by: Rui Chen <[email protected]>
  • Loading branch information
chenrui333 committed Oct 14, 2023
1 parent 239b5b8 commit aadef0f
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -278,6 +278,8 @@

[Honeycomb](https://www.honeycomb.io/blog/incident-review-shepherd-cache-delays/). On September 8th, 2022, our ingest system went down repeatedly and caused interruptions for over eight hours. We will first cover the background behind the incident with a high-level view of the relevant architecture, how we tried to investigate and fix the system, and finally, we’ll go over some meaningful elements that surfaced from our incident review process.

[Honeycomb](https://www.honeycomb.io/blog/incident-review-what-comes-up-must-first-go-down/). On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC, during which no data could be processed or accessed.

[incident.io](https://incident.io/blog/intermittent-downtime). A bad event (poison pill) in the async workers queue triggered unhandled panics that repeatedly crashed the app. This combined poorly with Heroku infrastructure, making it difficult to find the source of the problem. Applied mitigations that are generally interesting to people running web services, such as catching corner cases of Go panic recovery and splitting work by type/class to improve reliability.

[Indian Electricity Grid](https://web.archive.org/web/20220124104632/https://cercind.gov.in/2012/orders/Final_Report_Grid_Disturbance.pdf). One night in July 2012, a skewed electricity supply-demand profile developed when the northern grid drew a tremendous amount of power from the western and eastern grids. Following a series of circuit breakers tripping by virtue of under-frequency protection, the entire NEW (northern-eastern-western) grid collapsed due to the absence of islanding mechanisms. While the grid was reactivated after over 8 hours, similar conditions in the following day caused the grid to fail again. However, the restoration effort concluded almost 24 hours after the occurrence of the latter incident.
Expand Down

0 comments on commit aadef0f

Please sign in to comment.