fix: update broken links and open external links in new tab #142

Open
wants to merge 1 commit into
base: master
2 changes: 1 addition & 1 deletion Dockerfile
@@ -13,7 +13,7 @@ RUN git clone https://github.com/pagerduty/mkdocs-theme-pagerduty \

# Set our working directory and user
WORKDIR /docs
-RUN useradd -m --uid 1000 mkdocs
+RUN sudo useradd -m --uid 1000 mkdocs
USER mkdocs

# Expose MkDocs server
2 changes: 1 addition & 1 deletion docs/about.md
@@ -17,7 +17,7 @@ It is intended for on-call practitioners and those involved in an operational in

## Why do I need it?

-Incident response is something you hope to never need, but when you do, you want it to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company will be built up over time, getting better with each incident. While tools such as [PagerDuty's Modern Incidents Response](https://www.pagerduty.com/platform/modern-incident-response/) can help you recover quickly, the process you follow is just as important. This documentation will allow you to learn from the start something which has taken us years to build up. Giving you a head start on how to deal with major incidents in a way which leads to the fastest possible recovery time.
+Incident response is something you hope to never need, but when you do, you want it to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company will be built up over time, getting better with each incident. While tools such as [PagerDuty's Modern Incidents Response](https://www.pagerduty.com/platform/modern-incident-response/){:target="_blank" } can help you recover quickly, the process you follow is just as important. This documentation will allow you to learn from the start something which has taken us years to build up. Giving you a head start on how to deal with major incidents in a way which leads to the fastest possible recovery time.

## What is covered?

24 changes: 12 additions & 12 deletions docs/after/effective_post_mortems.md
@@ -8,7 +8,7 @@ Writing an effective postmortem allows us to learn quickly from our mistakes and

* Make sure the timeline is an accurate representation of events.
* Describe any technical lingo/acronyms you use that newcomers may not understand.
-* [Discuss how the incident fits into our understanding of the health and resiliency of the services affected](https://www.pagerduty.com/blog/postmortem-understand-service-reliability/).
+* [Discuss how the incident fits into our understanding of the health and resiliency of the services affected](https://www.pagerduty.com/blog/postmortem-understand-service-reliability/){:target="_blank" }.

## Don'ts

@@ -21,8 +21,8 @@ Writing an effective postmortem allows us to learn quickly from our mistakes and
* Avoid the concept of "human error." This is related to the point above about "naming and shaming," but there's a subtle difference - very rarely is the mistake "rooted" in a human performing an action, there are often several contributing factors (the script the human ran didn't have rate limiting, the documentation was out of date, etc) that can and should be addressed.
* Avoid "alternate reality" discussion in the timeline or description sections. "Service X started seeing elevated traffic early this morning, and stopped responding to requests. _*If service X had*_ rate limited the requests from the customer, _*it would not have*_ failed." & "Service X began slowly responding to requests this evening, _*there was insufficient monitoring*_ to detect the elevated CPU usage." as two examples, blends describing the actual problem with a hypothetical fix - keep the improvements separate from the description, so that each can be appropriately discussed.
* These videos go into more detail on the above points:
-* "[Three analytical traps in accident investigation](https://www.youtube.com/watch?v=TqaFT-0cY7U)"
-* "[Two views on Human Error](https://www.youtube.com/watch?v=rHeukoWWtQ8)"
+* "[Three analytical traps in accident investigation](https://www.youtube.com/watch?v=TqaFT-0cY7U){:target="_blank" }"
+* "[Two views on Human Error](https://www.youtube.com/watch?v=rHeukoWWtQ8){:target="_blank" }"

## Reviewing

@@ -40,15 +40,15 @@ Reviewing a postmortem isn't about nit-picking typos (although we should make su
## Examples
Here are some examples of postmortems from other companies as a reference,

-* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/)
-* [AWS](https://aws.amazon.com/message/5467D2/)
-* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html)
-* [Heroku](https://status.heroku.com/incidents/151)
-* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
-* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016)
-* [A List of Postmortems!](https://github.com/danluu/post-mortems)
+* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/){:target="_blank" }
+* [AWS](https://aws.amazon.com/message/5467D2/){:target="_blank" }
+* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html){:target="_blank" }
+* [Heroku](https://status.heroku.com/incidents/151){:target="_blank" }
+* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5){:target="_blank" }
+* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016){:target="_blank" }
+* [A List of Postmortems!](https://github.com/danluu/post-mortems){:target="_blank" }

## Useful Resources

-* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)
-* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/)
+* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011){:target="_blank" }
+* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/){:target="_blank" }
22 changes: 11 additions & 11 deletions docs/after/post_mortem_process.md
@@ -46,7 +46,7 @@ Once you've been designated as the owner of a postmortem, you should start creat
* Identify the Incident Commander and Scribe in this list.

1. Populate the postmortem with more detailed information.
-* For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
+* For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a post, etc. Anything which shows the data point you're trying to illustrate in the timeline.
* Add a link to the incident call recording.

1. Perform an analysis of the incident.
@@ -98,23 +98,23 @@ A general agenda for the meeting would be something like,
1. Recap the timeline, to make sure everyone agrees and is on the same page.
1. Recap important points, and any unusual items.
1. Discuss how the problem could've been caught.
-* Did it show up in [canary](https://www.pagerduty.com/blog/continuous-build-break-fix-fast#canary-releases)?
+* Did it show up in [canary](https://www.pagerduty.com/blog/continuous-build-break-fix-fast#canary-releases){:target="_blank" }?
* Could it have been caught in tests, or loadtest environment?
1. Discuss customer impact. Any comments from customers, etc.
1. Review action items that have been created, discuss if appropriate, or if more are needed, etc.

## Examples
Here are some examples of postmortems from other companies as a reference,

-* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/)
-* [AWS](https://aws.amazon.com/message/5467D2/)
-* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html)
-* [Heroku](https://status.heroku.com/incidents/151)
-* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
-* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016)
-* [A List of Postmortems!](https://github.com/danluu/post-mortems)
+* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/){:target="_blank" }
+* [AWS](https://aws.amazon.com/message/5467D2/){:target="_blank" }
+* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html){:target="_blank" }
+* [Heroku](https://status.heroku.com/incidents/151){:target="_blank" }
+* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5){:target="_blank" }
+* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016){:target="_blank" }
+* [A List of Postmortems!](https://github.com/danluu/post-mortems){:target="_blank" }

## Useful Resources

-* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)
-* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/)
+* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011){:target="_blank" }
+* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/){:target="_blank" }
2 changes: 1 addition & 1 deletion docs/before/call_etiquette.md
@@ -21,7 +21,7 @@ You've just joined an incident call and you've never been on one before. You hav

![Communication](../assets/img/misc/communicate.png)

-Standard radio [voice procedure](https://en.wikipedia.org/wiki/Radiotelephony_procedure#Procedure_words) does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are,
+Standard radio [voice procedure](https://en.wikipedia.org/wiki/Radiotelephony_procedure#Procedure_words){:target="_blank" } does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are,

* **Ack/Rog** - "I have received and understood"
* **Say Again** - "Repeat your last message"
2 changes: 1 addition & 1 deletion docs/before/different_roles.md
@@ -138,7 +138,7 @@ All of the other roles will be actively working on identifying the cause and res
1. Drafting [external communication][ecg] messages when needed, picking the appropriate template, either when asked by the IC or at own initiative
1. Asking for more information / clarification if needed for clear communication
1. Regularly notify the IC of the number of customers reporting that they are affected by the incident. This can include providing specific customer references or examples for investigation purposes.
-1. Post any publicly facing messages regarding the incident (Twitter, StatusPage, etc) once approved by the IC
+1. Post any publicly facing messages regarding the incident (X, StatusPage, etc) once approved by the IC
1. Removing an ephemeral investigation message once approved by the IC
1. Provide customers with the external message from the postmortem once it is completed.

6 changes: 3 additions & 3 deletions docs/crisis/leadership.md
@@ -15,7 +15,7 @@ When your company’s values are at the forefront, your stakeholder communicatio

The unpredictable and fluid nature of a crisis requires situational awareness. Being aware of what you know and don't know is crucial. Continually monitoring the situation, predicting statuses and being prepared to roll with the changing environment makes your company adept at crisis response and provides your team with purpose, i.e., everyone is in sync and working towards the same goal.

-An increasingly important aspect of Crisis Leadership is taking care of yourself and your team. Members of your crisis response team may have been impacted by the events but are still working to resolve it. Some of your team may have been awake for 24 hours needing someone to give them permission to step away. Fatigue may be setting in and so forth. Leveraging the functionality built into the PagerDuty platform to establish on-call rotations, hand-offs and integrate video conferencing technology like Zoom or Teams can help create a safe and healthy [on call culture](https://goingoncall.pagerduty.com/culture/) for your teams while responding to what could be a protracted situation.
+An increasingly important aspect of Crisis Leadership is taking care of yourself and your team. Members of your crisis response team may have been impacted by the events but are still working to resolve it. Some of your team may have been awake for 24 hours needing someone to give them permission to step away. Fatigue may be setting in and so forth. Leveraging the functionality built into the PagerDuty platform to establish on-call rotations, hand-offs and integrate video conferencing technology like Zoom or Teams can help create a safe and healthy [on call culture](https://goingoncall.pagerduty.com/culture/){:target="_blank" } for your teams while responding to what could be a protracted situation.

![On-call Restrictions by day and hour](../assets/img/crisis/01_oncallrestrictions.png)

@@ -85,12 +85,12 @@ Once you’ve built your handful of scenarios, assigning members of your organiz
| Marketing campaign failure | Typo, untrue product claim, wrong tone | Chief Digital Officer | Communications Chief |


-Using PagerDuty, you can build your [on-call schedule](https://support.pagerduty.com/docs/schedule-basics) right inside the platform providing visibility and accountability about who’s on call for what area of the business if a crisis situation takes place. You can also add backups using an escalation policy that alerts the next person up after a custom time delay.
+Using PagerDuty, you can build your [on-call schedule](https://support.pagerduty.com/docs/schedule-basics){:target="_blank" } right inside the platform providing visibility and accountability about who’s on call for what area of the business if a crisis situation takes place. You can also add backups using an escalation policy that alerts the next person up after a custom time delay.

![Set escalation timeouts](../assets/img/crisis/02_escalationtimeout.png)


-If you want to balance the load for your on-call team, the [round robin scheduling](https://support.pagerduty.com/docs/round-robin-scheduling) can help by alternating who’s the primary team member that’s notified for each crisis notification.
+If you want to balance the load for your on-call team, the [round robin scheduling](https://support.pagerduty.com/docs/round-robin-scheduling){:target="_blank" } can help by alternating who’s the primary team member that’s notified for each crisis notification.

![Use round-robin scheduling](../assets/img/crisis/03_roundrobin.png)
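
Review note: the hunks above all apply the same convention, appending `{:target="_blank" }` to inline links that point at external `http(s)` URLs. A minimal sketch of how that convention could be checked across the docs is shown below. It is not part of this change; the function name `links_missing_target` and the regular expression are assumptions made for illustration, and the pattern only covers simple inline links (no reference-style links, no URLs containing parentheses).

```python
import re

# Inline Markdown link [text](http...), optionally followed by an attribute
# list such as {:target="_blank" }. The negative lookbehind skips images,
# which are written as ![alt](src).
LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\((https?://[^)\s]+)\)(\{:[^}]*\})?')


def links_missing_target(markdown_text):
    """Return the URLs of external inline links without target="_blank".

    Hypothetical helper: it inspects only the attribute list that directly
    follows the closing parenthesis of the link.
    """
    missing = []
    for match in LINK_RE.finditer(markdown_text):
        url = match.group(2)
        attrs = match.group(3) or ""
        if 'target="_blank"' not in attrs:
            missing.append(url)
    return missing
```

Run over a file's contents, the helper returns an empty list when every external inline link already carries the attribute, so it could serve as a quick pre-merge check for pages this change did not touch.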
