diff --git a/Dockerfile b/Dockerfile
index 7702c58..6287aa8 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -13,7 +13,7 @@ RUN git clone https://github.com/pagerduty/mkdocs-theme-pagerduty \

 # Set our working directory and user
 WORKDIR /docs
-RUN useradd -m --uid 1000 mkdocs
+RUN sudo useradd -m --uid 1000 mkdocs
 USER mkdocs

 # Expose MkDocs server
diff --git a/docs/about.md b/docs/about.md
index d942f64..6347998 100644
--- a/docs/about.md
+++ b/docs/about.md
@@ -17,7 +17,7 @@ It is intended for on-call practitioners and those involved in an operational in

 ## Why do I need it?

-Incident response is something you hope to never need, but when you do, you want it to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company will be built up over time, getting better with each incident. While tools such as [PagerDuty's Modern Incidents Response](https://www.pagerduty.com/platform/modern-incident-response/) can help you recover quickly, the process you follow is just as important. This documentation will allow you to learn from the start something which has taken us years to build up. Giving you a head start on how to deal with major incidents in a way which leads to the fastest possible recovery time.
+Incident response is something you hope to never need, but when you do, you want it to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company will be built up over time, getting better with each incident. While tools such as [PagerDuty's Modern Incidents Response](https://www.pagerduty.com/platform/modern-incident-response/){:target="_blank" } can help you recover quickly, the process you follow is just as important. This documentation will allow you to learn from the start something which has taken us years to build up. Giving you a head start on how to deal with major incidents in a way which leads to the fastest possible recovery time.

 ## What is covered?
diff --git a/docs/after/effective_post_mortems.md b/docs/after/effective_post_mortems.md
index 9364a8d..3e50785 100644
--- a/docs/after/effective_post_mortems.md
+++ b/docs/after/effective_post_mortems.md
@@ -8,7 +8,7 @@ Writing an effective postmortem allows us to learn quickly from our mistakes and

 * Make sure the timeline is an accurate representation of events.
 * Describe any technical lingo/acronyms you use that newcomers may not understand.
-* [Discuss how the incident fits into our understanding of the health and resiliency of the services affected](https://www.pagerduty.com/blog/postmortem-understand-service-reliability/).
+* [Discuss how the incident fits into our understanding of the health and resiliency of the services affected](https://www.pagerduty.com/blog/postmortem-understand-service-reliability/){:target="_blank" }.

 ## Don'ts

@@ -21,8 +21,8 @@ Writing an effective postmortem allows us to learn quickly from our mistakes and
 * Avoid the concept of "human error." This is related to the point above about "naming and shaming," but there's a subtle difference - very rarely is the mistake "rooted" in a human performing an action, there are often several contributing factors (the script the human ran didn't have rate limiting, the documentation was out of date, etc) that can and should be addressed.
 * Avoid "alternate reality" discussion in the timeline or description sections. "Service X started seeing elevated traffic early this morning, and stopped responding to requests. _*If service X had*_ rate limited the requests from the customer, _*it would not have*_ failed." & "Service X began slowly responding to requests this evening, _*there was insufficient monitoring*_ to detect the elevated CPU usage." as two examples, blends describing the actual problem with a hypothetical fix - keep the improvements separate from the description, so that each can be appropriately discussed.
 * These videos go into more detail on the above points:
-    * "[Three analytical traps in accident investigation](https://www.youtube.com/watch?v=TqaFT-0cY7U)"
-    * "[Two views on Human Error](https://www.youtube.com/watch?v=rHeukoWWtQ8)"
+    * "[Three analytical traps in accident investigation](https://www.youtube.com/watch?v=TqaFT-0cY7U){:target="_blank" }"
+    * "[Two views on Human Error](https://www.youtube.com/watch?v=rHeukoWWtQ8){:target="_blank" }"

 ## Reviewing

@@ -40,15 +40,15 @@ Reviewing a postmortem isn't about nit-picking typos (although we should make su
 ## Examples
 Here are some examples of postmortems from other companies as a reference,

-* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/)
-* [AWS](https://aws.amazon.com/message/5467D2/)
-* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html)
-* [Heroku](https://status.heroku.com/incidents/151)
-* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
-* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016)
-* [A List of Postmortems!](https://github.com/danluu/post-mortems)
+* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/){:target="_blank" }
+* [AWS](https://aws.amazon.com/message/5467D2/){:target="_blank" }
+* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html){:target="_blank" }
+* [Heroku](https://status.heroku.com/incidents/151){:target="_blank" }
+* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5){:target="_blank" }
+* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016){:target="_blank" }
+* [A List of Postmortems!](https://github.com/danluu/post-mortems){:target="_blank" }

 ## Useful Resources

-* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)
-* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/)
+* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011){:target="_blank" }
+* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/){:target="_blank" }
diff --git a/docs/after/post_mortem_process.md b/docs/after/post_mortem_process.md
index 52a7713..81c0c7c 100644
--- a/docs/after/post_mortem_process.md
+++ b/docs/after/post_mortem_process.md
@@ -46,7 +46,7 @@ Once you've been designated as the owner of a postmortem, you should start creat
     * Identify the Incident Commander and Scribe in this list.

 1. Populate the postmortem with more detailed information.
-    * For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
+    * For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a post, etc. Anything which shows the data point you're trying to illustrate in the timeline.
     * Add a link to the incident call recording.

 1. Perform an analysis of the incident.
@@ -98,7 +98,7 @@ A general agenda for the meeting would be something like,
 1. Recap the timeline, to make sure everyone agrees and is on the same page.
 1. Recap important points, and any unusual items.
 1. Discuss how the problem could've been caught.
-    * Did it show up in [canary](https://www.pagerduty.com/blog/continuous-build-break-fix-fast#canary-releases)?
+    * Did it show up in [canary](https://www.pagerduty.com/blog/continuous-build-break-fix-fast#canary-releases){:target="_blank" }?
     * Could it have been caught in tests, or loadtest environment?
 1. Discuss customer impact. Any comments from customers, etc.
 1. Review action items that have been created, discuss if appropriate, or if more are needed, etc.
@@ -106,15 +106,15 @@ A general agenda for the meeting would be something like,
 ## Examples
 Here are some examples of postmortems from other companies as a reference,

-* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/)
-* [AWS](https://aws.amazon.com/message/5467D2/)
-* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html)
-* [Heroku](https://status.heroku.com/incidents/151)
-* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
-* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016)
-* [A List of Postmortems!](https://github.com/danluu/post-mortems)
+* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice/){:target="_blank" }
+* [AWS](https://aws.amazon.com/message/5467D2/){:target="_blank" }
+* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html){:target="_blank" }
+* [Heroku](https://status.heroku.com/incidents/151){:target="_blank" }
+* [Netflix](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5){:target="_blank" }
+* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016){:target="_blank" }
+* [A List of Postmortems!](https://github.com/danluu/post-mortems){:target="_blank" }

 ## Useful Resources

-* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)
-* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/)
+* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](https://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011){:target="_blank" }
+* [Blame. Language. Sharing.](https://fractio.nl/2015/10/30/blame-language-sharing/){:target="_blank" }
diff --git a/docs/before/call_etiquette.md b/docs/before/call_etiquette.md
index c34a28a..dc49822 100644
--- a/docs/before/call_etiquette.md
+++ b/docs/before/call_etiquette.md
@@ -21,7 +21,7 @@ You've just joined an incident call and you've never been on one before. You hav

 ![Communication](../assets/img/misc/communicate.png)

-Standard radio [voice procedure](https://en.wikipedia.org/wiki/Radiotelephony_procedure#Procedure_words) does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are,
+Standard radio [voice procedure](https://en.wikipedia.org/wiki/Radiotelephony_procedure#Procedure_words){:target="_blank" } does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are,

 * **Ack/Rog** - "I have received and understood"
 * **Say Again** - "Repeat your last message"
diff --git a/docs/before/different_roles.md b/docs/before/different_roles.md
index 041dbfb..10f0049 100644
--- a/docs/before/different_roles.md
+++ b/docs/before/different_roles.md
@@ -138,7 +138,7 @@ All of the other roles will be actively working on identifying the cause and res
 1. Drafting [external communication][ecg] messages when needed, picking the appropriate template, either when asked by the IC or at own initiative
 1. Asking for more information / clarification if needed for clear communication
 1. Regularly notify the IC of the number of customers reporting that they are affected by the incident. This can include providing specific customer references or examples for investigation purposes.
-1. Post any publicly facing messages regarding the incident (Twitter, StatusPage, etc) once approved by the IC
+1. Post any publicly facing messages regarding the incident (X, StatusPage, etc) once approved by the IC
 1. Removing an ephemeral investigation message once approved by the IC
 1. Provide customers with the external message from the postmortem once it is completed.

diff --git a/docs/crisis/leadership.md b/docs/crisis/leadership.md
index 1f6fa14..c236d33 100644
--- a/docs/crisis/leadership.md
+++ b/docs/crisis/leadership.md
@@ -15,7 +15,7 @@ When your company’s values are at the forefront, your stakeholder communicatio

 The unpredictable and fluid nature of a crisis requires situational awareness. Being aware of what you know and don't know is crucial. Continually monitoring the situation, predicting statuses and being prepared to roll with the changing environment makes your company adept at crisis response and provides your team with purpose, i.e., everyone is in sync and working towards the same goal.

-An increasingly important aspect of Crisis Leadership is taking care of yourself and your team. Members of your crisis response team may have been impacted by the events but are still working to resolve it. Some of your team may have been awake for 24 hours needing someone to give them permission to step away. Fatigue may be setting in and so forth. Leveraging the functionality built into the PagerDuty platform to establish on-call rotations, hand-offs and integrate video conferencing technology like Zoom or Teams can help create a safe and healthy [on call culture](https://goingoncall.pagerduty.com/culture/) for your teams while responding to what could be a protracted situation.
+An increasingly important aspect of Crisis Leadership is taking care of yourself and your team. Members of your crisis response team may have been impacted by the events but are still working to resolve it. Some of your team may have been awake for 24 hours needing someone to give them permission to step away. Fatigue may be setting in and so forth. Leveraging the functionality built into the PagerDuty platform to establish on-call rotations, hand-offs and integrate video conferencing technology like Zoom or Teams can help create a safe and healthy [on call culture](https://goingoncall.pagerduty.com/culture/){:target="_blank" } for your teams while responding to what could be a protracted situation.

 ![On-call Restrictions by day and hour](../assets/img/crisis/01_oncallrestrictions.png)

@@ -85,12 +85,12 @@ Once you’ve built your handful of scenarios, assigning members of your organiz
 | Marketing campaign failure | Typo, untrue product claim, wrong tone | Chief Digital Officer | Communications Chief |


-Using PagerDuty, you can build your [on-call schedule](https://support.pagerduty.com/docs/schedule-basics) right inside the platform providing visibility and accountability about who’s on call for what area of the business if a crisis situation takes place. You can also add backups using an escalation policy that alerts the next person up after a custom time delay.
+Using PagerDuty, you can build your [on-call schedule](https://support.pagerduty.com/docs/schedule-basics){:target="_blank" } right inside the platform providing visibility and accountability about who’s on call for what area of the business if a crisis situation takes place. You can also add backups using an escalation policy that alerts the next person up after a custom time delay.

 ![Set escalation timeouts](../assets/img/crisis/02_escalationtimeout.png)


-If you want to balance the load for your on-call team, the [round robin scheduling](https://support.pagerduty.com/docs/round-robin-scheduling) can help by alternating who’s the primary team member that’s notified for each crisis notification.
+If you want to balance the load for your on-call team, the [round robin scheduling](https://support.pagerduty.com/docs/round-robin-scheduling){:target="_blank" } can help by alternating who’s the primary team member that’s notified for each crisis notification.

 ![Use round-robin scheduling](../assets/img/crisis/03_roundrobin.png)

diff --git a/docs/crisis/operations.md b/docs/crisis/operations.md
index 76fe37f..10ee2b4 100644
--- a/docs/crisis/operations.md
+++ b/docs/crisis/operations.md
@@ -7,19 +7,19 @@ description: Operationalizing your crisis plan begins by making practical change

 Operationalizing your crisis plan begins by making practical changes to ensure you have what you need, in the way you need it, and at the time you need it. For example, your broader crisis management plan will be too cumbersome for your team to scan through for answers during a crisis situation. On the other hand, playbooks are more focused versions of your larger plan which make them easier to action, test and maintain. They’re also scenario-driven and provide you with specific parameters, considerations and tasks.

-Once you have these critical resources created, it can be difficult to centralize them and keep track of the most current version. PagerDuty makes this easy with the ability to add your runbooks, playbooks, policies and any other crisis response [documentation links](https://support.pagerduty.com/docs/service-profile#remediate) into your PagerDuty defined service(s).
+Once you have these critical resources created, it can be difficult to centralize them and keep track of the most current version. PagerDuty makes this easy with the ability to add your runbooks, playbooks, policies and any other crisis response [documentation links](https://support.pagerduty.com/docs/service-profile#remediate){:target="_blank" } into your PagerDuty defined service(s).

 ![Ensure that your PagerDuty services have links to their runbooks and documentation](../assets/img/crisis/04_remediationdocs.png)

 ## Crisis Classification Scheme

-Waking up your Executive Crisis Leadership Team in the middle of the night with a PagerDuty alert should be a very rare occurrence. Having a [classification scheme](https://support.pagerduty.com/docs/incident-priority#establish-an-incident-classification-scheme) in place to rank the actual or anticipated materiality of an event will help you avoid a cry wolf scenario. A simple scale such as Low, Medium, High or Level 1, 2, 3 can be effective.
+Waking up your Executive Crisis Leadership Team in the middle of the night with a PagerDuty alert should be a very rare occurrence. Having a [classification scheme](https://support.pagerduty.com/docs/incident-priority#establish-an-incident-classification-scheme){:target="_blank" } in place to rank the actual or anticipated materiality of an event will help you avoid a cry wolf scenario. A simple scale such as Low, Medium, High or Level 1, 2, 3 can be effective.

-Within PagerDuty, you can add your crisis “material impact levels” using the [incident priority](https://support.pagerduty.com/docs/incident-priority) feature. Remember that not all crises begin as a crisis. It may develop out of an ongoing incident so determining your thresholds for escalation ahead of time (e.g., 90 minutes without HVAC, 24 hours without direct contact, greater than $100k revenue at risk, etc.) is equally as important as the rankings.
+Within PagerDuty, you can add your crisis “material impact levels” using the [incident priority](https://support.pagerduty.com/docs/incident-priority){:target="_blank" } feature. Remember that not all crises begin as a crisis. It may develop out of an ongoing incident so determining your thresholds for escalation ahead of time (e.g., 90 minutes without HVAC, 24 hours without direct contact, greater than $100k revenue at risk, etc.) is equally as important as the rankings.

 ![Set and define priorities that make sense for your organization](../assets/img/crisis/05_priorities.png)

-Once you’ve defined your priorities, you can begin to leverage PagerDuty to automate parts of your crisis response through integrations and [incident workflows](https://support.pagerduty.com/docs/incident-workflows). You can integrate with Slack, Teams or Zoom for creating communications channels. You can auto-publish from templates to post on internal status pages. You can auto-initiate stakeholder alerts or [subscriptions](https://support.pagerduty.com/docs/communicate-with-stakeholders#add-subscribers-at-incident-creation), etc.
+Once you’ve defined your priorities, you can begin to leverage PagerDuty to automate parts of your crisis response through integrations and [incident workflows](https://support.pagerduty.com/docs/incident-workflows){:target="_blank" }. You can integrate with Slack, Teams or Zoom for creating communications channels. You can auto-publish from templates to post on internal status pages. You can auto-initiate stakeholder alerts or [subscriptions](https://support.pagerduty.com/docs/communicate-with-stakeholders#add-subscribers-at-incident-creation){:target="_blank" }, etc.

 ![Use incident workflows to streamline response.](../assets/img/crisis/06_incidentworkflows.png)

@@ -29,15 +29,15 @@ In a crisis situation, time savings are everything. Decreasing the mean time to

 Does your crisis response team operate the same in a crisis as they do in normal business situations? Your answer should be no. Operating in a “crisis mode” should be distinctive because all actions and decisions are amplified, the tempo is quicker, the need for timely decisions is critical, the complexity of the problems are greater, the risks are higher, etc.

-The Crisis Team Leader needs to clearly and definitively signal that the modes of thinking and processing have shifted. What better way to signal that shift than through a PagerDuty alert? The incident priority feature is an easy way to make that declaration to the necessary stakeholders in a not so public way. Declaring the response as over is also important in transitioning to normal or new ways of doing things, which can be completed by [resolving the alert](https://support.pagerduty.com/docs/alerts#resolve-alerts) created on your crisis service(s) or posting to an internal status page.
+The Crisis Team Leader needs to clearly and definitively signal that the modes of thinking and processing have shifted. What better way to signal that shift than through a PagerDuty alert? The incident priority feature is an easy way to make that declaration to the necessary stakeholders in a not so public way. Declaring the response as over is also important in transitioning to normal or new ways of doing things, which can be completed by [resolving the alert](https://support.pagerduty.com/docs/alerts#resolve-alerts){:target="_blank" } created on your crisis service(s) or posting to an internal status page.

 ## Crisis Response Management Operations

-If you’ve followed along so far, you’ve essentially learned the ins and outs of a PagerDuty instance for crisis response. During your response, you don’t want to worry about how to contact the Crisis Team Leaders or which conference bridge you should be using or where your most up to date playbook is located. The operations side of things should just work. Aside from PagerDuty’s built-in alerting capabilities, the platform has 700+ [integrations](https://www.pagerduty.com/integrations/#Integrations-library) and more are possible through the API so you can bring your existing technology stack.
+If you’ve followed along so far, you’ve essentially learned the ins and outs of a PagerDuty instance for crisis response. During your response, you don’t want to worry about how to contact the Crisis Team Leaders or which conference bridge you should be using or where your most up to date playbook is located. The operations side of things should just work. Aside from PagerDuty’s built-in alerting capabilities, the platform has 700+ [integrations](https://www.pagerduty.com/integrations/#Integrations-library){:target="_blank" } and more are possible through the API so you can bring your existing technology stack.

-[Adding integrations](https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service) to your service(s) for crisis response at the minimum should include an email integration, an instant messaging integration with Slack, Google Chat, etc. and a video conferencing tool such as Zoom, Microsoft Teams, etc. This standard grouping enables you to trigger alerts multiple ways (e.g., web, mobile, email, API and instant messaging) and alert or advise your Executive Crisis Leadership Team that something is up (e.g., PagerDuty alert via email, SMS, push or voice, automated group channel message and [subscribers](https://support.pagerduty.com/docs/communicate-with-stakeholders#subscribe-to-a-business-service) to a service).
+[Adding integrations](https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service){:target="_blank" } to your service(s) for crisis response at the minimum should include an email integration, an instant messaging integration with Slack, Google Chat, etc. and a video conferencing tool such as Zoom, Microsoft Teams, etc. This standard grouping enables you to trigger alerts multiple ways (e.g., web, mobile, email, API and instant messaging) and alert or advise your Executive Crisis Leadership Team that something is up (e.g., PagerDuty alert via email, SMS, push or voice, automated group channel message and [subscribers](https://support.pagerduty.com/docs/communicate-with-stakeholders#subscribe-to-a-business-service){:target="_blank" } to a service).

-Given the scope of the [PagerDuty Operations Cloud](https://www.pagerduty.com/operations-cloud/), you’re likely not the only group within your organization running their operations through the platform. Your Customer Service organization may be using the platform alongside your Technical Operations organization. As a result, you’ll want to deploy some tradecraft as you trigger alerts, add notes and publish status pages to maintain the right level of privacy and compliance.
+Given the scope of the [PagerDuty Operations Cloud](https://www.pagerduty.com/operations-cloud/){:target="_blank" }, you’re likely not the only group within your organization running their operations through the platform. Your Customer Service organization may be using the platform alongside your Technical Operations organization. As a result, you’ll want to deploy some tradecraft as you trigger alerts, add notes and publish status pages to maintain the right level of privacy and compliance.

 ![The PagerDuty Operations Cloud](../assets/img/crisis/07_operationscloud.png)

diff --git a/docs/crisis/pagerduty.md b/docs/crisis/pagerduty.md
index f71d8b2..83dfe60 100644
--- a/docs/crisis/pagerduty.md
+++ b/docs/crisis/pagerduty.md
@@ -6,45 +6,45 @@ description: PagerDuty's Operations Cloud provides various tools and features th
 ## PagerDuty Configuration
 How to set up your Crisis Response Management instance in PagerDuty:

-[PagerDuty Mobile app](https://support.pagerduty.com/docs/mobile-app) - Ask each member to install and configure the mobile app for maximum reachability.
+[PagerDuty Mobile app](https://support.pagerduty.com/docs/mobile-app){:target="_blank" } - Ask each member to install and configure the mobile app for maximum reachability.

-[User Management](https://support.pagerduty.com/docs/users#add-users) - Make sure you’ve added your Executive Crisis Leadership and Crisis Response Team members to the system.
+[User Management](https://support.pagerduty.com/docs/users#add-users){:target="_blank" } - Make sure you’ve added your Executive Crisis Leadership and Crisis Response Team members to the system.

-[Contact information](https://support.pagerduty.com/docs/user-profile) - Ask each member to log into the web application and update their profile information including their phone, email and SMS contact information especially if they’ve changed devices.
+[Contact information](https://support.pagerduty.com/docs/user-profile){:target="_blank" } - Ask each member to log into the web application and update their profile information including their phone, email and SMS contact information especially if they’ve changed devices.

 ![PagerDuty user contact information settings](../assets/img/crisis/09_usercontactinfo.png)

-[Notification rules](https://support.pagerduty.com/docs/user-profile#notification-rules) - Ask each member to set their high urgency, low urgency, handoff and subscriber notification rules under their profile.
+[Notification rules](https://support.pagerduty.com/docs/user-profile#notification-rules){:target="_blank" } - Ask each member to set their high urgency, low urgency, handoff and subscriber notification rules under their profile.

 ![Use multiple contact methods for high urgency incidents](../assets/img/crisis/10_highurgencynotifications.png)

-[Teams](https://support.pagerduty.com/docs/teams) - Create teams for your Executive Crisis Leadership Team, each of your Crisis Team Leaders, and essential support functions like Crisis Communications, IT or Legal
+[Teams](https://support.pagerduty.com/docs/teams){:target="_blank" } - Create teams for your Executive Crisis Leadership Team, each of your Crisis Team Leaders, and essential support functions like Crisis Communications, IT or Legal

-[Services](https://support.pagerduty.com/docs/services-and-integrations#create-a-service) - Create and configure a service for each of your crisis categories led by your Crisis Team Leaders, e.g., supply chain, human resources, critical infrastructure, geopolitics, physical security, etc.
+[Services](https://support.pagerduty.com/docs/services-and-integrations#create-a-service){:target="_blank" } - Create and configure a service for each of your crisis categories led by your Crisis Team Leaders, e.g., supply chain, human resources, critical infrastructure, geopolitics, physical security, etc.
-[Urgency](https://support.pagerduty.com/docs/service-settings#notification-urgency) - Set your notification urgency for each service whether high, low, dynamic or based on operating hours +[Urgency](https://support.pagerduty.com/docs/service-settings#notification-urgency){:target="_blank" } - Set your notification urgency for each service whether high, low, dynamic or based on operating hours -[Escalation policies](https://support.pagerduty.com/docs/escalation-policies#create-an-escalation-policy) - Decide who gets notified first and how long before the notification escalates to the next team member and configure round robin scheduling if you wish to alternate per crisis +[Escalation policies](https://support.pagerduty.com/docs/escalation-policies#create-an-escalation-policy){:target="_blank" } - Decide who gets notified first and how long before the notification escalates to the next team member and configure round robin scheduling if you wish to alternate per crisis ![Escalation policies determine which responders are contacted](../assets/img/crisis/11_escalationpolicy.png) -[Integrations](https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service) - Add your instant messaging, video conferencing tool or create a custom email integration or connections to other systems for triggering alerts +[Integrations](https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service){:target="_blank" } - Add your instant messaging, video conferencing tool or create a custom email integration or connections to other systems for triggering alerts -[Schedules](https://support.pagerduty.com/docs/schedule-basics#create-a-schedule) - Create your on-call rotations for the teams associated with each crisis service +[Schedules](https://support.pagerduty.com/docs/schedule-basics#create-a-schedule){:target="_blank" } - Create your on-call rotations for the teams associated with each crisis service ![Using 
multiple layers in schedules helps teams create full coverage](../assets/img/crisis/12_schedulelayers.png) -[Incident Priority](https://support.pagerduty.com/docs/incident-priority) - Add your custom classification scheme for your crisis response escalation levels +[Incident Priority](https://support.pagerduty.com/docs/incident-priority){:target="_blank" } - Add your custom classification scheme for your crisis response escalation levels -[Incident workflows](https://support.pagerduty.com/docs/incident-workflows) - Create your workflows for each crisis based on conditions such as priority, status and urgency using system templates or from scratch +[Incident workflows](https://support.pagerduty.com/docs/incident-workflows){:target="_blank" } - Create your workflows for each crisis based on conditions such as priority, status and urgency using system templates or from scratch ![Incident workflows can help with communication and coordination](../assets/img/crisis/13_incidentworkflows.png) -[On-call readiness report](https://support.pagerduty.com/docs/on-call-readiness-reports) - Confirm that your teams are on-call ready and properly configured +[On-call readiness report](https://support.pagerduty.com/docs/on-call-readiness-reports){:target="_blank" } - Confirm that your teams are on-call ready and properly configured -[Postmortem template](https://support.pagerduty.com/docs/postmortems#customize-the-postmortem-template) - Configure your postmortem template to fit your needs post-crisis +[Postmortem template](https://support.pagerduty.com/docs/postmortems#customize-the-postmortem-template){:target="_blank" } - Configure your postmortem template to fit your needs post-crisis -[Status pages](https://support.pagerduty.com/docs/status-pages) - Configure your status page templates for internal stakeholders +[Status pages](https://support.pagerduty.com/docs/status-pages){:target="_blank" } - Configure your status page templates for internal stakeholders ![Use status updates 
to communicate with stakeholders](../assets/img/crisis/14_incidentstatusupdates.png) diff --git a/docs/crisis/prework.md b/docs/crisis/prework.md index e9fa8c7..1092c2d 100644 --- a/docs/crisis/prework.md +++ b/docs/crisis/prework.md @@ -9,6 +9,6 @@ You now have your Executive Crisis Leadership team, your crisis response managem ## Crisis Simulations Conducting discussion-based tabletop exercises with your team is an ideal starting point. However, leveraging functional exercises to simulate your level of maturity with crisis coordination, and command and control is also important. Running a crisis simulation using PagerDuty is as simple as triggering an alert on your crisis service—randomly if you really want to simulate real life. You would then follow your typical process of getting the right people on a conference call or instant messaging channel through an integration and running through a scenario with your corresponding playbook. -The PagerDuty platform will automatically track the length of the exercise and record any notes or status changes in the timeline which you can then use in your [postmortem](https://postmortems.pagerduty.com/what_is/) (i.e, after action report or hotwash) and in developing further tabletops or simulations. +The PagerDuty platform will automatically track the length of the exercise and record any notes or status changes in the timeline which you can then use in your [postmortem](https://postmortems.pagerduty.com/what_is/){:target="_blank" } (i.e., after action report or hotwash) and in developing further tabletops or simulations. A biannual cadence for crisis simulations provides sufficient time for preparation and to review the findings in the postmortem. 
diff --git a/docs/during/during_an_incident.md b/docs/during/during_an_incident.md index c6c25a8..88cd526 100644 --- a/docs/during/during_an_incident.md +++ b/docs/during/during_an_incident.md @@ -58,7 +58,7 @@ Resolve the incident as quickly and as safely as possible, use the Deputy to ass * **Degraded Service Behavior without load:** Gather forensic data (heap dumps, etc), and consider doing a rolling restart. 1. Listen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct Customer Liaison accordingly. - * Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly ("If in doubt, tweet it out"). + * Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly ("If in doubt, post it out"). 1. Keep track of your [span of control](../training/glossary.md#span-of-control). If the response starts to become larger, or the incident increases in complexity, consider [splitting off sub-teams](../before/complex_incidents.md#spinning-off-sub-teams) in order to get a more effective response. @@ -112,7 +112,7 @@ You are there to support the Incident Commander in identifying the cause of the ## Steps for Customer Liaison Be on stand-by to post public-facing messages regarding the incident. -1. You will typically be required to update the status page and to send tweets from our various accounts at certain times during the call. +1. You will typically be required to update the status page and to send posts from our various accounts at certain times during the call. 1. Follow instructions from the Incident Commander. 
diff --git a/docs/during/external_communication_guidelines.md b/docs/during/external_communication_guidelines.md index 23df9de..6fb7def 100644 --- a/docs/during/external_communication_guidelines.md +++ b/docs/during/external_communication_guidelines.md @@ -3,11 +3,11 @@ cover: assets/img/covers/whos_on-call.png description: Information on how to manage external communications --- -Information on how to manage external communications during an incident. See our [role descriptions](../before/different_roles/) for information about who is responsible for external communications. +Information on how to manage external communications during an incident. See our [role descriptions](../before/different_roles.md) for information about who is responsible for external communications. ## When to communicate publicly -Before you decide to communicate an incident, it’s important to have an agreed-upon set of criteria for when a major incident is communicated. False alarms and short-lived issues can sometimes kick off incident calls, so knowing when communication is appropriate will help your customers avoid widespread panic. This can be tied to your organization’s definition of [what an incident is](https://response.pagerduty.com/before/what_is_an_incident/), and/or your [severity levels](https://response.pagerduty.com/before/severity_levels/). +Before you decide to communicate an incident, it’s important to have an agreed-upon set of criteria for when a major incident is communicated. False alarms and short-lived issues can sometimes kick off incident calls, so knowing when communication is appropriate will help your customers avoid widespread panic. This can be tied to your organization’s definition of [what an incident is](../before/what_is_an_incident.md), and/or your [severity levels](../before/severity_levels.md). 
You might consider the following criteria as well: @@ -23,7 +23,7 @@ We also recommend coming up with a set of templates for different stages of an i ### Initial communication: -The first communication should indicate that an incident is under investigation. The goal here is to avoid a customer experiencing symptoms of the incident, checking status pages or Twitter accounts, and not seeing awareness of the issue from the business. +The first communication should indicate that an incident is under investigation. The goal here is to avoid a customer experiencing symptoms of the incident, checking status pages or social media accounts, and not seeing awareness of the issue from the business. - Decision and posting of initial communication happens within 5 minutes of kicking off the incident call. - These messages should be entirely templated for ease of action. @@ -63,7 +63,7 @@ Your final communication should be posted when full recovery of the incident has - Clear indication of any data loss or lingering corruption. - If there are no lingering impacts, clearly note this in the update. -Once this is posted, continue to follow the steps for [After an Incident](https://response.pagerduty.com/after/after_an_incident/) and the [Postmortem Process](https://response.pagerduty.com/after/post_mortem_process/). +Once this is posted, continue to follow the steps for [After an Incident](../after/after_an_incident.md) and the [Postmortem Process](../after/post_mortem_process.md). 
## Quick Reference diff --git a/docs/during/security_incident_response.md b/docs/during/security_incident_response.md index 19eef2e..6ecc5bd 100644 --- a/docs/during/security_incident_response.md +++ b/docs/during/security_incident_response.md @@ -163,8 +163,8 @@ Once you have validated all of the information you have is accurate, have a time ## Additional Reading -* [Computer Security Incident Handling Guide](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf) (NIST) -* [Incident Handler's Handbook](https://www.sans.org/reading-room/whitepapers/incident/incident-handlers-handbook-33901) (SANS) -* [Responding to IT Security Incidents](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc700825(v=technet.10)) (Microsoft) -* [Defining Incident Management Processes for CSIRTs: A Work in Progress](https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=7153) (CMU) -* [Creating and Managing Computer Security Incident Handling Teams (CSIRTS)](https://www.first.org/conference/2008/papers/killcrece-georgia-slides.pdf) (CERT) +* [Computer Security Incident Handling Guide](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf){:target="_blank" } (NIST) +* [Incident Handler's Handbook](https://www.sans.org/reading-room/whitepapers/incident/incident-handlers-handbook-33901){:target="_blank" } (SANS) +* [Responding to IT Security Incidents](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc700825(v=technet.10)){:target="_blank" } (Microsoft) +* [Defining Incident Management Processes for CSIRTs: A Work in Progress](https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=7153){:target="_blank" } (CMU) +* [Creating and Managing Computer Security Incident Handling Teams (CSIRTS)](https://www.first.org/conference/2008/papers/killcrece-georgia-slides.pdf){:target="_blank" } (CERT) diff --git a/docs/getting_started.md b/docs/getting_started.md index 13d555b..cbaf76f 100644 --- a/docs/getting_started.md 
+++ b/docs/getting_started.md @@ -50,7 +50,7 @@ Adding more detailed fields and information can come later. Run a fake incident, mobilize your responders, and have someone act as the Incident Commander. Get used to the switch from normal day-to-day operations and the emergency operations of an incident. Switching to having an Incident Commander running the show can be jarring at first, so it helps to practice it in a low-risk situation to begin with. -Playing a game of "[Keep Talking and Nobody Explodes](https://www.keeptalkinggame.com/)" is a light-hearted way of practicing the skills required for incident response. You can also run your own version of [Failure Friday](https://www.pagerduty.com/blog/failure-fridays-four-years/), where you manually inject some failure into your system and treat it as a major incident. +Playing a game of "[Keep Talking and Nobody Explodes](https://www.keeptalkinggame.com/){:target="_blank" }" is a light-hearted way of practicing the skills required for incident response. You can also run your own version of [Failure Friday](https://www.pagerduty.com/blog/failure-fridays-four-years/){:target="_blank" }, where you manually inject some failure into your system and treat it as a major incident. ## Use it for a real incident. diff --git a/docs/oncall/alerting_principles.md b/docs/oncall/alerting_principles.md index 6804ebd..2bf7aba 100644 --- a/docs/oncall/alerting_principles.md +++ b/docs/oncall/alerting_principles.md @@ -67,7 +67,7 @@ We should ensure that alerts contain enough useful context to quickly identify t !!! info "Testing is Critical" An untested alert is equivalent to not having an alert at all. You cannot be sure it will alert you when the time comes. Testing that your alerting actually works is critical to proper service health and should be included in any release planning / deployment efforts. -Make sure to test all new and modified alerts. 
This is usually covered as part of [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) for any new service; however, you should manually test them if you need it more quickly. Some things to test: +Make sure to test all new and modified alerts. This is usually covered as part of [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/){:target="_blank" } for any new service; however, you should manually test them if you need it more quickly. Some things to test: * Test that the threshold is set appropriately. We don't want noisy alerts. * Test that you get alerted for the "No Data" condition if applicable. Generally, receiving no data is the same as breaking your threshold. diff --git a/docs/oncall/being_oncall.md b/docs/oncall/being_oncall.md index 3c0146c..49cb631 100644 --- a/docs/oncall/being_oncall.md +++ b/docs/oncall/being_oncall.md @@ -15,7 +15,7 @@ On-call responsibilities extend beyond normal office hours, and if you are on-ca * Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc). * Have a way to charge your MiFi. * Team alert escalation happens within 5 minutes, set/stagger your notification timeouts (push, SMS, phone, etc.) accordingly. - * Make sure PagerDuty texts and calls can [bypass your "Do Not Disturb" settings](https://support.pagerduty.com/docs/notification-phone-numbers). + * Make sure PagerDuty texts and calls can [bypass your "Do Not Disturb" settings](https://support.pagerduty.com/docs/notification-phone-numbers){:target="_blank" }. * Be prepared (environment is set up, a current working copy of the necessary repositories is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, etc.) * Read our incident response documentation (that's this!) to understand how we handle serious incidents, what the different roles and methods of communication are, etc. 
* Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc. diff --git a/docs/oncall/whos_oncall.md b/docs/oncall/whos_oncall.md index 51bbf10..edf3fb1 100644 --- a/docs/oncall/whos_oncall.md +++ b/docs/oncall/whos_oncall.md @@ -14,7 +14,7 @@ Which engineering teams are involved in which responses varies with a company’ ## Customer Support / Customer Success -Support is the voice of the customer during incident response. A member of the Customer Support team is the default [Customer Liaison](../training/customer_liaison.md) within the response team, updating customers and stakeholders about incident status through Twitter, an internal Slack channel, and other channels as needed. They may also serve as an internal liaison to keep stakeholders within the company up to date. +Support is the voice of the customer during incident response. A member of the Customer Support team is the default [Customer Liaison](../training/customer_liaison.md) within the response team, updating customers and stakeholders about incident status through social media, an internal Slack channel, and other channels as needed. They may also serve as an internal liaison to keep stakeholders within the company up to date. ## Marketing diff --git a/docs/resources/anti_patterns.md b/docs/resources/anti_patterns.md index 9de7ebd..8943a5e 100644 --- a/docs/resources/anti_patterns.md +++ b/docs/resources/anti_patterns.md @@ -18,7 +18,7 @@ As we grew our engineer department, this did not scale well at all, and problems 1. Paging people has a cost impact. Both in employee health, and in finance. Waking up your entire engineering department at 3am means nothing productive is going to be done the next day, across the entire department. 1. People who weren't on-call would still get paged. -It's important to maintain an **effective span of control** on any incident response. 
If you have more than 7 or 8 people directly reporting to the [Incident Commander]() things can quickly get overwhelming. We now will only page the engineers who are on-call for a specific service, rather than the entire team. If more responders are required, then they will be mobilized by the [Internal Liaison]() to join the response. 9 times out of 10 we don't need additional responders, so the rest of the engineering department can get some rest without interference. This results in a happier engineering department and a more streamlined response process. +It's important to maintain an **effective span of control** on any incident response. If you have more than 7 or 8 people directly reporting to the [Incident Commander](/before/different_roles/#incident-commander-ic) things can quickly get overwhelming. We now will only page the engineers who are on-call for a specific service, rather than the entire team. If more responders are required, then they will be mobilized by the [Internal Liaison]() to join the response. 9 times out of 10 we don't need additional responders, so the rest of the engineering department can get some rest without interference. This results in a happier engineering department and a more streamlined response process. ## Forcing everyone to stay on the call. diff --git a/docs/resources/reading.md b/docs/resources/reading.md index 0c94a2f..5e545f3 100644 --- a/docs/resources/reading.md +++ b/docs/resources/reading.md @@ -9,33 +9,31 @@ This is a collection of additional reading on the topic of incident response tha ## Books -* [Incident Management for Operations](https://learning.oreilly.com/library/view/~/9781491917619/) (Rob Schnepp, Ron Vidal, Chris Hawley) -* [Incident Response](https://learning.oreilly.com/library/view/~/0596001304/) (Kenneth R. 
van Wyk, Richard Forno) -* [The Checklist Manifesto](http://atulgawande.com/book/the-checklist-manifesto/) (Atul Gawande) -* [The Field Guide to Understanding Human Error](https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265) (Sidney Dekker) -* [Normal Accidents: Living with High-Risk Technologies](https://www.amazon.com/Normal-Accidents-Living-High-Risk-Technologies/dp/0691004129) (Charles Perrow) -* [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/) (Google) -* [IT Disaster Response: Lessons Learned in the Field](https://www.amazon.com/Disaster-Response-Lessons-Learned-Field/dp/1484221834) (Greg D. Moore) -* [Comparative Emergency Management](https://training.fema.gov/hiedu/aemrc/booksdownload/compemmgmtbookproject/) (David A. McEntire, Ph.D.) +* [Incident Management for Operations](https://learning.oreilly.com/library/view/~/9781491917619/){:target="_blank"} (Rob Schnepp, Ron Vidal, Chris Hawley) +* [Incident Response](https://learning.oreilly.com/library/view/~/0596001304/){:target="_blank"} (Kenneth R. van Wyk, Richard Forno) +* [The Checklist Manifesto](http://atulgawande.com/book/the-checklist-manifesto/){:target="_blank"} (Atul Gawande) +* [The Field Guide to Understanding Human Error](https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265){:target="_blank"} (Sidney Dekker) +* [Normal Accidents: Living with High-Risk Technologies](https://www.amazon.com/Normal-Accidents-Living-High-Risk-Technologies/dp/0691004129){:target="_blank"} (Charles Perrow) +* [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/){:target="_blank"} (Google) +* [IT Disaster Response: Lessons Learned in the Field](https://www.amazon.com/Disaster-Response-Lessons-Learned-Field/dp/1484221834){:target="_blank"} (Greg D. 
Moore) ## Documents -* [Debriefing Facilitation Guide](https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf) (Etsy) +* [Debriefing Facilitation Guide](https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf){:target="_blank"} (Etsy) ## Talks -* [Every Minute Counts: Leading Heroku's Incident Response](https://www.heavybit.com/library/video/every-minute-counts-coordinating-herokus-incident-response/) (Blake Gentry) -* [Three Analytical Traps in Accident Investigation](https://www.youtube.com/watch?v=TqaFT-0cY7U) (Dr. Johan Bergström) +* [Every Minute Counts: Leading Heroku's Incident Response](https://www.heavybit.com/library/video/every-minute-counts-coordinating-herokus-incident-response/){:target="_blank"} (Blake Gentry) +* [Three Analytical Traps in Accident Investigation](https://www.youtube.com/watch?v=TqaFT-0cY7U){:target="_blank"} (Dr. Johan Bergström) ## Official Resources -* [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system) (FEMA) -* [UK Government Fire and Rescue Manual - Incident Command](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/7643/incidentcommand.pdf) (UK.GOV) -* [New Zealand Coordinated Incident Management System (CIMS)](https://www.civildefence.govt.nz/resources/coordinated-incident-management-system-cims-third-edition/) (NZCDEM) -* [The Australasian Inter-Service Incident Management System (AIIMS)](https://training.fema.gov/hiedu/docs/cem/comparative%20em%20-%20session%2021%20-%20handout%2021-1%20aiims%20manual.pdf) (AFAC) -* [Academic Emergency Management and Related Courses](https://training.fema.gov/hiedu/aemrc/) (FEMA) +* [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system){:target="_blank"} (FEMA) +* [UK Government Fire and Rescue Manual - Incident 
Command](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/7643/incidentcommand.pdf){:target="_blank"} (UK.GOV) +* [New Zealand Coordinated Incident Management System (CIMS)](https://www.civildefence.govt.nz/resources/coordinated-incident-management-system-cims-third-edition/){:target="_blank"} (NZCDEM) + ## Other Useful Resources -* [Informed's NIMS Incident Command System Field Guide](https://www.amazon.com/gp/product/1284038408) (Michael J. Ward) -* [Linkedin's School Of SRE](https://linkedin.github.io/school-of-sre/) (LinkedIn) +* [Informed's NIMS Incident Command System Field Guide](https://www.amazon.com/gp/product/1284038408){:target="_blank"} (Michael J. Ward) +* [Linkedin's School Of SRE](https://linkedin.github.io/school-of-sre/){:target="_blank"} (LinkedIn) diff --git a/docs/training/courses/incident_response.md b/docs/training/courses/incident_response.md index b7f2a89..07b58d4 100644 --- a/docs/training/courses/incident_response.md +++ b/docs/training/courses/incident_response.md @@ -51,7 +51,7 @@ We want less of the latter, and more of the former. We want to replace chaos wit ### What is Incident Response? -_004. What is incident response? [Docs Reference](../../before/what_is_an_incident/#what-is-incident-response)_ +_004. What is incident response? [Docs Reference](../../before/what_is_an_incident.md/#what-is-incident-response)_ So when we talk about incident response, what we're really talking about is an organized approach to addressing and managing an incident. This is how we define incident response at PagerDuty. They key here is on the word _organized_. We don't want to be running around in a panic anytime an alert goes off. We want our response to be almost routine, and for everyone to work together like a well-oiled machine. 
@@ -216,7 +216,7 @@ When developing our process at PagerDuty, we looked at a few of the other system ???+ aside hide-arrow "Emergency Management Around the World" If you're interested in learning more about the systems in use by other countries, we have [links to some official resources](../../resources/reading.md#official-resources). - There's also a book available from the US FEMA website, called "[Comparative Emergency Management: Understanding Disaster Policies, Organizations, and Initiatives from Around the World](https://training.fema.gov/hiedu/aemrc/booksdownload/compemmgmtbookproject/)" where it compares the systems used by about 30 different countries. + There's also a book available from the US FEMA website, called "Comparative Emergency Management: Understanding Disaster Policies, Organizations, and Initiatives from Around the World" where it compares the systems used by about 30 different countries. --- @@ -237,7 +237,7 @@ While we don’t use exactly the same roles as ICS, we picked out the ones that Together, these roles are called the Command Staff. -* Next we have the **Customer Liaison**. This is a member of our support team, and their job is to handle the two-way interaction with our customers. So they'll update customers as to what is going on, whether that's via email, tweet, or updating our status page. But they'll also let us know what customers are saying too. If we're getting 100's of support requests, or no one has raised a ticket at all. Since this can be useful information in tracking down a cause, and determining the level of risk we can take during our recovery. +* Next we have the **Customer Liaison**. This is a member of our support team, and their job is to handle the two-way interaction with our customers. So they'll update customers as to what is going on, whether that's via email, post, or updating our status page. But they'll also let us know what customers are saying too. 
If we're getting 100's of support requests, or no one has raised a ticket at all. Since this can be useful information in tracking down a cause, and determining the level of risk we can take during our recovery. * The **Internal Liaison** is a relatively new role in our process. Their job is to handle all the interaction with internal teams, such as our executives, or our marketing teams, and so on. We have a separate Slack channel for incident updates, to which the Internal Liaison will post regular status updates, and answer any questions from the rest of the organization. This keeps those questions out of our main response, but allows people to still get answers. The internal liaison will also page other teams as necessary if they're needed on the response. Again, this isn't a role you'll need for most companies, for a while this was also handled by our Deputy/Scribe role. @@ -446,7 +446,7 @@ Once we have a collection of actions and their associated risks, it's time to ma There’s no golden rule here I can give you, it’ll be up to context and your company culture. But my advice if you can't decide between two options is to literally flip a coin. **Making the wrong decision is better than making no decision.** Making no decision doesn't help to make forward progress, you learn nothing new and the incident is still going on. Making a decision, even if it's the "wrong" one will give you more information. If it turns out to be wrong, you can then put all your resources into the other option. -A wrong decision gives you more useful information, making no decision gives you nothing. You want to avoid [decision paralysis](https://xkcd.com/1801/) at all costs, as it can prolong your incident further. +A wrong decision gives you more useful information, making no decision gives you nothing. You want to avoid [decision paralysis](https://xkcd.com/1801/){:target="_blank"} at all costs, as it can prolong your incident further. 
???+ aside hide-arrow "'Do nothing' is an acceptable decision." I should note that the above advice is intended for the situation when you can't decide between two options. "Do nothing" is a perfectly acceptable decision if that's the course of action you want to take. It is sometimes appropriate to get more information by waiting and seeing what changes. @@ -458,7 +458,7 @@ A wrong decision gives you more useful information, making no decision gives you _034. Gain consensus._ -Once we've made a decision, we need to gain consensus for our plan. But wait, why? Didn't I say earlier that the IC is basically a dictator and everyone should follow their instructions? While technically true, we want to be sure we give a chance to listen to any potential problems our experts may have with the plan. We don't want people to come back later and say things like "I knew that wouldn't work". We want to make sure we stop the [hindsight 20/20 problem](https://en.wikipedia.org/wiki/Hindsight_bias). It demotivates responders, and wastes time. +Once we've made a decision, we need to gain consensus for our plan. But wait, why? Didn't I say earlier that the IC is basically a dictator and everyone should follow their instructions? While technically true, we want to be sure we give a chance to listen to any potential problems our experts may have with the plan. We don't want people to come back later and say things like "I knew that wouldn't work". We want to make sure we stop the [hindsight 20/20 problem](https://en.wikipedia.org/wiki/Hindsight_bias){:target="_blank"}. It demotivates responders, and wastes time. But gaining consensus amongst a large group of people can be a bit difficult. @@ -526,7 +526,7 @@ It's likely because of how I phrased the question. _039. Bystander effect._ -I said "Can someone...". This is called the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect). Everyone assumed someone else was doing it, so no one ended up doing it. 
If by some chance, someone actually did do it, you won't know who it is anyway, or if they've even started. +I said "Can someone...". This is called the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect){:target="_blank"}. Everyone assumed someone else was doing it, so no one ended up doing it. If by some chance, someone actually did do it, you won't know who it is anyway, or if they've even started. A good example of this is if there's a medical emergency, and you shout "Somebody call 911!", you'll find that no one does, because everyone assumes someone else is doing it. If you're ever in that situation, you want to point to someone and say "You, call 911". Then it'll get done. @@ -1034,9 +1034,9 @@ Finally, you want to practice your incident response process as much as you can. Start by running mock incidents. Then treat your smaller incidents as if they're larger ones. If you trigger incident response and find it's not a real incident, treat it like one anyway since it's free practice. -At PagerDuty, we run something called [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) where we purposefully inject failure into our systems to test their resilience. We treat this like a major incident, with an incident commander and everything. It allows us to practice while we're not under the stress of a normal incident. +At PagerDuty, we run something called [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/){:target="_blank"} where we purposefully inject failure into our systems to test their resilience. We treat this like a major incident, with an incident commander and everything. It allows us to practice while we're not under the stress of a normal incident. -We also play a game called [Keep Talking And Nobody Explodes](https://www.keeptalkinggame.com). Yes, that's right, we play video games at work. 
But we've found that this game really helps to simulate a lot of the things an incident commander has to deal with, and is a great way to get some stress free practice. +We also play a game called [Keep Talking And Nobody Explodes](https://www.keeptalkinggame.com){:target="_blank"}. Yes, that's right, we play video games at work. But we've found that this game really helps to simulate a lot of the things an incident commander has to deal with, and is a great way to get some stress free practice. The bottom line is to practice as much as you can, so that when you do have the inevitable incident, your response is just routine. @@ -1049,7 +1049,7 @@ _080. Our open-source incident response documentation._ This was just a brief taste of the training we run at PagerDuty for our own Incident Commanders. We had nowhere near enough time to cover everything. -Good news though! We have published our entire incident response process online. It is an exact copy of our internal documentation only with things like phone numbers removed. It's complete free to use, and is open-sourced under an Apache 2 license so you can use it in your own organizations. [It's on GitHub](https://github.com/PagerDuty/incident-response-docs) and we do accept pull requests if you spot any mistakes or have improvement suggestions. +Good news though! We have published our entire incident response process online. It is an exact copy of our internal documentation only with things like phone numbers removed. It's completely free to use, and is open-sourced under an Apache 2 license so you can use it in your own organizations. [It's on GitHub](https://github.com/PagerDuty/incident-response-docs){:target="_blank"} and we do accept pull requests if you spot any mistakes or have improvement suggestions. Everything I've talked about today can be found in the documentation, and there's lots of great [additional reading material](../../resources/reading.md) if you want to learn more. 
@@ -1074,8 +1074,7 @@ Incident command training is useful in so many situations outside of a server ex Anyway, with that, I'll leave you with a quick summary of the main things we discussed today. Thanks! ???+ aside hide-arrow "Questions?" - If you have questions about this training material, feel free to ask me on Twitter, I'm [@r_adams](https://twitter.com/r_adams). - + If you have questions about this training material, feel free to reach out to the [PagerDuty Advocates team](mailto:advocates@pagerduty.com). --- ### Image Credits diff --git a/docs/training/customer_liaison.md b/docs/training/customer_liaison.md index e34874e..afafc2f 100644 --- a/docs/training/customer_liaison.md +++ b/docs/training/customer_liaison.md @@ -24,7 +24,7 @@ Read up on our [Different Roles for Incidents](../before/different_roles.md) to There is no formal training process for this role, you should feel free to contact our Customer Support team to learn more. ## Customer Liaison -The objective of a Customer Liaison is to keep our customers informed during an incident as to what is happening, and to act as a voice for our customers to the Incident Commander. It is important for customers to have visibility into how they are impacted by an incident we are having, and to have insight into the fact that the problem is actively being worked on. Crafting a public message for customers is tricky, especially on platforms such as Twitter where the number of characters you can use are limited. But here are some general tips for crafting a public message, +The objective of a Customer Liaison is to keep our customers informed during an incident as to what is happening, and to act as a voice for our customers to the Incident Commander. It is important for customers to have visibility into how they are impacted by an incident we are having, and to have insight into the fact that the problem is actively being worked on. 
Crafting a public message for customers is tricky, especially on platforms where the number of characters you can use is limited. But here are some general tips for crafting a public message, * Prepare a default message in advance. * One that can be used for the initial update if the scope of the issue is unknown. @@ -45,7 +45,7 @@ The objective of a Customer Liaison is to keep our customers informed during an * Customers don't care if `application-server-123` is having issues, they care that they are not getting notifications. Make sure the information you provide is relevant and not just noise. ## Incident Call Procedures and Lingo -The [Steps for Customer Liaison](../during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. +The [Steps for Customer Liaison](../during/during_an_incident.md/#steps-for-customer-liaison) provide a detailed description of what you should be doing during an incident. Here are some examples of phrases and patterns you should use during incident calls. diff --git a/docs/training/deputy.md b/docs/training/deputy.md index 5f67d59..3783c08 100644 --- a/docs/training/deputy.md +++ b/docs/training/deputy.md @@ -28,12 +28,12 @@ The training process for a Deputy is quite simple. * Read this page. ## Incident Call Procedures and Lingo -The [Steps for Deputy](../during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. +The [Steps for Deputy](../during/during_an_incident.md/#steps-for-deputy) provide a detailed description of what you should be doing during an incident. Here are some examples of phrases and patterns you should use during incident calls. ### Alert IC to Timers -You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. 
This is as simple as telling the IC on the call, +You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to post an update. This is as simple as telling the IC on the call, > IC, be advised the incident is now at the 10 minute mark. diff --git a/docs/training/glossary.md b/docs/training/glossary.md index 695533a..dc71c0a 100644 --- a/docs/training/glossary.md +++ b/docs/training/glossary.md @@ -5,19 +5,19 @@ description: Ever wonder what all of those strange words you sometimes see in ou Ever wonder what all of those strange words you sometimes see in our documentation mean? This page is here to help. ### Incident Commander / IC -The Incident Commander is the person responsible for bringing any major incident to resolution. They are the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as Incident Commander are final. [More info](../before/different_roles.md). +The Incident Commander is the person responsible for bringing any major incident to resolution. They are the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as Incident Commander are final. [More info](../before/different_roles.md/#incident-commander-ic). ### Deputy -Typically the backup IC. The deputy's job is to support the IC during the call, providing them with any help they need. [More info](../before/different_roles.md). +Typically the backup IC. The deputy's job is to support the IC during the call, providing them with any help they need. [More info](../before/different_roles.md/#deputy). ### Scribe -The Scribe's job is to keep a log of all activities performed during the call in a written chat log on Slack. [More info](../before/different_roles.md). 
+The Scribe's job is to keep a log of all activities performed during the call in a written chat log on Slack. [More info](../before/different_roles.md/#scribe). ### Resolver -A person on the incident call who is able to help resolve issues within a particular system. Also referred to as an SME (see below). [More info](../before/different_roles.md). +A person on the incident call who is able to help resolve issues within a particular system. Also referred to as an SME (see below). [More info](../before/different_roles.md/#subject-matter-expert). ### SME -"Subject Matter Expert", someone who is an expert in a particular service or subject who can provide information to the IC, and perform resolution actions for a particular system. [More info](../before/different_roles.md). +"Subject Matter Expert", someone who is an expert in a particular service or subject who can provide information to the IC, and perform resolution actions for a particular system. [More info](../before/different_roles.md/#subject-matter-expert). ### Command Staff The Command Staff consists of the Incident Commander, Deputy, and Scribe. diff --git a/docs/training/incident_commander.md b/docs/training/incident_commander.md index 9dfffa4..2b217a1 100644 --- a/docs/training/incident_commander.md +++ b/docs/training/incident_commander.md @@ -37,12 +37,12 @@ The process is fairly loose for now. Here's a list of things you can do to train * Read the rest of this page, particularly the sections below. -* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF). +* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/){:target="_blank"} (FF). * Shadow a FF to see how it's run. * Be the Scribe for multiple FF's. * Be the Incident Commander for multiple FF's. -* Play a game of "[Keep Talking and Nobody Explodes](https://www.keeptalkinggame.com/)" with other people in the office. 
+* Play a game of "[Keep Talking and Nobody Explodes](https://www.keeptalkinggame.com/){:target="_blank"}" with other people in the office. * For a more realistic experience, play it with someone in a different office over Hangouts. * Shadow a current Incident Commander for at least a full week shift. @@ -86,7 +86,7 @@ _Next step is to stabilize the incident. We need to determine what we can do to * Making the "wrong" decision is better than making no decision. If you have nothing but bad options, pick one and proceed. 1. **Gain consensus. _- Ask "Are there any strong objections?"_** - * Gather support for the plan (See "Polling During a Decision" below). + * Gather support for the plan (See "[Polling During a Decision](/training/incident_commander/#gaining-consensus-polling-during-a-decision)" below). * Listen for objections. * Be prepared to adjust your plan if new information is presented. @@ -174,7 +174,7 @@ When you need to give out an assignment or task, you should follow these three s 1. Confirm that the responder has acknowledged and understood the instructions. !!!warning "Can someone..." - Never say "Can someone..." as this leads to the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect). Tasks should always be assigned directly to an individual, and never just thrown out with the hope that someone will pick it up. + Never say "Can someone..." as this leads to the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect){:target="_blank"}. Tasks should always be assigned directly to an individual, and never just thrown out with the hope that someone will pick it up. > IC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes. @@ -200,7 +200,7 @@ It's important to maintain a cadence during a major incident call. Whenever ther > While we wait for [X], here's an update of our current situation. -> We are currently in a SEV-1 situation, we believe to be caused by [X]. 
There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time. +> We are currently in a SEV-1 situation, which we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have posted that we are experiencing issues. Our next post will be in 10 minutes if the incident is still ongoing at that time. > Are there any additional actions or proposals from anyone else at this time? diff --git a/docs/training/internal_liaison.md b/docs/training/internal_liaison.md index 4553ba3..77dc2a8 100644 --- a/docs/training/internal_liaison.md +++ b/docs/training/internal_liaison.md @@ -23,7 +23,7 @@ Read up on our [Different Roles for Incidents](../before/different_roles.md) to There is no formal training process for this role, reading this page should be sufficient for most tasks. ## Incident Call Procedures and Lingo -The [Steps for Internal Liaison](../during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. +The [Steps for Internal Liaison](../during/during_an_incident.md/#steps-for-internal-liaison) provide a detailed description of what you should be doing during an incident. Here are some examples of phrases and patterns you should use during incident calls. diff --git a/docs/training/overview.md b/docs/training/overview.md index d2af250..6260b0f 100644 --- a/docs/training/overview.md +++ b/docs/training/overview.md @@ -21,12 +21,12 @@ We've also published slides and videos of some of our training courses. Original * [Incident Response Training Course](../training/courses/incident_response.md) - Introductory course on incident response and the role of the Incident Commander. 
## Example Incident -This recorded call is a reenactment of an actual major incident that occurred at PagerDuty in [January 2017](https://status.pagerduty.com/incidents/510k1bnvwv6g). Some details have been changed in the interest of brevity and privacy, but the incident remains otherwise largely intact. For more details about the recording, you can read the [PagerDuty blog post](https://www.pagerduty.com/blog/incident-response-reenactment/). +This recorded call is a reenactment of an actual major incident that occurred at PagerDuty in January 2017. Some details have been changed in the interest of brevity and privacy, but the incident remains otherwise largely intact. For more details about the recording, you can read the [PagerDuty blog post](https://www.pagerduty.com/blog/incident-response-reenactment/){:target="_blank" }. ## National Incident Management System (NIMS) -Our incident response process is loosely based on the [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system), which is described as, +Our incident response process is loosely based on the [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system){:target="_blank" }, which is described as, _A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards—regardless of cause, size, location, or complexity—in order to reduce loss of life, property, and harm to the environment._ @@ -34,53 +34,53 @@ While it might not initially seem that this would be applicable to an IT operati [![NIMS](../assets/img/thumbnails/nims_core.png)](https://www.fema.gov/pdf/emergency/nims/NIMS_core.pdf) [![NIMS Training](../assets/img/thumbnails/nims_training.png)](https://www.fema.gov/pdf/emergency/nims/nims_training_program.pdf) -If you want to learn more about 
NIMS, we recommend the [ICS-100](https://training.fema.gov/is/courseoverview.aspx?code=IS-100.b) and [ICS-700](https://training.fema.gov/is/courseoverview.aspx?code=IS-700.a) online training courses, which go over NIMS and the Incident Command System (You can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of [additional training material and courses from FEMA](https://training.fema.gov/nims/) on NIMS, which I would encourage you to look at. +If you want to learn more about NIMS, we recommend the [ICS-100](https://training.fema.gov/is/courseoverview.aspx?code=IS-100.b){:target="_blank" } and [ICS-700](https://training.fema.gov/is/courseoverview.aspx?code=IS-700.a){:target="_blank" } online training courses, which go over NIMS and the Incident Command System (You can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of [additional training material and courses from FEMA](https://training.fema.gov/nims/){:target="_blank" } on NIMS, which I would encourage you to look at. -If you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local [CERT programs](https://www.ready.gov/cert) (Community Emergency Response Teams). Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. Not only is it an opportunity to get real world experience with disaster response, but the skills you learn can be applied to everyday life too. +If you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local [CERT programs](https://www.ready.gov/cert){:target="_blank" } (Community Emergency Response Teams). Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. 
Not only is it an opportunity to get real world experience with disaster response, but the skills you learn can be applied to everyday life too. Also take a look at the [Additional Reading](../resources/reading.md) page. ## Incident Response Around the World + While NIMS is the US incident response framework, many countries have their own similar frameworks. Some are based on the US system, but many were developed on their own. There's a wealth of additional information to be learned by investigating the methods and frameworks used in countries all over the world. -A book called "[Comparative Emergency Management: Understanding Disaster Policies, Organizations, and Initiatives from Around the World](https://training.fema.gov/hiedu/aemrc/booksdownload/compemmgmtbookproject/)" (available from the [FEMA website](https://training.fema.gov/hiedu/aemrc/)) compares the systems used by 30 or so different countries, and is an amazing collection of information on emergency management frameworks used around the world. +A book called "Comparative Emergency Management: Understanding Disaster Policies, Organizations, and Initiatives from Around the World" compares the systems used by 30 or so different countries, and is an amazing collection of information on emergency management frameworks used around the world. Here are a few of the systems we looked at in more detail in order to adapt and improve our own process at PagerDuty. ### United Kingdom -The United Kingdom emergency services use a command hierarchy called [**Gold-Silver-Bronze Command Structure**](https://en.wikipedia.org/wiki/Gold%E2%80%93silver%E2%80%93bronze_command_structure) for their major operations. The framework involves three levels responsible for strategic (gold), tactical (silver), and operational (bronze) command decisions. 
+The United Kingdom emergency services use a command hierarchy called [**Gold-Silver-Bronze Command Structure**](https://en.wikipedia.org/wiki/Gold%E2%80%93silver%E2%80%93bronze_command_structure){:target="_blank" } for their major operations. The framework involves three levels responsible for strategic (gold), tactical (silver), and operational (bronze) command decisions. Here are some useful reading materials if you're interested in learning more: -* [UK.GOV - Emergency Response and Recovery](https://www.gov.uk/guidance/emergency-response-and-recovery). -* [UK.GOV - Incident Command - 3rd Edition (2008)](https://www.gov.uk/government/publications/fire-and-rescue-manual-volume-1-incident-command). -* [UK Home Office - Critical Incident Management](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/735103/critical-incident-management-v12.0ext.pdf) (PDF). +* [UK.GOV - Emergency Response and Recovery](https://www.gov.uk/guidance/emergency-response-and-recovery){:target="_blank" }. +* [UK.GOV - Incident Command - 3rd Edition (2008)](https://www.gov.uk/government/publications/fire-and-rescue-manual-volume-1-incident-command){:target="_blank"}. +* [UK Home Office - Critical Incident Management](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/735103/critical-incident-management-v12.0ext.pdf){:target="_blank" } (PDF). ### New Zealand -New Zealand's system is called the [**Coordinated Incident Management System (CIMS)**](https://en.wikipedia.org/wiki/Coordinated_Incident_Management_System) and is based upon the Incident Command System used in the US. One area we particularly liked from CIMS is its focus on common terminology, which helps prevents confusion during an incident and allows for a faster and more effective response. Some terminology has been changed from ICS (e.g. "Control" instead of "Command" to describe the management functions), but should still be familiar. 
+New Zealand's system is called the [**Coordinated Incident Management System (CIMS)**](https://en.wikipedia.org/wiki/Coordinated_Incident_Management_System){:target="_blank" } and is based upon the Incident Command System used in the US. One area we particularly liked from CIMS is its focus on common terminology, which helps prevent confusion during an incident and allows for a faster and more effective response. Some terminology has been changed from ICS (e.g. "Control" instead of "Command" to describe the management functions), but should still be familiar. Here are some useful reading materials if you're interested in learning more: -* [Ministry of Civil Defence & Emergency Management - New Zealand Coordinated Incident Management System (CIMS)](https://www.civildefence.govt.nz/resources/coordinated-incident-management-system-cims-third-edition/) ([PDF](https://www.civildefence.govt.nz/assets/Uploads/CIMS-3rd-edition-FINAL-Aug-2019.pdf)). -* [Devereux-Blum Training & Development - Emergency Management Training](https://www.emergencymanagement.co.nz/) +* [Ministry of Civil Defence & Emergency Management - New Zealand Coordinated Incident Management System (CIMS)](https://www.civildefence.govt.nz/resources/coordinated-incident-management-system-cims-third-edition/){:target="_blank" } +* [Devereux-Blum Training & Development - Emergency Management Training](https://www.emergencymanagement.co.nz/){:target="_blank"} + ### Australia -Australia uses a system called the [**Australasian Inter-Service Incident Management System (AIIMS)**](https://en.wikipedia.org/wiki/Australasian_Inter-Service_Incident_Management_System) which is a derivative of the NIMS framework used in the US. While based on ICS, AIIMS puts a bigger focus on _span of control_ than other frameworks. As with New Zealand's system, there are some differences to the terminology being used (e.g. "Incident Controller" instead of "Incident Commander"), but it should still be familiar to those who know ICS. 
+Australia uses a system called the [**Australasian Inter-Service Incident Management System (AIIMS)**](https://en.wikipedia.org/wiki/Australasian_Inter-Service_Incident_Management_System){:target="_blank" } which is a derivative of the NIMS framework used in the US. While based on ICS, AIIMS puts a bigger focus on span of control than other frameworks. As with New Zealand's system, there are some differences to the terminology being used (e.g. "Incident Controller" instead of "Incident Commander"), but it should still be familiar to those who know ICS. Here are some useful reading materials if you're interested in learning more: -* [The Australasian Inter-Service Incident Management System, 3rd Edition](https://training.fema.gov/hiedu/docs/cem/comparative%20em%20-%20session%2021%20-%20handout%2021-1%20aiims%20manual.pdf) (PDF). -* [Incident Management in Australia Handbook](https://knowledge.aidr.org.au/resources/handbook-14-incident-management-in-australia/) +* [Incident Management in Australia Handbook](https://knowledge.aidr.org.au/resources/handbook-14-incident-management-in-australia/){:target="_blank"} ### Canada -Canada uses their own [**Incident Command System**](https://www.icscanada.ca/images/upload/ICS%20OPS%20Description2012.pdf) (PDF). The standard for which is maintained by a network of organizations called [ICS Canada](https://www.icscanada.ca/en/home.html). Their website has a collection of information on how you can find local training courses in Canada, depending on your Province. +Canada uses their own [**Incident Command System**](https://icscanada.ca/wp-content/uploads/2023/11/2021-Part-1-TOX-Tips-Introduction-to-ICS.pdf){:target="_blank"} (PDF). The standard for which is maintained by a network of organizations called [ICS Canada](https://www.icscanada.ca/en/home.html){:target="_blank"}. Their website has a collection of information on how you can find local training courses in Canada, depending on your Province. 
Here are some useful reading materials if you're interested in learning more: -* [ICSCanada - I-100 Introduction to Incident Command System](https://www.svffa.ca/s/ICS100-Self-Paced-Student-Workbook_2016.pdf) (PDF). -* [Canada ICS Forms](https://www.icscanada.ca/en/Forms.html) - _Standard ICS forms that you can download and use in your own incidents ([FEMA has the US equivalents](https://training.fema.gov/icsresource/icsforms.aspx))._ +* [Canada ICS Forms](https://icscanada.ca/resources/ics-forms/){:target="_blank"} - Standard ICS forms that you can download and use in your own incidents ([FEMA has the US equivalents](https://training.fema.gov/icsresource/icsforms.aspx){:target="_blank"} ). diff --git a/docs/training/scribe.md b/docs/training/scribe.md index f652423..ecacdd2 100644 --- a/docs/training/scribe.md +++ b/docs/training/scribe.md @@ -25,7 +25,7 @@ There is no formal training process for this role, reading this page should be s * Read the rest of this page, particularly the sections below. -* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF). +* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/){:target="_blank"} (FF). * Shadow a FF to see how it's run. * Be the Scribe for multiple FF's. @@ -40,7 +40,7 @@ Scribing is more art than science. The objective is to keep an accurate record o * This is "TODO: Why didn't we get paged for this earlier?" ## Incident Call Procedures and Lingo -The [Steps for Scribe](../during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. +The [Steps for Scribe](../during/during_an_incident.md/#steps-for-scribe) provide a detailed description of what you should be doing during an incident. Here are some examples of phrases and patterns you should use during incident calls. 
diff --git a/docs/training/subject_matter_expert.md b/docs/training/subject_matter_expert.md index b2f3c03..4606686 100644 --- a/docs/training/subject_matter_expert.md +++ b/docs/training/subject_matter_expert.md @@ -14,7 +14,7 @@ If you are on-call for your team, there are certain expectations of you as that 1. [Incident Call Etiquette](../before/call_etiquette.md) - How to behave during an incident call. 1. [During an Incident](../during/during_an_incident.md) - What to do during an incident. You are specifically interested in the "Resolver" steps, but you should familiarize yourself with the entire document. 1. [Glossary](../training/glossary.md) - Familiarize yourself with the terminology that may be used during the call. -1. Make sure you have set up your alerting methods, and that PagerDuty can [bypass your "Do Not Disturb" settings](https://support.pagerduty.com/docs/notification-phone-numbers). +1. Make sure you have set up your alerting methods, and that PagerDuty can [bypass your "Do Not Disturb" settings](https://support.pagerduty.com/docs/notification-phone-numbers){:target="_blank"}. 1. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged. 1. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc. 1. If you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander.