
Incident Response


VRO is a moderately complex system that integrates with other systems and is hosted on a shared infrastructure. Incidents of unplanned service failures and disruptions are inevitable. The first objective of the incident response process is to restore a normal service operation as quickly as possible and minimize the incident's impact. This document describes the core steps for responding to incidents.

Definition

We define an incident as an unplanned event or occurrence that disrupts normal operations, services, or functions on the VRO platform. An incident negatively impacts the availability, performance, security, or functionality of a VRO service and requires immediate attention to mitigate its effects and restore normal operation. Incidents can vary widely in scope and severity and can be caused by factors within or outside the VRO team's control.

Incident types:

• Service Outages: Complete or partial unavailability of infrastructure services.
• Performance Degradation: Noticeable slowdown or inefficiency in infrastructure services.
• Security Breaches: Unauthorized access, data breaches, or vulnerabilities affecting infrastructure integrity, confidentiality, or availability.
• Operational Failures: Failures in deployment pipelines, configuration management, or automated processes impacting normal operations.
• Resource Exhaustion: Over-utilization or exhaustion of resources leading to degraded service.
• Unexpected Behavior: Anomalies or unexpected behaviors in infrastructure services affecting development, testing, or deployment activities.

Root Cause

The root cause of an incident is investigated and attributed to the team responsible for the issue's origin - not necessarily the team addressing it - to ensure accurate reporting and unbiased resolution efforts. The GitHub issue for an incident should be annotated with the label corresponding to the root cause (label prefix: RC), and the root cause should also be documented in the writeup on Incident Reports. While an incident is being resolved, the root cause is not included in updates or communications with partner teams, to avoid bias and ensure impartiality.

Root cause types:

• VRO: An issue on the VRO platform that ties directly to our team's scope of responsibilities. These incidents are tracked to measure our MTTR (Mean Time to Resolve) metric.
• Partner Team Application or External VA: An issue with an application controlled by a partner team, or with an external VA system that is not functioning appropriately.
• LHDI: An issue on the LHDI platform. These are not within the VRO team's control, but the VRO team reports them to LHDI and works to resolve them in partnership with LHDI.

Responsibilities

By default, incident response is the responsibility of the VRO primary on-call engineer, a role that rotates every VRO sprint. Throughout the process, they may conduct each step personally or delegate tasks as needed; regardless, a single individual should be identified as being in charge of the incident response. If this responsibility needs to be transferred while an incident is active, the handoff should be explicitly communicated.

Tip

Be mindful of a channel's audience when posting updates in Slack. For general updates that will be seen by VRO partner teams, use #benefits-vro-support. For details that are relevant only to the VRO team, use #benefits-vro-on-call.

Slack channel #benefits-vro-support is designated for messaging about active incidents, to foster shared situational awareness among partner teams, the VRO team, and stakeholders. During an incident, the individual leading the incident response (or a delegate) should post status updates at regular intervals and when key developments occur. Sample language is provided in Triage. Strive for messaging that is succinct and specific, and avoid jargon or acronyms that non-VRO team members may not be familiar with.

Slack channel #benefits-vro-on-call should be used for more granular details that may not be informative to an audience beyond the VRO team.

A VRO backlog item (#3005) tracks the development of more guidance on handling common on-call situations.

Working hours: 9am - 5pm EST

Process

Incident Response flowchart

Generated using draw.io. Source file: incidentResponse.drawio.txt (remove the .txt extension in order to use in draw.io)

Video demos of the Incident Report Slack workflow: Partner Team View (2:13), VRO Team View (4:35).

Bookmarks
Partner Team View

0:00: Find the Incident Report bookmark in #benefits-vro-support.
0:07: Fill out the form.
0:56: Observe the automated post.
1:08: Observe the acknowledgement from the responding engineer.
1:24: Post additional comments in the thread.
1:44: Receive status updates in the thread.
2:03: Receive notification when the incident is resolved.

VRO Team View

0:01: React with 👀 on the Incident Report Slack post (Step 0).
0:07: Click the Acknowledge button on the PagerDuty post (Step 0).
0:14: Post an initial update (Step 1).
0:20: Post internal notes to #benefits-vro-on-call (Step 2).
0:40: Post general updates to #benefits-vro-support (Step 2).
1:12: Click the Next Step button as tasks are completed.
1:38: Update the GitHub issue and the Incidents epic (Step 3).
2:51: Click the Next Step button as tasks are completed.
3:03: Log the incident on the wiki (Step 5).
4:10: Close the GitHub issue (Step 5).
4:20: Click the Next Step button as tasks are completed.
4:26: React with :support-complete: on the Incident Report Slack post (Step 5).

Catalyst

The intake process for an incident is through the Incident Report Slack Workflow, bookmarked in channel #benefits-vro-support (screenshot). This Slack workflow intake process applies to incidents discovered internally (for example, the on-call engineer detecting an issue) and externally (for example, reported or escalated by a partner team or third party).

Step 0: Acknowledge


Description: Upon learning of a potential incident, acknowledge the report and begin investigating.

Purpose: reduce the likelihood of uncoordinated troubleshooting efforts, reduce panic, and establish consistent data points for calculating incident metrics

Estimated time to complete: 2 minutes

SLA: within 60 minutes of the report

Tasks are dependent on the source:

  • Source: Incident Report post in #benefits-vro-support

    Tasks:

    1. React with 👀 on the Incident Report post in #benefits-vro-support.
    2. Click the Acknowledge button on the PagerDuty post in #benefits-vro-on-call.
  • Source: Slack post other than the Incident Report

    Tasks:

    1. React with 👀 to the post.
    2. Submit an Incident Report using the Slack workflow.
    3. React with 👀 on the Incident Report post in #benefits-vro-support.

    If the source is a person and they are not in #benefits-vro-support, direct them to the channel.

  • Source: Email message from a known partner or stakeholder

    Tasks:

    1. Email a response: "Investigating. If you have access to the VA OCTO Slack, I will be tracking this in #benefits-vro-support."
    2. Submit an Incident Report using the Slack workflow.
    3. React with 👀 on the Incident Report post in #benefits-vro-support.
  • Source: Email message from an unknown party

    Task: Consult the VRO team and OCTO Enablement Team.

Step 1: Triage


Description: Conduct a brief assessment. Determine an initial severity level (SEV) and affected systems. Share initial assessment and intended next steps.

Purpose: gain situational awareness; respond with appropriate urgency

Estimated time to complete: 10 minutes

SLA: initial assessment within 30 minutes of Acknowledgement

Tasks:

  1. Post an initial update as a threaded reply to the automated Incident Report message in the #benefits-vro-support channel. Use a message appropriate for the assessed severity level (see language below).

Initial update by severity level:

SEV 1: Critical

Description: Core functionality is unavailable or buggy

Examples: a VRO app appears offline; a VRO app's transactions are failing; a VRO data service appears unresponsive; inaccurate data is transmitted

Priority: Immediate investigation

Initial Message:

We will provide detailed updates on our findings and actions taken every ~30 minutes until the issue is resolved.

Frequency: every 30 min

SEV 2: High

Description: Core functionality is degraded

Examples: increased latency; increased retry attempts

Priority: Immediate investigation

Initial Message:

We will provide detailed updates on our findings and actions taken every ~30 - 60 minutes until the issue is resolved.

Frequency: every 30-60 minutes

SEV 3: Medium

Description: Unexpected metrics related to core functionality, although without noticeable performance degradation

Examples: sustained increase in CPU utilization; sustained increase in open database connections

Priority: continued passive monitoring; investigate the cause within the next business day

Initial Message:

We will conduct a thorough investigation within the next business day and provide daily updates as more information becomes available.

Frequency: Daily

SEV 4: Low

Description: Non-core functionality is affected

Examples: gaps or increased latency in transmitting data to an analytics platform

Priority: immediate investigation, limited to identifying root cause

Initial Message:

We will investigate the root cause at our earliest convenience and provide updates as necessary.

Frequency: As needed

Template for subsequent update messages:

- Current status and progress made so far (including the root cause of the issue if identified)
- Specific actions taken since the last update
- Any changes to the estimated resolution time
- Next update time
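
For illustration, a filled-in update following this template might read as follows (the details are hypothetical):

"Root cause identified: a spike in timeouts from an upstream dependency beginning around 10:05 ET. Since the last update we restarted the affected pods and confirmed new transactions are succeeding. Estimated resolution time is unchanged (~11:00 ET). Next update by 10:45 ET."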

Tip: Use Slack's /remind feature to alert you when the next update is due. For example: /remind me in 30 minutes to post an update.

Step 2: Contain/Stabilize


Description: Work to prevent further damage. Share regular status updates.

Purpose: containing the situation might provide more immediate relief than implementing a remediation

Estimated time to complete: varies

SLA: SEV 1, SEV 2: top priority for the on-call engineer; SEV 3, SEV 4: as soon as VRO can prioritize

Considerations: Is there a configuration change to prevent requests to the buggy system? Would an increase in compute resources temporarily stabilize the system?
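
If the affected service runs as a Kubernetes deployment on the LHDI cluster, a containment sketch might look like the following. The namespace and deployment names are placeholders, and whether scaling or restarting is appropriate depends on the specific incident; treat this as a starting point rather than a prescribed fix.

```
# Check the health of the pods backing the affected service
kubectl get pods -n <namespace>

# Inspect events and recent state for a failing pod
kubectl describe pod <pod-name> -n <namespace>

# Temporarily add capacity if resource pressure appears to be the cause
kubectl scale deployment <deployment-name> --replicas=3 -n <namespace>
```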

Tasks:

  1. Post internal notes to #benefits-vro-on-call.
  2. Post general updates to the Incident Report thread in #benefits-vro-support within the frequency defined for the respective Severity.
  3. Click the Next Step button on the Incident Response workflow.

Step 3: Remediate (short-term)

Description: Get the system back to a minimally acceptable operating status.

Purpose: Reduce the likelihood that the incident will recur in the short term; increase the likelihood that the system will be stable.

Estimated time to complete: varies

SLA: SEV 1, SEV 2: top priority for the on-call engineer; SEV 3, SEV 4: as soon as VRO can prioritize

Considerations: Should compute resources be recalibrated? Would a rollback or roll-forward of code/configuration be appropriate and feasible?
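
If a recent deployment is the suspected cause and a rollback is judged safe, the Kubernetes-level steps might look like the sketch below. Names are placeholders; prefer the team's normal deployment pipeline when it is available.

```
# Review revision history to identify the last known-good revision
kubectl rollout history deployment/<deployment-name> -n <namespace>

# Roll back to the previous revision (or use --to-revision=<n> for a specific one)
kubectl rollout undo deployment/<deployment-name> -n <namespace>

# Watch the rollout until the pods report ready
kubectl rollout status deployment/<deployment-name> -n <namespace>
```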

Tasks:

  1. Post general updates to the Incident Report thread in #benefits-vro-support within the frequency defined for the respective Severity.
  2. Update the GitHub issue that was created by the Incident Report workflow (see the optional CLI sketch after this list).
  • Add blue VRO-team label
  • Add root cause label
    • RC VRO
    • RC LHDI
    • RC Partner Team or External VA
  • Assign to the engineer(s) responding to the incident
  • Include SEV level
  • Add to the current sprint
  • Add to the Incidents Epic
  • Arrange to discuss the incident as a 16th minute item during Daily Scrum
  3. Click the Next Step button on the Incident Response workflow.
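
If working from a terminal, the labeling and assignment tasks above can also be applied with the GitHub CLI; the issue number and username below are placeholders, and the GitHub web UI works just as well.

```
# Add the team label, a root-cause label, and an assignee to the incident issue
gh issue edit <issue-number> --add-label "VRO-team" --add-label "RC VRO" --add-assignee <github-username>
```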

Step 4: Monitor

Description: Look for data points that indicate the incident is under control. As needed, return to Step 2 and Step 3.

Purpose: gain confidence that the incident is under control

Estimated time to complete: at least 30 minutes

SLA: n/a

Tasks:

  1. (as needed) Post internal notes to #benefits-vro-on-call.
  2. Post general updates to the Incident Report thread in #benefits-vro-support within the frequency defined for the respective Severity.
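
What to monitor depends on the affected service, but for a service running on LHDI/Kubernetes the commands below are one way to watch for recovery. Names are placeholders; existing dashboards and alerts are equally valid signals.

```
# Confirm pods stay running without restarting
kubectl get pods -n <namespace> --watch

# Follow application logs for recurring errors
kubectl logs -f deployment/<deployment-name> -n <namespace>

# Spot-check resource usage (requires the cluster's metrics add-on)
kubectl top pods -n <namespace>
```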

Step 5: Log the incident

Description: Document the incident in the VRO wiki's Incident Reports.

Purpose: build a record of incidents that can reveal patterns, inform engineering decisions, and be a general resource

Estimated time to complete: 30 minutes

SLA: within 8 business hours of the incident's resolution

Considerations: Account for these details: how the incident was detected, including timestamp; severity level; corrective measures taken; timestamp of when system returned to operating status; "red herrings" that were encountered; follow-up tasks
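
One possible skeleton for the wiki entry, covering the details listed above (the headings are a suggestion, not a required format):

- Detection: how the incident was detected, with timestamp
- Severity level (SEV)
- Root cause (RC label)
- Corrective measures taken
- Timestamp when the system returned to normal operating status
- Red herrings encountered
- Follow-up tasks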

Tasks:

  1. Create an Incident Report on the wiki page.
  2. Close the GitHub issue that was created by the Incident Report workflow.
  3. React with :support-complete: on the Incident Report post in #benefits-vro-support.

Step 6: Post-incident review

Description: As a more in-depth analysis, assess what happened, what went well, what did not go well, and what measures could prevent a recurrence. Describe troubleshooting measures, including log snippets and command line tools. Share this document with the VRO team and with partner teams. Expected for SEV 1 and SEV 2 incidents; at the team's discretion for SEV 3 and SEV 4 incidents.

Purpose: leverage the incident as a learning opportunity; surface further corrective measures

Estimated time to complete: 4 hours

SLA: within 5 business days of the incident's resolution

Considerations: Follow principles of blameless post-mortems:

focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.

Tasks:

  1. Create a Post-incident review on the private wiki.

Step 7: Discuss longer-term remediation

Description: Determine measures that would reduce the likelihood of this incident recurring and/or give the team better visibility into conditions that led to this incident. Propose these to the Product Manager for consideration.

Purpose: Consider remediation measures that could not be achieved in the short-term response.

Estimated time to complete: varies

SLA: within 2 sprint cycles of the incident's resolution

Tasks:

  1. Document recommended remediation steps.
  2. Share these with the Product Manager.
  3. Be prepared to present the steps during a backlog refinement session.