Added reliability risk
robmoffat committed Dec 27, 2024
1 parent 3bb47aa commit 37b9327
Showing 65 changed files with 5,945 additions and 1,794 deletions.
5 changes: 5 additions & 0 deletions dictionary.txt
@@ -365,3 +365,8 @@ ratchets
laborious
reimplement
devolved
microservices
serviceability
automakers
pinto
uptime
4 changes: 3 additions & 1 deletion docs/practices/Communication-And-Collaboration/Review.md
@@ -23,7 +23,9 @@ practice:
- tag: Agency Risk
reason: "Reviewing work or activity can ensure good behaviour."
- tag: Internal Model Risk
reason: "Reviews and audits can uncover unseen problems in a system"
reason: "Reviews and audits can uncover unseen problems in a system."
- tag: Reliability Risk
reason: "Reviews and audits can be performed to investigate the causes of unreliability in a system."
attendant:
- tag: Schedule Risk
reason: "Reviews can introduce delays in the project timeline."
2 changes: 0 additions & 2 deletions docs/practices/Communication-And-Collaboration/Training.md
@@ -22,8 +22,6 @@ practice:
attendant:
- tag: Schedule Risk
reason: "Training sessions can take time away from development, impacting schedules."
- tag: Reliability Risk
reason: "Creates a dependency on training programs and their effectiveness."
related:
- ../Documentation
- ../Development-and-Coding/Pair-Programming
Original file line number Diff line number Diff line change
@@ -23,7 +23,7 @@ practice:
reason: "Reduces complexity by managing system changes in a controlled and documented manner."
attendant:
- tag: Reliability Risk
reason: "Dependencies on the CM tools and processes can become critical points of failure."
reason: "Carefully managing software configuration ensures that the reliability of dependencies is also managed."
- tag: Security Risk
reason: "Incorrect configuration management can lead to security vulnerabilities."
related:
Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@ practice:
- "Resource Scaling"
mitigates:
- tag: Reliability Risk
reason: "Helps in efficiently allocating resources to meet the demand without overburdening the team."
reason: "Helps in efficiently allocating scarce dependencies to meet the most critical demands."
- tag: Deadline Risk
reason: "Ensures that the demand is managed to meet delivery schedules."
- tag: Market Risk
1 change: 1 addition & 0 deletions docs/practices/Deployment-And-Operations/Redundancy.md
@@ -12,6 +12,7 @@ practice:
- "Backup"
- "Failover"
- "Resilience"
- "Stockpiling"
mitigates:
- tag: Feature Risk
reason: "Ensures system availability and reliability in case of component failure."
2 changes: 2 additions & 0 deletions docs/practices/Deployment-And-Operations/Release.md
@@ -28,6 +28,8 @@ practice:
reason: "Releasing software means that the software has to be supported in production."
- tag: Process Risk
reason: "Complex release procedures are a source of process risk."
- tag: Reliability Risk
reason: "Releases can introduce discontinuities in software service if not managed well."
related:
- ../Planning-and-Management/Change-Management
- ../Tools-and-Standards/Version-Control
4 changes: 1 addition & 3 deletions docs/practices/Development-And-Coding/Debugging.md
@@ -18,12 +18,10 @@ practice:
- tag: Operational Risk
reason: "Ensures that the software operates correctly and efficiently."
- tag: Reliability Risk
reason: "Improves the reliability and stability of the software."
reason: "Removing bugs improves the reliability and stability of the software."
attendant:
- tag: Schedule Risk
reason: "Debugging can be time-consuming, affecting project timelines."
- tag: Reliability Risk
reason: "Debugging may reveal dependencies on other systems or components."
related:
- ../Development-and-Coding/Coding
- ../Testing-and-Quality-Assurance/Integration-Testing
4 changes: 3 additions & 1 deletion docs/practices/Development-And-Coding/Pair-Programming.md
@@ -20,7 +20,9 @@ practice:
- tag: Learning Curve Risk
reason: "Facilitates knowledge sharing and learning."
- tag: Implementation Risk
reason: "More eyeballs means fewer bugs and a better implementation"
reason: "More eyeballs means fewer bugs and a better implementation."
- tag: Reliability Risk
reason: "More developers may be able to produce a more reliable implementation."
attendant:
- tag: Coordination Risk
reason: "Requires coordination around time, place, activity and skills."
Original file line number Diff line number Diff line change
@@ -15,6 +15,8 @@ practice:
mitigates:
- tag: Feature Access Risk
reason: "Identifies performance bottlenecks that could impact operations."
- tag: Reliability Risk
reason: "Performance testing of software can establish bounds on its reliability."
attendant:
- tag: Schedule Risk
reason: "Can be time-consuming, leading to delays in the project timeline."
Original file line number Diff line number Diff line change
@@ -15,6 +15,8 @@ practice:
mitigates:
- tag: Regression Risk
reason: "Detects and prevents regressions in the software."
- tag: Reliability Risk
reason: "Regression testing helps prevent reliability breaks caused by software change."
attendant:
- tag: Schedule Risk
reason: "Can be time-consuming and introduce delays."
88 changes: 77 additions & 11 deletions docs/risks/Dependency-Risks/Reliability-Risk/Reliability-Risk.md
@@ -1,6 +1,6 @@
---
title: Reliability Risk
description: Risks of not getting benefit from a dependency due to its reliability, either now or in the future.
description: Risks of not getting benefit from a dependency due to its reliability.

slug: /risks/Reliability-Risk
featured:
@@ -16,26 +16,92 @@ part_of: Dependency Risk

<RiskIntro fm={frontMatter} />

This points to the problem that when we use an external dependency, we are at the mercy of its reliability.
Whenever we use an external dependency, we are at the mercy of its reliability.

> "... Reliability describes the ability of a system or component to function under stated conditions for a specified period of time." - [Reliability Engineering, _Wikipedia_](https://en.m.wikipedia.org/wiki/Reliability_engineering)
![Reliability Risk](/img/generated/risks/dependency/reliability-risk.svg)
It's easy to think about reliability for something like a bus: sometimes, it's late due to weather, or cancelled due to driver sickness, or the route changes unexpectedly due to road works. In software, it's no different: _unreliability_ is the flip-side of [Feature Implementation Risk](/tags/Implementation-Risk). It arises in the gap between the real behaviour of the software and our expectations for it.

It's easy to think about reliability for something like a bus: sometimes, it's late due to weather, or cancelled due to driver sickness, or the route changes unexpectedly due to road works.
## Worked Example

In software, it's no different: _unreliability_ is the flip-side of [Feature Implementation Risk](/tags/Implementation-Risk). It's caused in the gap between the real behaviour of the software and the expectations for it.
![Improving Reliability with Redundancy](/img/generated/risks/posters/reliability-risk.svg)

There is an upper bound on the reliability of the software you write, and this is based on the dependencies you use and (in turn) the reliability of those dependencies:
A team builds a web service using [a three-tier architecture](https://www.ibm.com/topics/three-tier-architecture) running on in-house servers. However, in practice, they find that their service has reliability issues: the single-node database often gets overloaded with requests and acts as a single choke-point for the whole system.

In response, they decide to add redundancy to the database tier. However, this now introduces two further issues. First, the application tier needs to choose which database node to route to (adding [Complexity](/tags/Complexity-Risk) to the design). Second, the database nodes need to [Coordinate](/tags/Coordination-Risk) and ensure that they present a consistent version of reality.

Most modern database management systems provide this kind of replication / synchronisation functionality, but it comes at the expense of node throughput, and cloud database services provide _elastic_ scaling for this kind of issue (see below).
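
To get a rough feel for what redundancy buys in the example above, here is a minimal sketch. It assumes each node is independently available for a fixed fraction of the time, which is a simplification that the text does not make, and every figure in it is hypothetical. It also ignores the coordination overhead between database nodes.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up at once, so availabilities multiply."""
    return prod(availabilities)


def redundant(availability: float, nodes: int) -> float:
    """At least one of `nodes` identical, independent replicas must be up."""
    return 1.0 - (1.0 - availability) ** nodes


# Hypothetical figures: a 99.9% application tier in front of a 99% database.
single = series(0.999, 0.99)                          # about 0.98901
replicated = series(0.999, redundant(0.99, nodes=3))  # about 0.998999
```

With a single database node the service as a whole cannot be more available than the database. With three replicas, the application tier becomes the limiting factor instead.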


## Types of Reliability / Example Threats

### 1. Availability

Availability is a measure of reliability commonly used for services, often expressed as _uptime_ (the percentage of time the service is up) or _mean time between failures (MTBF)_. However, availability is often a function of how heavily a service is being used, so requirements around availability are frequently expressed as _a percentage of requests served within a given time_ or an _error rate_ for requests as a whole.

**Threat:** An online service that you want to use doesn't publish [Service Level Agreements (SLAs)](https://en.wikipedia.org/wiki/Service-level_agreement) and therefore it's hard to build software reliably on top of it.
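
As an illustration of the measures above, the following sketch computes uptime, MTBF and error rate from one reporting window. The `Window` type, its field names and the sample figures are all hypothetical. A real SLA will define its own measurement rules.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Observations for one reporting window (all fields are hypothetical)."""
    total_seconds: float     # length of the window
    downtime_seconds: float  # time the service was unavailable
    failures: int            # number of distinct outages
    requests: int            # requests received
    failed_requests: int     # requests that errored or timed out


def uptime_percent(w: Window) -> float:
    """Percentage of the window during which the service was up."""
    return 100.0 * (w.total_seconds - w.downtime_seconds) / w.total_seconds


def mtbf_seconds(w: Window) -> float:
    """Mean time between failures: operating time divided by failure count."""
    if w.failures == 0:
        return float("inf")
    return (w.total_seconds - w.downtime_seconds) / w.failures


def error_rate(w: Window) -> float:
    """Fraction of requests that failed."""
    return w.failed_requests / w.requests if w.requests else 0.0


# A 30-day window with two outages totalling 43.2 minutes.
month = Window(
    total_seconds=30 * 24 * 3600,
    downtime_seconds=43.2 * 60,
    failures=2,
    requests=1_000_000,
    failed_requests=1_200,
)
```

For this window, uptime is about 99.9%, MTBF is roughly 15 days and the error rate is 0.12%. The three numbers describe the same month quite differently, which is one reason an SLA needs to say which measure it uses.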

### 2. Quality of Service (QoS)

Often a service dependency can be responding quickly (i.e. have good availability) but still perform inadequately, perhaps with wrong or sloppy results.

**Threat:** Performance testing establishes the operating bounds of a dependency, but not whether it is operating _correctly_ in those bounds.

### 3. Maintainability and Serviceability

A dependency that is easy to maintain, service, reconfigure and repair is more reliable.

**Threat**: The dependency is hard to introspect, making it difficult to diagnose issues.

**Threat**: The dependency has no easy way to make upgrades and changes except by bringing it down, damaging reliability.

### 4. Scalability

- If a component **A** depends on component **B**, unless there is some extra redundancy around **B**, then **A** _can't_ be more reliable than **B**.
- Is **A** or **B** a [Single Point Of Failure](https://en.wikipedia.org/wiki/Single_point_of_failure) in a system?
- Are there bugs in **B** that are going to prevent it working correctly in all circumstances?
Scalability measures how well a system will perform as the workload increases. Often, a dependency will scale in a predictable way up to a point and then become completely unreliable. It's often important to establish those limits.

**Threat**: A dependency appears to scale well under test conditions but then can fail spectacularly under more extreme real-life conditions.

### 5. Elasticity

_Elasticity_ expresses the concept of auto-scaling. Many cloud services now are designed to auto-scale. That is, more resources are added to the service (such as web servers, compute nodes etc) when the service is under heavy load.

**Threat**: You configure auto-scaling for a dependency, but are then caught out later when it runs way over the expected cost due to some poor performance at higher load levels.

### 6. Robustness

Robustness measures how well a service handles and recovers from failures. A waterproof watch is more robust than a non-waterproof one as it can survive in more hostile scenarios.

**Threat:** Unusual events test the robustness of dependencies: for example, outages, network issues and denial-of-service attacks.

### 7. Fault Tolerance

Systems are often composed of other systems (especially in a microservices architecture). This often means that the system as a whole is never 100% "up" or "down". How does the system route around issues when parts are failing?

**Threat:** Systems composed of multiple agents can often have "grey failures" where some parts of the system fail and others don't. This makes anticipating the behaviour of the system under these conditions very difficult.
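
One common way of routing around a failing part is client-side failover. The sketch below is an assumption about how this might look, not something the text prescribes. Each replica is modelled as a plain callable, and `call_with_failover` and `AllReplicasFailed` are hypothetical names.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def broken() -> str:
    raise ConnectionError("node unreachable")


def healthy() -> str:
    return "ok"


result = call_with_failover([broken, healthy])
```

This only handles failures that announce themselves with an exception. A replica that responds with stale or wrong data would be accepted as a success, which is exactly the kind of grey failure described above.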

### 8. Safety

When the system fails, does it fail-safe? Or will there be catastrophe? Will transactions somehow be left incomplete?

**Threat**: The billing dependency drops requests under load, resulting in you under-billing your customers for their usage of your software.
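
A hedged sketch of one way to fail safe in the billing case: keep any usage event that could not be delivered and retry it later, rather than losing it. This assumes the billing client signals a rejected request by raising an error. The `send` callable, `BillingUnavailable` and the in-memory queue are all hypothetical, and a real system would need durable storage for the pending events.

```python
from collections import deque
from typing import Callable, Deque


class BillingUnavailable(Exception):
    """Raised by the (hypothetical) billing client when a request is rejected."""


def record_usage(event: dict, send: Callable[[dict], None], pending: Deque[dict]) -> bool:
    """Send a usage event; on failure, keep it for a later retry instead of losing it."""
    try:
        send(event)
        return True
    except BillingUnavailable:
        pending.append(event)
        return False


def flush(send: Callable[[dict], None], pending: Deque[dict]) -> int:
    """Retry queued events in order, stopping at the first failure."""
    sent = 0
    while pending:
        try:
            send(pending[0])
        except BillingUnavailable:
            break
        pending.popleft()
        sent += 1
    return sent
```

If the dependency drops requests silently, without raising an error, this approach cannot detect the loss. Some form of reconciliation against the billing system's own records would be needed as well.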


## Reliability Engineering

This kind of stuff is encapsulated in the science of [Reliability Engineering](https://en.wikipedia.org/wiki/Reliability_engineering). For example, [Failure Mode and Effects Analysis (FMEA)](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis):

> "...was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems." - [FMEA, _Wikipedia_](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis)
This was applied on NASA missions, and then in the 1970s to car design following the [Ford Pinto exploding car](https://en.wikipedia.org/wiki/Ford_Pinto#Design_flaws_and_ensuing_lawsuits) affair. But establishing the reliability of software dependencies like this would be _hard_ and _expensive_. We are more likely to mitigate [Reliability Risk](/tags/Reliability-Risk) in software using _testing_, _redundancy_ and _reserves_, as shown in the diagram above.
This was applied on NASA missions, and then in the 1970s to car design following the Ford Pinto exploding car affair (see below). But establishing the reliability of software dependencies like this would be _hard_ and _expensive_. We are more likely to mitigate [Reliability Risk](/tags/Reliability-Risk) in software using [performance testing](/tags/Performance-Testing) or [redundancy](/tags/Redundancy), as shown in the diagram above.

Additionally, we often rely on _proxies for reliability_. We'll look at these proxies (and the way in which software projects signal their reliability) in much more detail in the section on [Software Dependency Risk](/tags/Software-Dependency-Risk).

:::tip Anecdote Corner

In the 1970s, Ford introduced a car called the Pinto. In order to reduce costs, the fuel tank was placed behind the rear axle, making it liable to get punctured if the car was rear-ended. Ford crash-tested the car and discovered the fault but decided _against_ reinforcing the tank, as it would have cost a further $11 per car.

Sadly, this led to hundreds of injuries and deaths - as well as public outrage. This incident led to automakers adopting the practice of reliability engineering going forwards: prioritising testing, failure analysis and integrating safety features.

:::


Additionally, we often rely on _proxies for reliability_. We'll look at these proxies (and the way in which software projects signal their reliability) in much more detail in the section on [Software Dependency Risk](/tags/Software-Dependency-Risk).
2 changes: 1 addition & 1 deletion docs/risks/Internal-Model-Risk/Internal-Model-Risk.md
@@ -47,7 +47,7 @@ But this is flawed, for the following reasons:

In 1973, Fischer Black and Myron Scholes published their ground-breaking paper describing the [Black-Scholes-Merton model](https://en.wikipedia.org/wiki/Black–Scholes_model) for pricing options. Pricing options (agreements to give someone the option to buy or sell something at a later date and price) had previously been hugely problematic, so the creation of a model that would do it correctly was a huge step forward and earned Merton and Scholes the 1997 Nobel Prize for Economics (Black had died in 1995 and was thus ineligible).

Long-Term Capital Management (LTCM) was founded in 1994 and was, for a while, a hugely successful hedge fund. Scholes and Merton sat on the board, which, along with incredible returns, lent the organisation a strong reputation. However, the models underlying their impressive returns were faulty. They were based on historical correlations and made assumptions about liquidity.
Long-Term Capital Management (LTCM) was founded in 1994 and was, for a while, a hugely successful hedge fund. Scholes and Merton sat on the board, which, along with incredible returns, lent the organisation a strong reputation. However, the models underlying their impressive returns were faulty: they were based on historical correlations (which might not hold in the future) and made assumptions about liquidity.

In 1997 and 1998, a confluence of market conditions (the Asian Financial Crisis and the Russian debt default) uncovered these weaknesses and the firm lost 90% of its value, exceeding $4bn, prompting the Federal Reserve Bank of New York to broker a bail-out by a consortium of banks.

Binary file modified numbers/Practices.numbers
Binary file not shown.
28 changes: 28 additions & 0 deletions src/images/generated/risks/posters/reliability-risk.adl
@@ -0,0 +1,28 @@
<diagram
xslt:template="/public/templates/risk-first/risk-first-template.xsl"
xmlns:xslt="http://www.kite9.org/schema/xslt"
xmlns="http://www.kite9.org/schema/adl"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" id="diagram-113"
style="--kite9-min-width: 900pt;">

<container id="c" bordered="true">
<mitigated>
<risk class="reliability" style="--kite9-horizontal-align: left;" />
</mitigated>
<label>Running the database on a single node
is deemed unlikely to be reliable enough.</label>
</container>

<action style="--kite9-horizontal-align: left;">Redundancy</action>

<container id="d" style="--kite9-sizing: maximize; ">

<risk class="complexity" />
<risk class="coordination" />

<label id="id_16">Attendant Risks </label>
</container>


</diagram>