Added reliability risk

risk-first · Dec 27, 2024 · 37b9327
1 parent 3bb47aa
commit 37b9327

Showing 65 changed files with 5,945 additions and 1,794 deletions.
diff --git a/dictionary.txt b/dictionary.txt
@@ -365,3 +365,8 @@ ratchets
 laborious
 reimplement
 devolved
+microservices
+serviceability
+automakers
+pinto
+uptime
diff --git a/docs/practices/Communication-And-Collaboration/Review.md b/docs/practices/Communication-And-Collaboration/Review.md
@@ -23,7 +23,9 @@ practice:
    - tag: Agency Risk
      reason: "Reviewing work or activity can ensure good behaviour."
    - tag: Internal Model Risk
-     reason: "Reviews and audits can uncover unseen problems in a system"
+     reason: "Reviews and audits can uncover unseen problems in a system."
+   - tag: Reliability Risk
+     reason: "Reviews and audits can be performed to investigate the causes of unreliability in a system."
   attendant:
    - tag: Schedule Risk
      reason: "Reviews can introduce delays in the project timeline."

diff --git a/docs/practices/Communication-And-Collaboration/Training.md b/docs/practices/Communication-And-Collaboration/Training.md
@@ -22,8 +22,6 @@ practice:
   attendant:
    - tag: Schedule Risk
      reason: "Training sessions can take time away from development, impacting schedules."
-   - tag: Reliability Risk
-     reason: "Creates a dependency on training programs and their effectiveness."
   related:
    - ../Documentation
    - ../Development-and-Coding/Pair-Programming

diff --git a/docs/practices/Deployment-And-Operations/Configuration-Management.md b/docs/practices/Deployment-And-Operations/Configuration-Management.md
@@ -23,7 +23,7 @@ practice:
      reason: "Reduces complexity by managing system changes in a controlled and documented manner."
   attendant:
    - tag: Reliability Risk
-     reason: "Dependencies on the CM tools and processes can become critical points of failure."
+     reason: "Carefully managing software configuration ensures that the reliability of dependencies is also managed."
    - tag: Security Risk
      reason: "Incorrect configuration management can lead to security vulnerabilities."
   related:

diff --git a/docs/practices/Deployment-And-Operations/Demand-Management.md b/docs/practices/Deployment-And-Operations/Demand-Management.md
@@ -16,7 +16,7 @@ practice:
    - "Resource Scaling"
   mitigates:
    - tag: Reliability Risk
-     reason: "Helps in efficiently allocating resources to meet the demand without overburdening the team."
+     reason: "Helps in efficiently allocating scarce dependencies to meet the most critical demands."
    - tag: Deadline Risk
      reason: "Ensures that the demand is managed to meet delivery schedules."
    - tag: Market Risk

diff --git a/docs/practices/Deployment-And-Operations/Redundancy.md b/docs/practices/Deployment-And-Operations/Redundancy.md
@@ -12,6 +12,7 @@ practice:
    - "Backup"
    - "Failover"
    - "Resilience"
+   - "Stockpiling"
   mitigates:
    - tag: Feature Risk
      reason: "Ensures system availability and reliability in case of component failure."

diff --git a/docs/practices/Deployment-And-Operations/Release.md b/docs/practices/Deployment-And-Operations/Release.md
@@ -28,6 +28,8 @@ practice:
      reason: "Releasing software means that the software has to be supported in production."
    - tag: Process Risk
      reason: "Complex release procedures are a source of process risk."
+   - tag: Reliability Risk
+     reason: "Releases can introduce discontinuities in software service if not managed well."
   related:
    - ../Planning-and-Management/Change-Management
    - ../Tools-and-Standards/Version-Control

diff --git a/docs/practices/Development-And-Coding/Debugging.md b/docs/practices/Development-And-Coding/Debugging.md
@@ -18,12 +18,10 @@ practice:
    - tag: Operational Risk
      reason: "Ensures that the software operates correctly and efficiently."
    - tag: Reliability Risk
-     reason: "Improves the reliability and stability of the software."
+     reason: "Removing bugs improves the reliability and stability of the software."
   attendant:
    - tag: Schedule Risk
      reason: "Debugging can be time-consuming, affecting project timelines."
-   - tag: Reliability Risk
-     reason: "Debugging may reveal dependencies on other systems or components."
   related:
    - ../Development-and-Coding/Coding
    - ../Testing-and-Quality-Assurance/Integration-Testing

diff --git a/docs/practices/Development-And-Coding/Pair-Programming.md b/docs/practices/Development-And-Coding/Pair-Programming.md
@@ -20,7 +20,9 @@ practice:
    - tag: Learning Curve Risk
      reason: "Facilitates knowledge sharing and learning."
    - tag: Implementation Risk
-     reason: "More eyeballs means fewer bugs and a better implementation"     
+     reason: "More eyeballs mean fewer bugs and a better implementation."
+   - tag: Reliability Risk
+     reason: "More developers may be able to produce a more reliable implementation."    
   attendant:
    - tag: Coordination Risk
      reason: "Requires coordination around time, place, activity and skills."

diff --git a/docs/practices/Testing-and-Quality-Assurance/Performance-Testing.md b/docs/practices/Testing-and-Quality-Assurance/Performance-Testing.md
@@ -15,6 +15,8 @@ practice:
   mitigates:
    - tag: Feature Access Risk
      reason: "Identifies performance bottlenecks that could impact operations."
+   - tag: Reliability Risk
+     reason: "Performance testing software can establish bounds on its reliability."
   attendant:
    - tag: Schedule Risk
      reason: "Can be time-consuming, leading to delays in the project timeline."

diff --git a/docs/practices/Testing-and-Quality-Assurance/Regression-Testing.md b/docs/practices/Testing-and-Quality-Assurance/Regression-Testing.md
@@ -15,6 +15,8 @@ practice:
   mitigates:
    - tag: Regression Risk
      reason: "Detects and prevents regressions in the software."
+   - tag: Reliability Risk
+     reason: "Regression testing helps prevent reliability breaks caused by software change."
   attendant:
    - tag: Schedule Risk
      reason: "Can be time-consuming and introduce delays."

diff --git a/docs/risks/Dependency-Risks/Reliability-Risk/Reliability-Risk.md b/docs/risks/Dependency-Risks/Reliability-Risk/Reliability-Risk.md
@@ -1,6 +1,6 @@
 ---
 title: Reliability Risk
-description: Risks of not getting benefit from a dependency due to it's reliability, either now or in the future.
+description: Risks of not getting benefit from a dependency due to its reliability.
 
 slug: /risks/Reliability-Risk
 featured: 
@@ -16,26 +16,92 @@ part_of: Dependency Risk
 
 <RiskIntro fm={frontMatter} />
 
-This points to the problem that when we use an external dependency, we are at the mercy of its reliability.   
+Whenever we use an external dependency, we are at the mercy of its reliability.   
 
 > "... Reliability describes the ability of a system or component to function under stated conditions for a specified period of time." - [Reliability Engineering, _Wikipedia_](https://en.m.wikipedia.org/wiki/Reliability_engineering)
 
-![Reliability Risk](/img/generated/risks/dependency/reliability-risk.svg) 
+It's easy to think about reliability for something like a bus:  sometimes, it's late due to weather, or cancelled due to driver sickness, or the route changes unexpectedly due to road works.  In software, it's no different:  _unreliability_ is the flip-side of [Feature Implementation Risk](/tags/Implementation-Risk).  It arises from the gap between the real behaviour of the software and our expectations for it.
 
-It's easy to think about reliability for something like a bus:  sometimes, it's late due to weather, or cancelled due to driver sickness, or the route changes unexpectedly due to road works.  
+## Worked Example
 
-In software, it's no different:  _unreliability_ is the flip-side of [Feature Implementation Risk](/tags/Implementation-Risk).  It's caused in the gap between the real behaviour of the software and the expectations for it.
+![Improving Reliability with Redundancy](/img/generated/risks/posters/reliability-risk.svg)
 
-There is an upper bound on the reliability of the software you write, and this is based on the dependencies you use and (in turn) the reliability of those dependencies:
+A team builds a web service using [a three-tier architecture](https://www.ibm.com/topics/three-tier-architecture) running on in-house servers.  However, in practice they find that their service has reliability issues: the single-node database often gets overloaded with requests and acts as a single choke-point for the whole system.
+
+In response, they decide to add redundancy to the database tier.  However, this introduces two new issues.  First, the application tier needs to choose which database node to route to, adding [Complexity](/tags/Complexity-Risk) to the design.  Second, the database nodes need to [Coordinate](/tags/Coordination-Risk) and ensure that they present a consistent version of reality.
+
+Most modern database management systems provide this kind of replication / synchronisation functionality, but it comes at the expense of node throughput.  Cloud database services provide _elastic_ scaling for this kind of issue (see below).
+
+
+## Types of Reliability / Example Threats
+
+### 1. Availability  
+
+Availability is a measure of reliability commonly used for services, usually expressed as _uptime_ (the percentage of time the service is up) or _mean time between failures (MTBF)_.  However, availability is often a function of how heavily a service is being used, so requirements around availability are frequently expressed as _a percentage of requests within a given time_ or an _error rate_ for requests as a whole.
+
+**Threat:** An online service that you want to use doesn't publish [Service Level Agreements (SLAs)](https://en.wikipedia.org/wiki/Service-level_agreement) and therefore it's hard to build software reliably on top of it.
+
+### 2. Quality of Service (QoS)  
+
+Often a service dependency can be responding quickly (i.e. have good availability) but still perform inadequately, perhaps with wrong or sloppy results.  
+
+**Threat:** Performance testing establishes the operating bounds of a dependency, but not whether it is operating _correctly_ in those bounds.
+
+### 3. Maintainability and Serviceability
+
+A dependency that is easy to maintain, service and reconfigure or repair is more reliable.  
+
+**Threat**: The dependency is hard to introspect, making it difficult to diagnose issues.
+
+**Threat**: The dependency has no easy way to make upgrades and changes, except by bringing it down, damaging reliability.
+
+### 4. Scalability
 
- - If a component **A** depends on component **B**, unless there is some extra redundancy around **B**, then **A** _can't_ be more reliable than **B**.
- - Is **A** or **B** a [Single Point Of Failure](https://en.wikipedia.org/wiki/Single_point_of_failure) in a system?
- - Are there bugs in **B** that are going to prevent it working correctly in all circumstances?
+Scalability measures how well a system will perform as the workload increases.  Often, a dependency will scale in a predictable way up to a point and then become completely unreliable. It's often important to establish those limits.
+
+**Threat**: A dependency appears to scale well under test conditions but then can fail spectacularly under more extreme real-life conditions.
+
+### 5. Elasticity
+
+_Elasticity_ expresses the concept of auto-scaling.  Many cloud services now are designed to auto-scale.  That is, more resources are added to the service (such as web servers, compute nodes etc) when the service is under heavy load. 
+
+**Threat**: You configure auto-scaling for a dependency, but are then caught out later when it runs way over the expected cost due to some poor performance at higher load levels.   
+
+### 6. Robustness
+
+Robustness measures how well a service handles and recovers from failures.  A waterproof watch is more robust than a non-waterproof one as it can survive in more hostile scenarios.
+
+**Threat:** Unusual events test the robustness of dependencies.  For example, outages, network issues, denial-of-service attacks.
+
+### 7. Fault Tolerance
+
+Systems are often composed of other systems (especially in a microservices architecture).  This often means that the system as a whole is never 100% "up" or "down".  How does the system route around issues when parts are failing?
+
+**Threat:** Systems composed of multiple agents can often have "grey failures" where some parts of the system fail and others don't.  This makes anticipating the behaviour of the system under these conditions very difficult.
+
+### 8. Safety
+
+When the system fails, does it fail-safe?  Or will there be catastrophe?  Will transactions somehow be left incomplete?  
+
+**Threat**: The billing dependency drops requests under load, resulting in you under-billing your customers for their usage of your software.
+
+
+## Reliability Engineering
 
 This kind of stuff is encapsulated in the science of [Reliability Engineering](https://en.wikipedia.org/wiki/Reliability_engineering).   For example, [Failure Mode and Effects Analysis (FMEA)](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis):
 
 > "...was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems." - [FMEA, _Wikipedia_](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis)
 
-This was applied on NASA missions, and then in the 1970's to car design following the [Ford Pinto exploding car](https://en.wikipedia.org/wiki/Ford_Pinto#Design_flaws_and_ensuing_lawsuits) affair.  But establishing the reliability of software dependencies like this would be _hard_ and _expensive_.  We are more likely to mitigate [Reliability Risk](/tags/Reliability-Risk) in software using _testing_, _redundancy_ and _reserves_, as shown in the diagram above.  
+This was applied on NASA missions, and then in the 1970s to car design following the Ford Pinto exploding-car affair (see below).  But establishing the reliability of software dependencies like this would be _hard_ and _expensive_.  We are more likely to mitigate [Reliability Risk](/tags/Reliability-Risk) in software using [performance testing](/tags/Performance-Testing) or [redundancy](/tags/Redundancy), as shown in the diagram above.
+
+Additionally, we often rely on _proxies for reliability_.  We'll look at these proxies (and the way in which software projects signal their reliability) in much more detail in the section on [Software Dependency Risk](/tags/Software-Dependency-Risk).
+
+:::tip Anecdote Corner
+
+In the 1970s, Ford introduced a car called the Pinto.  In order to reduce costs, the fuel tank was placed behind the rear axle, making it liable to get punctured if the car was rear-ended.  Ford crash-tested the car and discovered the fault but decided _against_ reinforcing the tank as it would have cost a further $11 per car.
+
+Sadly, this led to hundreds of injuries and deaths - as well as public outrage.  This incident led to automakers adopting the practice of reliability engineering going forwards:  prioritising testing, failure analysis and integrating safety features.
+
+:::
+
 
-Additionally, we often rely on _proxies for reliability_.  We'll look at these proxies (and the way in which software projects signal their reliability) in much more detail in the section on [Software Dependency Risk](/tags/Software-Dependency-Risk).
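
Reviewer note on the Reliability-Risk.md changes above: the removed bullet ("**A** _can't_ be more reliable than **B**") and the new Availability and Worked Example text can be illustrated with some simple arithmetic. The following is only a minimal sketch, not part of the commit. The function names are hypothetical, the mean-time-to-repair (MTTR) figure is an assumption that does not appear in the document, and the formulas assume that components fail independently, which real replicated databases often do not.

```python
import math


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR). MTTR is an assumed extra input."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be up.
    The result can never exceed the least available component."""
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of N interchangeable replicas where one is enough,
    assuming independent failures (an optimistic simplification)."""
    return 1.0 - (1.0 - availability) ** replicas


if __name__ == "__main__":
    # Illustrative numbers only: a single-node database and an app tier.
    db = availability_from_mtbf(mtbf_hours=99.0, mttr_hours=1.0)
    app = 0.999

    single = serial_availability(app, db)
    replicated = serial_availability(app, redundant_availability(db, 2))

    print(f"single-node database: {single:.4%}")
    print(f"two database replicas: {replicated:.4%}")
```

The sketch deliberately ignores the attendant Complexity and Coordination risks named in the worked example: routing and replication logic are themselves components that can fail, so the real gain from redundancy is likely to be smaller than this arithmetic suggests.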
diff --git a/docs/risks/Internal-Model-Risk/Internal-Model-Risk.md b/docs/risks/Internal-Model-Risk/Internal-Model-Risk.md
@@ -47,7 +47,7 @@ But this is flawed, for the following reasons:
 
 In 1973, Fischer Black and Myron Scholes published their ground-breaking paper describing the [Black-Scholes-Merton model](https://en.wikipedia.org/wiki/Black–Scholes_model) for pricing options.  Pricing options (agreements to give someone the option to buy or sell something at a later date and price) had previously been hugely problematic, so the creation of a model that would do it correctly was a huge step forward and earned Merton and Scholes the 1997 Nobel Prize for Economics (Black had died in 1995 and was thus ineligible).  
 
-Long-Term Capital Management (LTCM) was founded in 1994 and was, for a while, a hugely successful hedge fund.  Scholes and Merton sat on the board, which, along with incredible returns lent the organisation a strong reputation.  However, the models underlying their impressive returns were faulty.  They were based on historical correlations and made assumptions about liquidity.
+Long-Term Capital Management (LTCM) was founded in 1994 and was, for a while, a hugely successful hedge fund.  Scholes and Merton sat on the board, which, along with incredible returns, lent the organisation a strong reputation.  However, the models underlying their impressive returns were faulty: they were based on historical correlations (which might not hold in the future) and made assumptions about liquidity.
 
 In 1997, a confluence of market conditions (the Asian Financial Crisis and Russian Debt Default) uncovered these weaknesses and the firm lost 90% of its value, exceeding $4bn, forcing the US government to stage a bail-out.  
 

diff --git a/numbers/Practices.numbers b/numbers/Practices.numbers
diff --git a/src/images/generated/risks/posters/reliability-risk.adl b/src/images/generated/risks/posters/reliability-risk.adl
@@ -0,0 +1,28 @@
+<diagram
+	xslt:template="/public/templates/risk-first/risk-first-template.xsl"
+	xmlns:xslt="http://www.kite9.org/schema/xslt"
+	xmlns="http://www.kite9.org/schema/adl"
+	xmlns:svg="http://www.w3.org/2000/svg"
+	xmlns:xlink="http://www.w3.org/1999/xlink" id="diagram-113"
+	style="--kite9-min-width: 900pt;">
+
+	<container id="c" bordered="true">
+		<mitigated>
+			<risk class="reliability" style="--kite9-horizontal-align: left;" />
+		</mitigated>
+		<label>Running the database on a single node
+		is deemed unlikely to be reliable enough.</label>
+	</container>
+
+	<action style="--kite9-horizontal-align: left;">Redundancy</action>
+
+	<container id="d" style="--kite9-sizing: maximize; ">
+
+		<risk class="complexity" />		
+		<risk class="coordination" />
+
+		<label id="id_16">Attendant Risks </label>
+	</container>
+
+
+</diagram>