Spark RSR docs for FilOz review #6

55 changes: 55 additions & 0 deletions .tools/fix-notion-relative-links.py
@@ -0,0 +1,55 @@
import re
import sys

# This helper script fixes the relative links that are broken by a Notion export.
# A Python script is more robust than a sed one-liner because it:
# 1. Properly handles special characters in the anchor links
# 2. Maintains the original spacing in the link text
# 3. Provides better error handling
# 4. Is more readable and maintainable

def convert_links(content):
    # Pattern to match Notion-style links, e.g.
    # [text](Spark%20Request-Based%20(Non-Committee)%20Global%20Retriev%204c5e8c47c45f467f80392d00cac2aae4.md)
    # Note that this is currently not parameterized and is tied to the Spark RSR export.
    # It could be generalized in the future.
    pattern = r'\[([^\]]+)\]\(Spark%20Request-Based%20\(Non-Committee\)%20Global%20Retriev%204c5e8c47c45f467f80392d00cac2aae4\.md\)'

    def replace_link(match):
        text = match.group(1)
        # Convert the link text to a proper anchor
        anchor = text.lower().replace(' ', '-').replace('/', '').replace('**', '')
        # Remove any other special characters
        anchor = re.sub(r'[^\w-]', '', anchor)
        return f'[{text}](#{anchor})'

    # Replace all matches
    return re.sub(pattern, replace_link, content)

def process_file(input_path, output_path=None):
    # Read the input file
    with open(input_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Convert the links
    updated_content = convert_links(content)

    # If no output path is specified, overwrite the input file
    output_path = output_path or input_path

    # Write the result
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(updated_content)

    print(f"Processed file saved to: {output_path}")

# Usage example
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python fix-notion-relative-links.py input_file [output_file]")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2] if len(sys.argv) > 2 else None

    process_file(input_file, output_file)
89 changes: 88 additions & 1 deletion README.md
@@ -1,2 +1,89 @@
# service-classes
# Filecoin Service Classes <!-- omit in toc -->

- [Purpose](#purpose)
- [Background](#background)
- [Service Classes](#service-classes)
- [Service Level Objectives](#service-level-objectives)
- [Service Level Indicators](#service-level-indicators)
- [Tenets](#tenets)
- [Conventions](#conventions)
- [Abbreviations](#abbreviations)
- [Improvement Proposal Process](#improvement-proposal-process)
- [FAQ](#faq)
- [Where is performance against a service class measured and presented?](#where-is-performance-against-a-service-class-measured-and-presented)
- [How do SPs signal which service classes they are seeking conformance with?](#how-do-sps-signal-which-service-classes-they-are-seeking-conformance-with)
- [How is "versioning" handled?](#how-is-versioning-handled)
- [What is a "service class" vs. "storage class"?](#what-is-a-service-class-vs-storage-class)
- [Why don't we use the term "SLA" currently?](#why-dont-we-use-the-term-sla-currently)


# Purpose
Houses definitions, discussions, and supporting materials/processes for Filecoin service classes, SLOs, and SLIs.

# Background

Storage clients have a diverse set of needs and as a result, storage providers like AWS, GCP, etc. have created a plethora of storage options to meet these needs. At least as of 202410, Filecoin doesn’t articulate clearly what storage classes are supported, how we define them, and how we’re measuring against them. Filecoin makes strong guarantees of replication with its daily spacetime proofs, but there are additional dimensions that storage clients want to have visibility into (e.g., retrievability, performance). This was a topic of conversation during [FIL Dev Summit #4](https://www.fildev.io/FDS-4), and a ["PMF Targets Working Group"](https://www.notion.so/Filecoin-PMF-Targets-Working-Group-111837df73d480b6a3a9e5bfd73063de) was started in 2024Q3 in an attempt to change this so storage clients can know what to expect and so the Filecoin ecosystem can clearly see opportunities to fill or improve.

# Service Classes

Service Class | Status
:--: | --
["(TBD) Warm"](./service-classes/warm.md) | 2024-11-04: This is a sketch of a service class definition to represent data stored with Filecoin that also has an accompanying unsealed copy for retrieval. Key details like the threshold SLO values and even the name have not been determined.
["(TBD) Cold"](./service-classes/cold.md) | 2024-11-04: This is a placeholder service class to illustrate that there should be multiple service classes, including one with slower retrievability than ["warm"](./service-classes/warm.md).

* A service class is a set of dimensions that define a type of storage. “Archival” and “Hot” are a couple of examples, with dimensions like "availability", "durability", and "performance". These service class dimensions have various [SLOs](#service-level-objectives) that should be met to satisfy the needs of that service class.
* Service classes are defined in the [`service-classes` directory](./service-classes/).
* There are intended to be many service classes.
* A service class should correspond with a set of expectations that a group of storage clients would have for certain data. This group of storage clients would expect to see all the corresponding SLOs consistently met by an SP in order to store their corresponding data with that SP.

# Service Level Objectives
* A Service Level Objective (SLO) is a quality target for a service class. It defines the “acceptable” value or threshold for an [SLI](#service-level-indicators). SLOs set expectations for storage clients using the storage service, and also give clear targets that storage providers need to hit and measure themselves against.
* The SLO tuples of [SLI](#service-level-indicators) and corresponding threshold for a service class are specified in the [`service-classes` directory](./service-classes/).
* The graph for an SLO should be the graph of the SLI and likely a horizontal line showing the threshold.
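As a rough illustration of the SLO-as-SLI-plus-threshold structure described above (a minimal sketch; the class and function names here are ours and the values hypothetical, not part of any defined service class):

```python
from dataclasses import dataclass

@dataclass
class SLO:
    """An SLO pairs an SLI name with the threshold it must meet."""
    sli: str
    threshold: float  # minimum acceptable daily value, as a fraction

def meets_slo(slo: SLO, measured: float) -> bool:
    """An SP satisfies the SLO when the measured SLI meets or exceeds the threshold."""
    return measured >= slo.threshold

# Hypothetical example: a daily retrieval-success threshold of 90%
retrievability = SLO(sli="spark-retrieval-success-rate", threshold=0.90)
print(meets_slo(retrievability, 0.93))  # a day measured at 93% meets the objective
print(meets_slo(retrievability, 0.85))  # a day measured at 85% does not
```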

# Service Level Indicators

Service Level Indicator | Status
-- | --
[Spark Retrieval Success Rate](./service-level-indicators/spark-retrieval-success-rate.md) | ![review](https://img.shields.io/badge/status-review-yellow.svg?style=flat-square) this is the first SLI that has been developed to meet expectations, both in terms of supporting documentation and being onchain. It is expected to serve as an example for to-be-created SLIs.
["(TBD) Sector Health Rate"](./service-level-indicators/sector-health-rate.md) | ![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) while this uses the original proof of spacetime (PoSt) that has always been part of Filecoin, the documentation for how the metric is computed and what it does and doesn't measure hasn't been developed.

* A Service Level Indicator (SLI) is a metric that measures compliance with an [SLO](#service-level-objectives). The SLI is the actual measurement; to meet the SLO, the SLI will need to meet or exceed the promises made by the SLO.
* SLIs are defined in the [`service-level-indicators` directory](./service-level-indicators/).

# Tenets
Below are tenets that have been guiding this work:
1. SLIs must be on chain. We are holding this line because:
    1. Forcing function for the data to actually get onchain. Compromising by allowing off-chain data at the start has historically made it hard to later do the lift of actually getting the data onchain, despite the best intentions.
2. We get the benefit of onchain data, as in the immutability guarantees. This is particularly important in an assumed future where these onchain “scores” will affect the reward structures of SPs.
2. The "rules of the game" are knowable and discoverable - All participants in any markets created from this work should understand what is being measured, how it's being measured, what isn't being measured, etc. so they can reason appropriately about what something means and what actions to take if any.
3. Gatekeep on quality but not exploration - There isn't one group of people that knows what the authoritative list of service classes should be. There should be room to explore. Peer review and approval should be applied to make sure that proposed service classes and SLIs are well documented and explained, more than to decide what the right set is.
4. Make room for alternatives - This is related to exploration. As a concrete example, we shouldn't discuss an unqualified "retrieval success rate", assuming there will only be a single SLI measuring retrieval. Instead, SLIs should have proper qualification (e.g., "_Spark_ retrieval success rate") to make clear that there is opportunity for other "retrieval success rate" SLIs to emerge.

# Conventions
* If something has a placeholder name, it is usually wrapped in quotes and prefixed with `(TBD)` (e.g., "(TBD) Warm" for a to-be-named service class that stores data that is "warmer" than the "(TBD) cold" service class)

## Abbreviations
Throughout this repo, these abbreviations are used:
* SLI - service level indicator
* SLO - service level objective
* SP - storage provider, meaning the entity as defined in the Filecoin protocol with an individual ID that commits sectors, accepts deals, etc. We're not referring to a brand/company, which might compose multiple providerIds/minerIds.

# Improvement Proposal Process
The process for proposing new service classes or SLIs, or modifying existing service classes and SLOs hasn't been determined yet. This is something we hope to get more formalized during 202411.

# FAQ
## Where is performance against a service class measured and presented?
The evaluation of an SP against a service class is expected to be done outside of this repo by any parties interested in doing so. One example is https://github.com/filecoin-project/filecoin-storage-providers-market

## How do SPs signal which service classes they are seeking conformance with?
There currently isn't any way for an SP to opt in to some service classes and out of others. We assume [measurement and presentation tools](#where-is-performance-against-a-service-class-measured-and-presented) will score SPs against all service classes, and their historical performance will make clear which service classes they are actually targeting.

## How is "versioning" handled?
TODO; fill this in

## What is a "service class" vs. "storage class"?
The terms are synonymous in our context, but we are using the term "service class" since that is the industry norm.

## Why don't we use the term "SLA" currently?
“Service Level Agreement” is avoided for now because it means different things to different people. For example, anyone in the storage world who uses S3 may have come to learn that [S3 only has one SLA](https://aws.amazon.com/s3/sla/), and it only pertains to service availability. S3 has [other performance dimensions that it evaluates its storage products against](https://aws.amazon.com/s3/storage-classes/#Performance_across_the_S3_storage_classes), but there are no SLAs there. An SLA is technically a legal contract that, if breached, carries a financial penalty. Filecoin the protocol doesn’t really have this currently except for proof of replication. (Clients may have off-chain agreements with SPs.) To keep the conversation clearer for now, we’re focusing on service classes and their SLOs. When there is actually reward and penalty for meeting these SLOs, we as a group can start introducing SLA terminology.
13 changes: 13 additions & 0 deletions service-classes/cold.md
@@ -0,0 +1,13 @@

# Status
* 2024-11-04: This is a placeholder service class to illustrate that there should be multiple service classes, including one with slower retrievability than ["warm"](./warm.md).

# Intended Users
TBD

# SLOs
Dimension | SLI | Threshold
-- | -- | --
"(TBD) Durability" | ["(TBD) Sector Health Rate"](../service-level-indicators/sector-health-rate.md) | TBD% per day

* At least as of 202411, there isn't a "retrievability SLO", since it isn't known what sort of "retrievability" SLI should be created/used for the case where only a sealed copy is kept: there is no protocol-defined way to request an unseal operation and then perform a retrieval check within a communicated amount of time.
21 changes: 21 additions & 0 deletions service-classes/warm.md
@@ -0,0 +1,21 @@

# Status
* 2024-11-04: This is a sketch of a service class definition to represent data stored with Filecoin that also has an accompanying unsealed copy for retrieval. Key details like the threshold SLO values and even the name have not been determined or agreed upon.

# Intended Users
This service class is targeting users who 1) expect to retrieve at least some subset of their data at least weekly and 2) expect the first byte in under a second when they do retrieve.

# SLOs
Dimension | SLI | Threshold
-- | -- | --
Retrievability | [Spark Retrieval Success Rate](../service-level-indicators/spark-retrieval-success-rate.md) | 90% per day
"(TBD) Durability" | ["(TBD) Sector Health Rate"](../service-level-indicators/sector-health-rate.md) | 99% per day

At least as of 202411, we're targeting a retrieval success rate of 90%, which seems low when compared to the "availability" guarantees that other cloud providers make. This is for a few reasons:
1. Retrievability in this decentralized Filecoin context is quite different from availability in a web2 context. Retrievability is being measured from an untrusted set of clients; web2 availability is being measured from the server side, and thus has fewer uncontrollable variables.
2. The [Spark Retrieval Success Rate docs](../service-level-indicators/spark-retrieval-success-rate.md) do a good job enumerating the various ways that results can be poisoned by malicious actors. This lower-than-99+% target is to account for these possibilities.
3. This level of Spark RSR is already significantly higher than the level of retrievability that most SPs were offering in early 2024. This SLO is moving SPs in a new direction, and it can be adjusted once a better threshold is determined.

This "(TBD) Sector Health Rate" of 99% doesn't match the many-9s "durability" targets that web2 providers have because:
1. They are different metrics. web2 providers are looking at the durability of each byte written to their service, which benefits from their infrastructure setup and erasure coding.
2. Often in the cases where a Storage Provider misses a PoSt, they meet it in future proving windows. This means the data wasn't lost, but rather that a sector was not proven to the network within its proving deadline.
73 changes: 73 additions & 0 deletions service-level-indicators/sector-health-rate.md
@@ -0,0 +1,73 @@
# Sector Health Rate <!-- omit from toc -->

- [Meta](#meta)
- [Document Purpose](#document-purpose)
- [Versions / Status](#versions--status)
- [Support, Questions, and Feedback](#support-questions-and-feedback)
- [TL;DR](#tldr)
- [Metric Definition](#metric-definition)
- [Implementation Details](#implementation-details)
- [Option 1: Lotus RPC Calls](#option-1-lotus-rpc-calls)
- [Option 2: Lily Events](#option-2-lily-events)
- [Appendix](#appendix)
- [Callouts/Concerns with this SLI](#calloutsconcerns-with-this-sli)
- [Related Items](#related-items)


# Meta

## Document Purpose

This document is intended to become the canonical resource that is referenced in [the Storage Providers Market Dashboard](https://github.com/filecoin-project/filecoin-storage-providers-market) wherever the “(TBD) Sector Health” graphs are shown. A reader of those graphs should be able to read this document and understand the "Sector Health SLO”. The goal of this document is to explain fully and clearly “the rules of the game”. With the “game rules”, we seek to empower market participants - onramps, aggregators and Storage Providers (SPs) - to “decide how they want to play the game”.

## Versions / Status
SLI Version | Status | Comment
-- | -- | --
v1.0.0 | ![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) | 2024-11-04: this was started as a placeholder to start moving the exploration work from https://github.com/davidgasquez/filecoin-data-portal/issues/79 over and to seed this repo with more than one metric definition. It needs more review, and particularly SP feedback on the caveats of this metric. It is not decided that "Sector Health Rate" is the right name or that this should be under "durability". Again, this current iteration was done to move fast so there is more skeleton in this repo before FDS 5.


## Support, Questions, and Feedback
If you see errors in this document, please open a PR.
If you have a question that isn't answered by the document, then ...
If you want to discuss ideas for improving this proposal, then ...

# TL;DR
Filecoin has a robust mechanism already for proving spacetime on chain for each sector. The proportion of successful proofs over time gives an indication of the "durability" of data stored on these sectors.

# Metric Definition

On a daily basis and for each SP, compute:
* `Number of Active Sectors`
* `Number of Faulted Sectors`

An SP's daily sector health rate is then

$$\frac{\text{Number of Active Sectors - Number of Faulted Sectors}}{\text{Number of Active Sectors}}$$
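The formula above can be sketched directly (a minimal illustration; the function name is ours, not part of any implementation):

```python
def sector_health_rate(active_sectors: int, faulted_sectors: int) -> float:
    """Daily sector health rate: (active - faulted) / active."""
    if active_sectors == 0:
        raise ValueError("SP has no active sectors; the rate is undefined")
    return (active_sectors - faulted_sectors) / active_sectors

# e.g., an SP with 1000 active sectors, 10 of which are faulted
print(sector_health_rate(1000, 10))  # 0.99
```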

# Implementation Details
There are multiple ways to compute this metric. Multiple options are outlined as they differ in self-service local reproducibility vs. scale.

## Option 1: Lotus RPC Calls
Below explains how to compute this metric using Lotus RPC calls:

* This metric is computed based on a single sampling per SP per day. This works because:
    1. A sector that is faulted stays in the fault state for a duration that is a multiple of 24 hours, given that a sector's state transitions into and out of the faulted state happen during the proving deadline for the sector.
2. New sectors in a given day may get missed until the next day, but sectors aren't a highly transient resource flipping into and out of existence. Since sectors tend to have a lifespan of months or years, not counting them on their first day isn't a significant impact on the metric over time.
* `Number of Active Sectors` is computed by taking the SP's Raw Power ([StateMinerPower](https://lotus.filecoin.io/reference/lotus/state/#stateminerpower)) divided by the SP's sector size ([StateMinerInfo](https://lotus.filecoin.io/reference/lotus/state/#stateminerinfo)).
* `Number of Faulted Sectors` is computed by daily querying for the [`StateMinerFaults`](https://lotus.filecoin.io/reference/lotus/state/#stateminerfaults) for each SP with sectors.
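A sketch of what the steps above could look like against a Lotus JSON-RPC endpoint. This is a hedged illustration, not a definitive implementation: the endpoint URL, lack of auth token, and helper names are assumptions, and `StateMinerFaults` returns an RLE+-encoded bitfield whose decoding into a count is elided here.

```python
import json
import urllib.request

LOTUS_RPC = "http://127.0.0.1:1234/rpc/v0"  # assumed local Lotus node

def lotus_call(method, params):
    """Minimal JSON-RPC helper for a Lotus node (no auth; adjust as needed)."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": f"Filecoin.{method}",
        "params": params,
    }).encode()
    req = urllib.request.Request(
        LOTUS_RPC, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

def active_sector_count(miner_addr):
    """Number of Active Sectors = raw byte power / sector size."""
    power = lotus_call("StateMinerPower", [miner_addr, None])  # None = current head
    info = lotus_call("StateMinerInfo", [miner_addr, None])
    raw_bytes = int(power["MinerPower"]["RawBytePower"])
    return raw_bytes // int(info["SectorSize"])

def daily_sector_health(active_sectors, faulted_sectors):
    """Combine the two daily counts into the rate from the metric definition."""
    return (active_sectors - faulted_sectors) / active_sectors

# Usage (requires a reachable Lotus node):
#   active = active_sector_count("f01234")
#   faults = lotus_call("StateMinerFaults", ["f01234", None])  # RLE+ bitfield; count set bits
```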

## Option 2: Lily Events
Below explains how a Filecoin blockchain event indexer like Lily can be used.

TODO: fill this in or link to the corresponding Lily query in FDP?

# Appendix

## Callouts/Concerns with this SLI

For full transparency, a list of potential issues or concerns about this SLI are presented below.

1. TODO: add items here

## Related Items
* https://github.com/davidgasquez/filecoin-data-portal/issues/79 - This is where exploration for this SLI was first done.