Commit 44be9d7

add: metrics spec draft
1 parent e5af917 commit 44be9d7

1 file changed

+113
-0
lines changed

metrics/README.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
# GossipSub Metrics Specification
2+
3+
> Standardized optional metrics for GossipSub implementations to enable consistent and comparable performance monitoring
4+
5+
| Lifecycle Stage | Maturity | Status | Latest Revision |
6+
|-----------------|---------------|--------|-----------------|
7+
| 1A | Working Draft | Active | r0, 2025-08-28 |
8+
9+
Authors: [@dennis-tra]
10+
11+
Interest Group: TBD
12+
13+
[@dennis-tra]: https://github.com/dennis-tra
14+
15+
See the [lifecycle document][lifecycle-spec] for context about the maturity level and spec status.
16+
17+
[lifecycle-spec]: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md
18+
19+
20+
## Table of Contents
21+
22+
- [GossipSub Metrics Specification](#gossipsub-metrics-specification)
23+
- [Table of Contents](#table-of-contents)
24+
- [Motivation](#motivation)
25+
- [Metric Definitions](#metric-definitions)
26+
- [Metric Update Semantics](#metric-update-semantics)
27+
- [Prometheus Export Format](#prometheus-export-format)
28+
- [Example Prometheus Output](#example-prometheus-output)
29+
- [Security Considerations](#security-considerations)
30+
31+
32+
## Motivation
33+
34+
GossipSub implementations in different programming languages currently expose differing sets of metrics for observability and performance monitoring. This inconsistency makes it hard for node operators and other users to deploy unified monitoring dashboards across heterogeneous deployments, compare performance between implementations, diagnose network health issues using standardized indicators, and write portable alerting rules and runbooks.
35+
36+
This specification defines a standardized set of **optional Prometheus-style metrics** that GossipSub implementations MAY support to enable consistent observability. It standardizes metric names, types, and labels, and defines when each metric is updated.
37+
38+
## Metric Definitions
39+
40+
All metrics follow Prometheus naming conventions and use the `gossipsub_` prefix. The following table defines the complete set of standardized metrics:
41+
42+
| Metric Name | Type | Labels | Description |
43+
|-------------|------|--------|--------------|
44+
| **Peer Management** |
45+
| `gossipsub_peers_total` | Gauge | `topic` (optional) | Current number of known peers, optionally segmented by topic |
46+
| `gossipsub_mesh_peers_total` | Gauge | `topic` (required) | Current number of peers in the mesh for each topic |
47+
| `gossipsub_peer_graft_total` | Counter | `topic` (required) | Total number of GRAFT messages sent, by topic |
48+
| `gossipsub_peer_prune_total` | Counter | `topic` (required), `reason` (optional) | Total number of PRUNE messages sent, by topic and optional reason |
49+
| `gossipsub_peer_score` | Histogram | `topic` (optional) | Distribution of peer scores |
50+
| **Message Flow** |
51+
| `gossipsub_message_received_total` | Counter | `topic` (required), `validation_result` (optional) | Total messages received for processing, optionally by validation result |
52+
| `gossipsub_message_delivered_total` | Counter | `topic` (required) | Total messages successfully delivered to local subscribers |
53+
| `gossipsub_message_rejected_total` | Counter | `topic` (required), `reason` (optional) | Total messages rejected during validation, optionally by reason |
54+
| `gossipsub_message_duplicate_total` | Counter | `topic` (required) | Total duplicate messages detected and discarded |
55+
| `gossipsub_message_published_total` | Counter | `topic` (required) | Total messages published by local node |
56+
| `gossipsub_message_latency_seconds` | Histogram | `topic` (optional) | End-to-end message delivery latency in seconds |
57+
| **Protocol Control** |
58+
| `gossipsub_rpc_received_total` | Counter | `message_type` (required) | Total RPC messages received by type (publish, subscribe, unsubscribe, graft, prune, ihave, iwant, idontwant) |
59+
| `gossipsub_rpc_sent_total` | Counter | `message_type` (required) | Total RPC messages sent by type |
60+
| `gossipsub_ihave_sent_total` | Counter | `topic` (required) | Total IHAVE control messages sent per topic |
61+
| `gossipsub_iwant_sent_total` | Counter | `topic` (required) | Total IWANT control messages sent per topic |
62+
| `gossipsub_idontwant_sent_total` | Counter | `topic` (required) | Total IDONTWANT control messages sent per topic |
63+
| **Performance & Health** |
64+
| `gossipsub_heartbeat_duration_seconds` | Histogram | None | Time spent processing each heartbeat operation |
65+
| `gossipsub_peer_throttled_total` | Counter | `reason` (optional) | Total number of times peers have been throttled |
66+
| `gossipsub_backoff_violations_total` | Counter | None | Total attempts to re-GRAFT before the backoff period has expired |
67+
| `gossipsub_score_penalty_total` | Counter | `penalty_type` (required), `topic` (optional) | Total peer scoring penalties applied by type |
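
The required/optional label rules in the table can be checked mechanically. The following is a minimal sketch, not part of the specification: `MetricSpec`, `check_labels`, and `SPECS` are hypothetical names, and only three rows of the table are transcribed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """One row of the metric table (hypothetical helper, not defined by the spec)."""
    name: str
    kind: str  # "counter", "gauge", or "histogram"
    required: frozenset = frozenset()
    optional: frozenset = frozenset()

    def check_labels(self, labels):
        """Raise ValueError unless `labels` satisfies this metric's label rules."""
        given = set(labels)
        missing = self.required - given
        unknown = given - self.required - self.optional
        if missing:
            raise ValueError(f"{self.name}: missing required labels {sorted(missing)}")
        if unknown:
            raise ValueError(f"{self.name}: unknown labels {sorted(unknown)}")


# Three rows transcribed from the table; the remaining rows follow the same shape.
SPECS = {
    spec.name: spec
    for spec in (
        MetricSpec("gossipsub_mesh_peers_total", "gauge",
                   required=frozenset({"topic"})),
        MetricSpec("gossipsub_peer_prune_total", "counter",
                   required=frozenset({"topic"}), optional=frozenset({"reason"})),
        MetricSpec("gossipsub_heartbeat_duration_seconds", "histogram"),
    )
}
```

An implementation could call `SPECS[name].check_labels(labels)` before recording a sample, so that a missing `topic` label is caught at the call site rather than in a dashboard.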
68+
69+
### Metric Update Semantics
70+
71+
**Counters** are incremented when:
72+
- `*_total` counters: each time the corresponding event occurs (message sent or received, peer action, etc.)
73+
- Events are counted at the protocol level, not application level
74+
75+
**Gauges** are updated when:
76+
- `*_peers_total`: when peers are added to or removed from peer tracking or topic meshes
77+
- Values reflect current state at time of observation
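
The difference between counter and gauge updates can be sketched as follows. This is an illustrative sketch only: `MeshMetrics` and its method names are hypothetical, and adding the peer to the mesh at the moment a GRAFT is sent is a simplifying assumption.

```python
from collections import defaultdict


class MeshMetrics:
    """Hypothetical in-memory tracker for one counter and one gauge."""

    def __init__(self):
        # Counter gossipsub_peer_graft_total: only ever increases.
        self.graft_total = defaultdict(int)
        # Backing state for gauge gossipsub_mesh_peers_total: topic -> peer IDs.
        self.mesh = defaultdict(set)

    def on_graft_sent(self, topic, peer_id):
        # Counted once per protocol-level event, regardless of the outcome.
        self.graft_total[topic] += 1
        self.mesh[topic].add(peer_id)

    def on_peer_removed(self, topic, peer_id):
        # Removal changes the gauge but never decrements a counter.
        self.mesh[topic].discard(peer_id)

    def mesh_peers(self, topic):
        # The gauge reports the current state at observation time.
        return len(self.mesh[topic])
```

After two GRAFTs and one removal on the same topic, the counter reads 2 while the gauge reads 1.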
78+
79+
**Histograms** are updated when:
80+
- `gossipsub_peer_score`: During peer scoring operations (recommended buckets: `[-100, -10, -1, 0, 1, 10, 100, +Inf]`)
81+
- `*_latency_seconds`: When latency measurements are available (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, +Inf]`)
82+
- `*_duration_seconds`: When timing operations complete (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf]`)
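
To make the bucket semantics concrete, here is a minimal sketch of a histogram with cumulative `le` (less than or equal) buckets. The `Histogram` class is hypothetical; a real implementation would normally use its Prometheus client library's histogram type instead.

```python
import bisect


class Histogram:
    """Hypothetical histogram with Prometheus-style cumulative buckets."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)  # finite upper bounds; +Inf is implicit
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bound >= value, i.e. the smallest bucket with value <= le.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        out, running = [], 0
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            running += n
            out.append((bound, running))
        return out


# Recommended buckets for *_duration_seconds, as listed above.
heartbeat = Histogram([0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0])
heartbeat.observe(0.003)
```

Each exported bucket counts every observation at or below its bound, so the `+Inf` bucket always equals the total observation count.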
83+
84+
## Prometheus Export Format
85+
86+
Implementations that expose these metrics MUST export them in the Prometheus text exposition format.
87+
88+
### Example Prometheus Output
89+
90+
```
91+
# HELP gossipsub_mesh_peers_total Current number of peers in the mesh for each topic
92+
# TYPE gossipsub_mesh_peers_total gauge
93+
gossipsub_mesh_peers_total{topic="ipfs-dht"} 8
94+
gossipsub_mesh_peers_total{topic="libp2p-announce"} 12
95+
96+
# HELP gossipsub_message_received_total Total messages received for processing
97+
# TYPE gossipsub_message_received_total counter
98+
gossipsub_message_received_total{topic="ipfs-dht",validation_result="accept"} 1543
99+
gossipsub_message_received_total{topic="ipfs-dht",validation_result="reject"} 23
100+
101+
# HELP gossipsub_heartbeat_duration_seconds Time spent processing each heartbeat operation
102+
# TYPE gossipsub_heartbeat_duration_seconds histogram
103+
gossipsub_heartbeat_duration_seconds_bucket{le="0.001"} 45
104+
gossipsub_heartbeat_duration_seconds_bucket{le="0.005"} 123
105+
gossipsub_heartbeat_duration_seconds_bucket{le="+Inf"} 150
106+
gossipsub_heartbeat_duration_seconds_sum 0.456
107+
gossipsub_heartbeat_duration_seconds_count 150
108+
```
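
For implementations without a Prometheus client library, output like the example above can be produced by hand. The sketch below is an assumption-laden illustration: `render_metric` and `escape_label_value` are hypothetical helpers, they cover only counters and gauges, and they do not escape the HELP text.

```python
def escape_label_value(value):
    # The text format requires escaping backslash, double quote, and line feed.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, kind, help_text, samples):
    """Render one counter or gauge family; `samples` is a list of (labels, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


output = render_metric(
    "gossipsub_mesh_peers_total",
    "gauge",
    "Current number of peers in the mesh for each topic",
    [({"topic": "ipfs-dht"}, 8), ({"topic": "libp2p-announce"}, 12)],
)
```

Sorting the label keys keeps the output stable between scrapes, which makes diffs and tests easier to read.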
109+
110+
## Security Considerations
111+
112+
TODO: Cardinality Attack: Malicious peers could cause high label cardinality by creating many topics or using diverse peer IDs
113+
TODO: Information Disclosure: Topic names in metrics may reveal sensitive information about network usage patterns
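
One possible mitigation for the cardinality concern, which this draft does not prescribe, is to bound the number of distinct `topic` label values. The sketch below assumes a hypothetical `TopicLabeler` helper with an arbitrary limit and an `"other"` overflow value, neither of which is defined by this specification.

```python
class TopicLabeler:
    """Hypothetical helper that caps the number of distinct topic label values."""

    def __init__(self, max_topics=100, overflow="other"):
        self.max_topics = max_topics  # assumed limit, not taken from the spec
        self.overflow = overflow      # assumed overflow label value
        self.seen = set()

    def label(self, topic):
        # Topics that already have a label slot keep their own name.
        if topic in self.seen:
            return topic
        # New topics get a slot only while the budget lasts.
        if len(self.seen) < self.max_topics:
            self.seen.add(topic)
            return topic
        # Everything beyond the budget shares a single overflow series.
        return self.overflow
```

A first-come budget like this can still be exhausted by an attacker before legitimate topics appear, so restricting labels to locally subscribed topics may be a stronger choice.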
