| 1 | +# GossipSub Metrics Specification |
| 2 | + |
| 3 | +> Standardized optional metrics for GossipSub implementations to enable consistent and comparable performance monitoring |
| 4 | + |
| 5 | +| Lifecycle Stage | Maturity | Status | Latest Revision | |
| 6 | +|-----------------|---------------|--------|-----------------| |
| 7 | +| 1A | Working Draft | Active | r0, 2025-08-28 | |
| 8 | + |
| 9 | +Authors: [@dennis-tra] |
| 10 | + |
| 11 | +Interest Group: TBD |
| 12 | + |
| 13 | +[@dennis-tra]: https://github.com/dennis-tra |
| 14 | + |
| 15 | +See the [lifecycle document][lifecycle-spec] for context about the maturity level and spec status. |
| 16 | + |
| 17 | +[lifecycle-spec]: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md |
| 18 | + |
| 19 | + |
| 20 | +## Table of Contents |
| 21 | + |
| 22 | +- [GossipSub Metrics Specification](#gossipsub-metrics-specification) |
| 23 | + - [Table of Contents](#table-of-contents) |
| 24 | + - [Motivation](#motivation) |
| 25 | + - [Metric Definitions](#metric-definitions) |
| 26 | + - [Metric Update Semantics](#metric-update-semantics) |
| 27 | + - [Prometheus Export Format](#prometheus-export-format) |
| 28 | + - [Example Prometheus Output](#example-prometheus-output) |
| 29 | + - [Security Considerations](#security-considerations) |
| 30 | + |
| 31 | + |
| 32 | +## Motivation |
| 33 | + |
| 34 | +GossipSub implementations across different programming languages currently expose varying sets of metrics for observability and performance monitoring. This inconsistency makes it difficult for node operators, among others, to deploy unified monitoring dashboards across heterogeneous deployments, compare performance characteristics between implementations, diagnose network health issues using standardized indicators, and create portable alerting rules and runbooks. |
| 35 | + |
| 36 | +This specification defines a standardized set of **optional Prometheus-style metrics** that GossipSub implementations MAY support to enable consistent observability. Its goals are to define standardized metric names, types, and labels, as well as the semantics that determine when each metric is updated. |
| 37 | + |
| 38 | +## Metric Definitions |
| 39 | + |
| 40 | +All metrics follow Prometheus naming conventions and use the `gossipsub_` prefix. The following table defines the complete set of standardized metrics: |
| 41 | + |
| 42 | +| Metric Name | Type | Labels | Description | |
| 43 | +|-------------|------|--------|--------------| |
| 44 | +| **Peer Management** | |
| 45 | +| `gossipsub_peers_total` | Gauge | `topic` (optional) | Current number of known peers, optionally segmented by topic | |
| 46 | +| `gossipsub_mesh_peers_total` | Gauge | `topic` (required) | Current number of peers in the mesh for each topic | |
| 47 | +| `gossipsub_peer_graft_total` | Counter | `topic` (required) | Total number of GRAFT messages sent, by topic | |
| 48 | +| `gossipsub_peer_prune_total` | Counter | `topic` (required), `reason` (optional) | Total number of PRUNE messages sent, by topic and optional reason | |
| 49 | +| `gossipsub_peer_score` | Histogram | `topic` (optional) | Distribution of peer scores | |
| 50 | +| **Message Flow** | |
| 51 | +| `gossipsub_message_received_total` | Counter | `topic` (required), `validation_result` (optional) | Total messages received for processing, optionally by validation result | |
| 52 | +| `gossipsub_message_delivered_total` | Counter | `topic` (required) | Total messages successfully delivered to local subscribers | |
| 53 | +| `gossipsub_message_rejected_total` | Counter | `topic` (required), `reason` (optional) | Total messages rejected during validation, optionally by reason | |
| 54 | +| `gossipsub_message_duplicate_total` | Counter | `topic` (required) | Total duplicate messages detected and discarded | |
| 55 | +| `gossipsub_message_published_total` | Counter | `topic` (required) | Total messages published by local node | |
| 56 | +| `gossipsub_message_latency_seconds` | Histogram | `topic` (optional) | End-to-end message delivery latency in seconds | |
| 57 | +| **Protocol Control** | |
| 58 | +| `gossipsub_rpc_received_total` | Counter | `message_type` (required) | Total RPC messages received by type (publish, subscribe, unsubscribe, graft, prune, ihave, iwant, idontwant) | |
| 59 | +| `gossipsub_rpc_sent_total` | Counter | `message_type` (required) | Total RPC messages sent by type | |
| 60 | +| `gossipsub_ihave_sent_total` | Counter | `topic` (required) | Total IHAVE control messages sent per topic | |
| 61 | +| `gossipsub_iwant_sent_total` | Counter | `topic` (required) | Total IWANT control messages sent per topic | |
| 62 | +| `gossipsub_idontwant_sent_total` | Counter | `topic` (required) | Total IDONTWANT control messages sent per topic | |
| 63 | +| **Performance & Health** | |
| 64 | +| `gossipsub_heartbeat_duration_seconds` | Histogram | None | Time spent processing each heartbeat operation | |
| 65 | +| `gossipsub_peer_throttled_total` | Counter | `reason` (optional) | Total number of times peers have been throttled | |
| 66 | +| `gossipsub_backoff_violations_total` | Counter | None | Total attempts to re-GRAFT before the backoff period has elapsed | |
| 67 | +| `gossipsub_score_penalty_total` | Counter | `penalty_type` (required), `topic` (optional) | Total peer scoring penalties applied by type | |
| 68 | + |
| 69 | +### Metric Update Semantics |
| 70 | + |
| 71 | +**Counters** are incremented when: |
| 72 | +- `*_total` metrics of type Counter: Each time the corresponding event occurs (message sent/received, peer action, etc.). The `*_peers_total` metrics are Gauges and are covered below. |
| 73 | +- Events are counted at the protocol level, not application level |
| 74 | + |
| 75 | +**Gauges** are updated when: |
| 76 | +- `*_peers_total`: Peers are added/removed from peer tracking or topic meshes |
| 77 | +- Values reflect current state at time of observation |
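The counter and gauge rules above can be sketched with a minimal, standard-library-only model. This is an illustration under stated assumptions, not a reference implementation: the `Counter` and `Gauge` classes, the `mesh` dictionary, and the `on_graft_sent` / `on_prune_sent` hooks are hypothetical names that do not come from any GossipSub library. Only the metric names are taken from the table.

```python
from collections import defaultdict


def label_key(labels):
    """Normalize a label dict into a hashable, order-independent key."""
    return tuple(sorted(labels.items()))


class Counter:
    """Monotonically increasing value per label set (hypothetical helper)."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(int)

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.values[label_key(labels)] += amount


class Gauge:
    """Current value per label set; may go up or down (hypothetical helper)."""

    def __init__(self, name):
        self.name = name
        self.values = {}

    def set(self, value, **labels):
        self.values[label_key(labels)] = value


graft_total = Counter("gossipsub_peer_graft_total")
prune_total = Counter("gossipsub_peer_prune_total")
mesh_peers = Gauge("gossipsub_mesh_peers_total")

# Assumed router state for this sketch: topic -> set of peer IDs in the mesh.
mesh = defaultdict(set)


def on_graft_sent(topic, peer_id):
    """Hypothetical hook: the router sent a GRAFT for `topic` to `peer_id`."""
    mesh[topic].add(peer_id)
    graft_total.inc(topic=topic)                   # counter: one event, one increment
    mesh_peers.set(len(mesh[topic]), topic=topic)  # gauge: reflects current state


def on_prune_sent(topic, peer_id):
    """Hypothetical hook: the router sent a PRUNE for `topic` to `peer_id`."""
    mesh[topic].discard(peer_id)
    prune_total.inc(topic=topic)
    mesh_peers.set(len(mesh[topic]), topic=topic)
```

The counters only ever increase, while the gauge is rewritten from the mesh state after every change, so a scrape always observes the current mesh size.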
| 78 | + |
| 79 | +**Histograms** are updated when: |
| 80 | +- `gossipsub_peer_score`: During peer scoring operations (recommended buckets: `[-100, -10, -1, 0, 1, 10, 100, +Inf]`) |
| 81 | +- `*_latency_seconds`: When latency measurements are available (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, +Inf]`) |
| 82 | +- `*_duration_seconds`: When timing operations complete (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf]`) |
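As a rough illustration of how an observation lands in the recommended buckets, the sketch below models a histogram with only the standard library. The `Histogram` class and its method names are assumptions made for this example. The bucket boundaries are the recommended `*_duration_seconds` values listed above, and each boundary is treated as an inclusive upper bound, matching the `le` label used in Prometheus output.

```python
import bisect
import math

# Recommended buckets for `*_duration_seconds`, as listed above.
DURATION_BUCKETS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, math.inf]


class Histogram:
    """Minimal histogram model (hypothetical helper, not a library API)."""

    def __init__(self, name, upper_bounds):
        self.name = name
        self.upper_bounds = sorted(upper_bounds)
        # Always keep a catch-all +Inf bucket as the last entry.
        if not self.upper_bounds or self.upper_bounds[-1] != math.inf:
            self.upper_bounds.append(math.inf)
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value ("le" semantics).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as exported."""
        running = 0
        result = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


heartbeat = Histogram("gossipsub_heartbeat_duration_seconds", DURATION_BUCKETS)
for seconds in (0.0005, 0.001, 0.003, 2.0):
    heartbeat.observe(seconds)
```

Exported bucket counts are cumulative, so the `+Inf` bucket always equals the total observation count.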
| 83 | + |
| 84 | +## Prometheus Export Format |
| 85 | + |
| 86 | +Implementations that support these metrics MUST export them in the Prometheus exposition format. |
| 87 | + |
| 88 | +### Example Prometheus Output |
| 89 | + |
| 90 | +``` |
| 91 | +# HELP gossipsub_mesh_peers_total Current number of peers in the mesh for each topic |
| 92 | +# TYPE gossipsub_mesh_peers_total gauge |
| 93 | +gossipsub_mesh_peers_total{topic="ipfs-dht"} 8 |
| 94 | +gossipsub_mesh_peers_total{topic="libp2p-announce"} 12 |
| 95 | + |
| 96 | +# HELP gossipsub_message_received_total Total messages received for processing |
| 97 | +# TYPE gossipsub_message_received_total counter |
| 98 | +gossipsub_message_received_total{topic="ipfs-dht",validation_result="accept"} 1543 |
| 99 | +gossipsub_message_received_total{topic="ipfs-dht",validation_result="reject"} 23 |
| 100 | + |
| 101 | +# HELP gossipsub_heartbeat_duration_seconds Time spent processing each heartbeat operation |
| 102 | +# TYPE gossipsub_heartbeat_duration_seconds histogram |
| 103 | +gossipsub_heartbeat_duration_seconds_bucket{le="0.001"} 45 |
| 104 | +gossipsub_heartbeat_duration_seconds_bucket{le="0.005"} 123 |
| 105 | +gossipsub_heartbeat_duration_seconds_bucket{le="+Inf"} 150 |
| 106 | +gossipsub_heartbeat_duration_seconds_sum 0.456 |
| 107 | +gossipsub_heartbeat_duration_seconds_count 150 |
| 108 | +``` |
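Output like the example above could be produced by a small renderer such as the following sketch. The function names `escape_label_value` and `render_samples` are hypothetical. The escaping of backslash, double quote, and line feed in label values follows the Prometheus text format. Escaping of the HELP text is left out for brevity.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_samples(name, metric_type, help_text, samples):
    """Render one metric family as Prometheus text.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_samples(
    "gossipsub_mesh_peers_total",
    "gauge",
    "Current number of peers in the mesh for each topic",
    [({"topic": "ipfs-dht"}, 8), ({"topic": "libp2p-announce"}, 12)],
)
print(text, end="")
```

Topic names are chosen by the application, so escaping label values matters in practice: an unescaped quote or newline in a topic name would corrupt the scrape output.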
| 109 | + |
| 110 | +## Security Considerations |
| 111 | + |
| 112 | +- TODO: Cardinality attack: malicious peers could cause high metric cardinality by creating many topics or using diverse peer IDs. |
| 113 | +- TODO: Information disclosure: topic names in metrics may reveal sensitive information about network usage patterns. |
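One possible way to bound the cardinality of the `topic` label is sketched below. It is not part of this specification: the `TopicLabelLimiter` class, the `max_topics` parameter, and the `"other"` overflow value are all assumptions made for illustration. The idea is to keep the first `max_topics` distinct topics as label values and fold every later topic into a single overflow value.

```python
class TopicLabelLimiter:
    """Caps the number of distinct `topic` label values (hypothetical helper)."""

    def __init__(self, max_topics, overflow_label="other"):
        self.max_topics = max_topics
        self.overflow_label = overflow_label
        self.seen = set()

    def label_for(self, topic):
        """Return the label value to use for `topic`."""
        if topic in self.seen:
            return topic
        if len(self.seen) < self.max_topics:
            self.seen.add(topic)
            return topic
        # Past the cap: fold every further topic into one label value.
        return self.overflow_label


limiter = TopicLabelLimiter(max_topics=2)
labels = [limiter.label_for(t) for t in ("blocks", "txs", "spam-1", "spam-2", "blocks")]
print(labels)
```

A first-come cap like this lets an attacker who joins early occupy the slots, and a topic literally named `other` would be indistinguishable from the overflow value. An allowlist of topics the node subscribes to may therefore be a better fit for some deployments.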