Topograph is a component that exposes the underlying physical network topology of a cluster so that a workload manager can make network-topology-aware scheduling decisions. It consists of four major components:
4
4
@@ -7,14 +7,14 @@ Topograph is a component designed to expose the underlying physical network topo
The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, OCI, GCP, CoreWeave, and bare metal, with plans to add support for Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several subsequent API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator.
15
16
16
17
### 2. API Server
17
-
18
18
The API Server listens for network topology configuration requests on a specific port. When a request is received, the server triggers the Topology Generator to populate the configuration.
19
19
20
20
The API Server exposes two endpoints: one for synchronous requests and one for asynchronous requests.
@@ -23,7 +23,6 @@ The API Server exposes two endpoints: one for synchronous requests and one for a
23
23
- In the asynchronous mode, the API Server promptly returns a "202 Accepted" response to the HTTP request. It then begins generating and serializing the topology configuration.
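The asynchronous mode can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Topograph's implementation: the class `AsyncTopologyService`, its method names, and the use of a UUID as the request ID are all hypothetical.

```python
import uuid


class AsyncTopologyService:
    """Toy model of the asynchronous mode: accept first, generate later."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical callable: payload -> config text
        self._pending = {}         # request ID -> request payload
        self._results = {}         # request ID -> generated config

    def submit(self, payload):
        """Accept a request and immediately return (HTTP status, request ID)."""
        uid = str(uuid.uuid4())
        self._pending[uid] = payload
        return 202, uid

    def run_pending(self):
        """Generate configs for accepted requests.

        A real server would do this in the background after responding.
        """
        for uid, payload in list(self._pending.items()):
            self._results[uid] = self._generate(payload)
            del self._pending[uid]

    def result(self, uid):
        """Return the generated config, or None if it is unknown or not ready."""
        return self._results.get(uid)
```

The point of the sketch is that the caller gets its "202 Accepted" response before any topology work happens, and fetches the result later using the returned ID.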
24
24
25
25
### 3. Topology Generator
26
-
27
26
The Topology Generator is the central component that manages the overall network topology of the cluster. It performs the following functions:
28
27
29
28
- **Notification Handling:** Receives notifications from the API Server.
@@ -34,119 +33,84 @@ The Topology Generator is the central component that manages the overall network
34
33
The Node Observer is used when the Topology Generator is deployed in a Kubernetes cluster. It monitors changes in the cluster nodes.
35
34
If a node's status changes (e.g., a node goes down or comes up), the Node Observer sends a request to the API Server to generate a new topology configuration.
36
35
37
-
## Supported Environments
38
-
39
-
Topograph functions using the concepts of `provider` and `engine`. Here, a `provider` refers to a CSP, and an `engine` denotes a scheduling system such as SLURM or Kubernetes.
40
-
41
-
### SLURM Engine
42
-
43
-
For the SLURM engine, topograph supports the following CSPs:
44
-
- AWS
45
-
- OCI
46
-
- GCP
47
-
- CoreWeave
48
-
- Bare metal
49
-
50
-
### Kubernetes Engine
51
-
52
-
Support for the Kubernetes engine is currently in the development stage.
53
-
54
-
### Test Provider and Engine
55
-
56
-
There is a special *provider* and *engine* named `test`, which supports both SLURM and Kubernetes. This configuration returns static results and is primarily used for testing purposes.
57
-
58
36
## Workflow
59
37
60
-
- The API Server listens on the port and notifies the Topology Generator about incoming requests.
38
+
- The API Server listens on the port and notifies the Topology Generator about incoming requests. In Kubernetes, the incoming requests are sent by the Node Observer, which watches for changes in node status.
61
39
- The Topology Generator receives the notification and attempts to gather the current network topology of the cluster.
62
40
- The Topology Generator instructs the CSP Connector to retrieve the network topology from the CSP.
63
41
- The CSP Connector fetches the topology and translates it from the CSP-specific format to an internal format.
64
42
- The Topology Generator converts the internal format into the format expected by the user cluster (e.g., SLURM or Kubernetes).
65
-
- The Topology Generator returns the network topology configuration to the API Server, which then relays it back to the requester.
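The translation steps in this workflow can be sketched as a two-stage pipeline. The sketch rests on assumptions: the CSP input shape (each node mapped to its switch path from leaf to root), the dictionary-based internal format, and the function names are all hypothetical. Only the `SwitchName=`, `Switches=`, and `Nodes=` keywords come from SLURM's tree topology format.

```python
def to_internal(csp_paths):
    """Translate {node: [leaf_switch, ..., root_switch]} into an internal tree.

    The internal format here is {switch: {"switches": [...], "nodes": [...]}}.
    """
    tree = {}
    for node, path in sorted(csp_paths.items()):
        for i, switch in enumerate(path):
            entry = tree.setdefault(switch, {"switches": [], "nodes": []})
            if i == 0:
                # The first switch in the path is the one the node attaches to.
                entry["nodes"].append(node)
            else:
                child = path[i - 1]
                if child not in entry["switches"]:
                    entry["switches"].append(child)
    return tree


def to_slurm_tree(tree):
    """Render the internal tree as SLURM tree-topology lines."""
    lines = []
    for switch in sorted(tree):
        entry = tree[switch]
        parts = ["SwitchName=" + switch]
        if entry["switches"]:
            parts.append("Switches=" + ",".join(entry["switches"]))
        if entry["nodes"]:
            parts.append("Nodes=" + ",".join(entry["nodes"]))
        lines.append(" ".join(parts))
    return "\n".join(lines)


def generate(csp_paths):
    """CSP-specific input -> internal format -> engine-specific output."""
    return to_slurm_tree(to_internal(csp_paths))
```

Keeping the internal format separate from both ends means a new provider or a new engine only needs one new translation function.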
66
43
67
-
## Topograph Installation and Configuration
68
-
Topograph can operate as a standalone service within SLURM clusters or be deployed in Kubernetes clusters.
69
-
70
-
### Topograph as a Standalone Service
71
-
Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.
72
-
73
-
#### Configuration
74
-
The default configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml). It includes settings for:
75
-
- HTTP endpoint for the Topology Generator
76
-
- SSL/TLS connection
77
-
- environment variables
78
-
79
-
By default, SSL/TLS is disabled, but the server certificate and key are generated during package installation.
80
-
81
-
The configuration file also includes an optional section for environment variables. When specified, these variables are added to the shell environment. Note that the `PATH` variable, if provided, is appended to the existing `PATH`.
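The `PATH` handling described above can be pictured as follows. This is a minimal sketch, assuming the variables are merged into a dictionary-style environment; the function name `apply_env` and the example directory `/opt/slurm/bin` are hypothetical.

```python
import os


def apply_env(base, extra):
    """Return a copy of `base` with `extra` applied.

    PATH is appended to the existing value; every other variable is set as given.
    """
    env = dict(base)
    for name, value in extra.items():
        if name == "PATH" and env.get("PATH"):
            env["PATH"] = env["PATH"] + os.pathsep + value
        else:
            env[name] = value
    return env


merged = apply_env(
    {"PATH": "/usr/bin"},
    {"PATH": "/opt/slurm/bin", "SLURM_CONF": "/etc/slurm/slurm.conf"},
)
```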
82
-
83
-
#### Service Management
84
-
To enable and start the service, run the following commands:
Topograph accepts its configuration file path using the `-c` command-line parameter. The configuration file is a YAML document. A sample configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml).
46
+
47
+
The configuration file supports the following parameters:
48
+
```yaml
# serving topograph endpoint
http:
  # port: specifies the port on which the API server will listen (required).
  port: 49021
  # ssl: enables HTTPS protocol if set to `true` (optional).
  ssl: false

# request_aggregation_delay: defines the delay before processing a request (required).
# Topograph aggregates multiple sequential requests within this delay into a single request,
# processing only if no new requests arrive during the specified duration.
request_aggregation_delay: 15s

# forward_service_url: specifies the URL of an external gRPC service
# to which requests are forwarded (optional).
# This can be useful for testing or integration with external systems.
# See protos/topology.proto for details.
# forward_service_url:

# page_size: sets the page size for topology requests against a CSP API (optional).
page_size: 100

# ssl: specifies the paths to the TLS certificate, private key,
# and CA certificate (required if `http.ssl=true`).
ssl:
  cert: /etc/topograph/ssl/server-cert.pem
  key: /etc/topograph/ssl/server-key.pem
  ca_cert: /etc/topograph/ssl/ca-cert.pem

# credentials_path: specifies the path to a file containing CSP credentials (optional).
# credentials_path:

# env: environment variable names and values to inject into Topograph's shell (optional).
# The `PATH` variable, if provided, will append the specified value to the existing `PATH`.
env:
  # SLURM_CONF: /etc/slurm/slurm.conf
  # PATH:
```
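The effect of `request_aggregation_delay` can be worked through on a list of arrival times. This is a sketch of one plausible reading of the comment in the configuration file, not Topograph's actual code: the function name `aggregate` is hypothetical, and treating a gap exactly equal to the delay as the start of a new batch is an assumption.

```python
def aggregate(arrivals, delay):
    """Group ascending arrival times (in seconds) into processed batches.

    A batch is processed once `delay` seconds pass without a new request.
    Returns a list of (processing_time, request_count) tuples.
    """
    batches = []
    count = 0
    last = None
    for t in arrivals:
        if last is not None and t - last >= delay:
            # The quiet period elapsed before this request arrived,
            # so the earlier requests were already processed as one batch.
            batches.append((last + delay, count))
            count = 0
        count += 1
        last = t
    if last is not None:
        batches.append((last + delay, count))
    return batches


# Requests at 0 s, 5 s and 12 s collapse into one run at 27 s;
# the request at 40 s is processed on its own at 55 s.
print(aggregate([0, 5, 12, 40], 15))
```

With a 15-second delay, a burst of node status changes therefore triggers a single topology generation rather than one per change.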
94
86
95
-
To disable and stop the service, run the following commands:
96
-
```bash
systemctl stop topograph.service
systemctl disable topograph.service
systemctl daemon-reload
```
101
-
102
-
#### Verifying Health
103
-
To verify the service is healthy, you can use the following command:
104
-
105
-
```bash
curl http://localhost:49021/healthz
```
108
-
109
-
#### Using Toposim
110
-
To test the service on a simulated cluster, first add the following line to `/etc/topograph/topograph-config.yaml` so that any topology requests are forwarded to toposim.
111
-
```yaml
forward_service_url: dns:localhost:49025
```
114
-
Then run the topograph service as normal.
115
-
116
-
Then start the toposim service, setting the path to the test model that you want to use in the simulation:
You can then verify the topology results via simulation by querying topograph using the `test` provider and engine, and specifying the test model path as a parameter to the provider. If you want to view the tree topology, then use the command:
Topograph operates with two primary concepts: `provider` and `engine`. A `provider` represents a CSP or a similar environment, while an `engine` refers to a scheduling system such as SLURM or Kubernetes.
130
90
131
-
You can query the results of either topology request with:
Note the path specified in the topograph query should point to the same model as provided to toposim.
98
+
For detailed information on supported engines, see:
99
+
- [SLURM](./docs/slurm.md)
100
+
- [Kubernetes](./docs/k8s.md)
137
101
138
-
#### Using the Cluster Topology Generator
102
+
## Using Topograph
139
103
140
-
The Cluster Topology Generator offers three endpoints for interacting with the service. Below are the details of each endpoint:
104
+
Topograph offers three endpoints for interacting with the service. Below are the details of each endpoint:
141
105
142
-
##### 1. Health Endpoint
106
+
### 1. Health Endpoint
143
107
144
108
- **URL:** `http://<server>:<port>/healthz`
- **Description:** This endpoint verifies the service status. It returns a "200 OK" HTTP response if the service is operational.
146
110
147
-
##### 2. Topology Request Endpoint
111
+
### 2. Topology Request Endpoint
148
112
149
-
- **URL:** `http(s)://<server>:<port>/v1/generate`
113
+
- **URL:** `http://<server>:<port>/v1/generate`
- **Description:** This endpoint is used to request a new cluster topology.
- **Payload:** The payload is a JSON object that includes the following fields:
  - **provider name**: (mandatory) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `cw`, `baremetal` or `test`.
@@ -162,7 +126,7 @@ The Cluster Topology Generator offers three endpoints for interacting with the s
162
126
- **topology_config_path**: (mandatory) A string specifying the key for the topology config in the ConfigMap.
- **topology_configmap_name**: (mandatory) A string specifying the name of the ConfigMap containing the topology config.
- **topology_configmap_namespace**: (mandatory) A string specifying the namespace of the ConfigMap containing the topology config.
165
-
- **nodes**: (optional) An array of regions mapping instance IDs to node names.
129
+
- **nodes**: (optional) An array of regions mapping instance IDs to node names.
166
130
167
131
Example:
168
132
@@ -205,9 +169,9 @@ The Cluster Topology Generator offers three endpoints for interacting with the s
205
169
206
170
- **Response:** This endpoint immediately returns a "202 Accepted" status with a unique request ID if the request is valid. If not, it returns an appropriate error code.
207
171
208
-
##### 3. Topology Result Endpoint
172
+
### 3. Topology Result Endpoint
209
173
210
-
- **URL:** `http(s)://<server>:<port>/v1/topology`
174
+
- **URL:** `http://<server>:<port>/v1/topology`
- **Description:** This endpoint retrieves the result of a topology request.
- **URL Query Parameters:**
  - **uid**: Specifies the request ID returned by the topology request endpoint.
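A client that uses the request and result endpoints together might look like the sketch below. It is illustrative only: the helper names `endpoint` and `wait_for_topology` are hypothetical, and the assumption that the result endpoint returns a non-200 status until the topology is ready is not confirmed by this document. The `fetch` callable is injected so that any HTTP library can be used.

```python
import time
from urllib.parse import urlencode


def endpoint(server, port, path, **query):
    """Build an endpoint URL such as http://<server>:<port>/v1/topology?uid=..."""
    url = f"http://{server}:{port}{path}"
    if query:
        url += "?" + urlencode(query)
    return url


def wait_for_topology(fetch, url, attempts=5, interval=1.0, sleep=time.sleep):
    """Poll `url` until `fetch(url)` returns status 200 or attempts run out.

    `fetch` is a caller-supplied function returning (status, body).
    Returns the body on success, or None if the result never became ready.
    """
    for attempt in range(attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < attempts - 1:
            sleep(interval)
    return None
```

For example, `endpoint("localhost", 49021, "/v1/topology", uid=request_id)` builds the URL to poll, where `request_id` is the ID returned by the topology request endpoint.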
The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:
In this setup, the `<script>` would contain the curl command to call the endpoint:
236
-
237
-
```bash
curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate
```
240
-
241
-
We provide the [create-topology-update-script.sh](scripts/create-topology-update-script.sh) script, which performs the steps outlined above: it creates the topology update script and registers it with `strigger`.
This automation ensures that your cluster topology is updated and SLURM configuration is reloaded whenever there are changes in node status, maintaining an up-to-date cluster configuration.
In Kubernetes, Topograph performs two main actions:
4
+
5
+
- Creates a ConfigMap containing the topology information.
6
+
- Applies node labels that define the node’s position within the cloud topology. For instance, if a node connects to switch S1, which connects to switch S2, and then to switch S3, Topograph will label the node with the following:
For the SLURM engine, Topograph supports [tree](https://slurm.schedmd.com/topology.conf.html#SECTION_topology/tree) and [block](https://slurm.schedmd.com/topology.conf.html#SECTION_topology/block) topology configurations.
4
+
5
+
### Test Provider and Engine
6
+
There is a special *provider* and *engine* named `test`, which supports both SLURM and Kubernetes. This configuration returns static results and is primarily used for testing purposes.
7
+
8
+
## Installation and Configuration
9
+
Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.
10
+
11
+
The configuration file and certificates created by the installer are located in the `/etc/topograph` directory.
12
+
13
+
#### Service Management
14
+
To enable and start the service, run the following commands:
To disable and stop the service, run the following commands:
26
+
```bash
systemctl stop topograph.service
systemctl disable topograph.service
systemctl daemon-reload
```
31
+
32
+
#### Verifying Health
33
+
To verify the service is healthy, you can use the following command:
34
+
35
+
```bash
curl http://localhost:49021/healthz
```
38
+
39
+
#### Using Toposim
40
+
To test the service on a simulated cluster, first add the following line to `/etc/topograph/topograph-config.yaml` so that any topology requests are forwarded to toposim.
41
+
```yaml
forward_service_url: dns:localhost:49025
```
44
+
Then run the topograph service as normal.
45
+
46
+
Then start the toposim service, setting the path to the test model that you want to use in the simulation:
You can then verify the topology results via simulation by querying topograph using the `test` provider and engine, and specifying the test model path as a parameter to the provider.
52
+
If you want to view the tree topology, then use the command:
Note the path specified in the topograph query should point to the same model as provided to toposim.
67
+
68
+
#### Automated Solution for SLURM
69
+
70
+
The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:
In this setup, the `<script>` would contain the curl command to call the endpoint:
77
+
78
+
```bash
curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate
```
81
+
82
+
We provide the [create-topology-update-script.sh](../scripts/create-topology-update-script.sh) script, which performs the steps outlined above: it creates the topology update script and registers it with `strigger`.
This automation ensures that your cluster topology is updated and SLURM configuration is reloaded whenever there are changes in node status, maintaining an up-to-date cluster configuration.