
Commit 37297cf

authored
update docs (#17)
Signed-off-by: Dmitry Shmulevich <[email protected]>
1 parent 8640a74 commit 37297cf

File tree

4 files changed

+178
-133
lines changed

README.md

Lines changed: 64 additions & 133 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# topograph
1+
# Topograph
22

33
Topograph is a component designed to expose the underlying physical network topology of a cluster so that a workload manager can make network-topology-aware scheduling decisions. It consists of four major components:
44

@@ -7,14 +7,14 @@ Topograph is a component designed to expose the underlying physical network topo
77
3. **Topology Generator**
88
4. **Node Observer**
99

10+
<p align="center"><img src="docs/assets/design.png" width="600" alt="Design"></p>
11+
1012
## Components
1113

1214
### 1. CSP Connector
13-
1415
The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, OCI, GCP, CoreWeave, and bare metal, with plans to add support for Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several successive API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator.
1516

1617
### 2. API Server
17-
1818
The API Server listens for network topology configuration requests on a specific port. When a request is received, the server triggers the Topology Generator to populate the configuration.
1919

2020
The API Server exposes two endpoints: one for synchronous requests and one for asynchronous requests.
@@ -23,7 +23,6 @@ The API Server exposes two endpoints: one for synchronous requests and one for a
2323
- In the asynchronous mode, the API Server promptly returns a "202 Accepted" response to the HTTP request. It then begins generating and serializing the topology configuration.
2424

2525
### 3. Topology Generator
26-
2726
The Topology Generator is the central component that manages the overall network topology of the cluster. It performs the following functions:
2827

2928
- **Notification Handling:** Receives notifications from the API Server.
@@ -34,119 +33,84 @@ The Topology Generator is the central component that manages the overall network
3433
The Node Observer is used when the Topology Generator is deployed in a Kubernetes cluster. It monitors changes in the cluster nodes.
3534
If a node's status changes (e.g., a node goes down or comes up), the Node Observer sends a request to the API Server to generate a new topology configuration.
3635

37-
## Supported Environments
38-
39-
Topograph functions using the concepts of `provider` and `engine`. Here, a `provider` refers to a CSP, and an `engine` denotes a scheduling system such as SLURM or Kubernetes.
40-
41-
### SLURM Engine
42-
43-
For the SLURM engine, topograph supports the following CSPs:
44-
- AWS
45-
- OCI
46-
- GCP
47-
- CoreWeave
48-
- Bare metal
49-
50-
### Kubernetes Engine
51-
52-
Support for the Kubernetes engine is currently in the development stage.
53-
54-
### Test Provider and Engine
55-
56-
There is a special *provider* and *engine* named `test`, which supports both SLURM and Kubernetes. This configuration returns static results and is primarily used for testing purposes.
57-
5836
## Workflow
5937

60-
- The API Server listens on the port and notifies the Topology Generator about incoming requests.
38+
- The API Server listens on the port and notifies the Topology Generator about incoming requests. In Kubernetes, the incoming requests are sent by the Node Observer, which watches for changes in node status.
6139
- The Topology Generator receives the notification and attempts to gather the current network topology of the cluster.
6240
- The Topology Generator instructs the CSP Connector to retrieve the network topology from the CSP.
6341
- The CSP Connector fetches the topology and translates it from the CSP-specific format to an internal format.
6442
- The Topology Generator converts the internal format into the format expected by the user cluster (e.g., SLURM or Kubernetes).
65-
- The Topology Generator returns the network topology configuration to the API Server, which then relays it back to the requester.
6643

67-
## Topograph Installation and Configuration
68-
Topograph can operate as a standalone service within SLURM clusters or be deployed in Kubernetes clusters.
69-
70-
### Topograph as a Standalone Service
71-
Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.
72-
73-
#### Configuration
74-
The default configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml). It includes settings for:
75-
- HTTP endpoint for the Topology Generator
76-
- SSL/TLS connection
77-
- environment variables
78-
79-
By default, SSL/TLS is disabled, but the server certificate and key are generated during package installation.
80-
81-
The configuration file also includes an optional section for environment variables. When specified, these variables are added to the shell environment. Note that the `PATH` variable, if provided, is appended to the existing `PATH`.
82-
83-
#### Service Management
84-
To enable and start the service, run the following commands:
85-
```bash
86-
systemctl enable topograph.service
87-
systemctl start topograph.service
88-
```
89-
90-
Upon starting, the service executes:
91-
```bash
92-
/usr/local/bin/topograph -c /etc/topograph/topograph-config.yaml
44+
## Configuration
45+
Topograph accepts its configuration file path using the `-c` command-line parameter. The configuration file is a YAML document. A sample configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml).
46+
47+
The configuration file supports the following parameters:
48+
```yaml
49+
# serving topograph endpoint
50+
http:
51+
# port: specifies the port on which the API server will listen (required).
52+
port: 49021
53+
# ssl: enables HTTPS protocol if set to `true` (optional).
54+
ssl: false
55+
56+
# request_aggregation_delay: defines the delay before processing a request (required).
57+
# Topograph aggregates multiple sequential requests within this delay into a single request,
58+
# processing only if no new requests arrive during the specified duration.
59+
request_aggregation_delay: 15s
60+
61+
# forward_service_url: specifies the URL of an external gRPC service
62+
# to which requests are forwarded (optional).
63+
# This can be useful for testing or integration with external systems.
64+
# See protos/topology.proto for details.
65+
# forward_service_url:
66+
67+
# page_size: sets the page size for topology requests against a CSP API (optional).
68+
page_size: 100
69+
70+
# ssl: specifies the paths to the TLS certificate, private key,
71+
# and CA certificate (required if `http.ssl=true`).
72+
ssl:
73+
cert: /etc/topograph/ssl/server-cert.pem
74+
key: /etc/topograph/ssl/server-key.pem
75+
ca_cert: /etc/topograph/ssl/ca-cert.pem
76+
77+
# credentials_path: specifies the path to a file containing CSP credentials (optional).
78+
# credentials_path:
79+
80+
# env: environment variable names and values to inject into Topograph's shell (optional).
81+
# The `PATH` variable, if provided, will append the specified value to the existing `PATH`.
82+
env:
83+
# SLURM_CONF: /etc/slurm/slurm.conf
84+
# PATH:
9385
```
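The aggregation behaviour described for `request_aggregation_delay` resembles a debounce. The following is a minimal Python sketch of that idea, not Topograph's actual implementation: the `RequestAggregator` class, its method names, and the explicit `now` timestamps are hypothetical, and the choice to keep only the most recent request in a burst is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RequestAggregator:
    """Collapse a burst of requests into one (hypothetical illustration)."""

    delay: float  # seconds; plays the role of request_aggregation_delay
    _pending: Optional[dict] = None
    _last_arrival: float = 0.0
    processed: List[dict] = field(default_factory=list)

    def submit(self, request: dict, now: float) -> None:
        # Assumption: a newer request replaces the pending one
        # and restarts the quiet period.
        self._pending = request
        self._last_arrival = now

    def tick(self, now: float) -> bool:
        # Process the pending request only after `delay` seconds
        # have passed without a new arrival.
        if self._pending is not None and now - self._last_arrival >= self.delay:
            self.processed.append(self._pending)
            self._pending = None
            return True
        return False


if __name__ == "__main__":
    agg = RequestAggregator(delay=15.0)
    agg.submit({"id": 1}, now=0.0)
    agg.submit({"id": 2}, now=10.0)  # arrives within the delay, restarts the wait
    print(agg.tick(now=20.0))        # only 10 s of quiet so far
    print(agg.tick(now=25.0))        # 15 s of quiet: the burst is processed once
```

Passing timestamps explicitly keeps the sketch deterministic; a real service would more likely use a timer that is reset on each arrival.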
9486

95-
To disable and stop the service, run the following commands:
96-
```bash
97-
systemctl stop topograph.service
98-
systemctl disable topograph.service
99-
systemctl daemon-reload
100-
```
101-
102-
#### Verifying Health
103-
To verify the service is healthy, you can use the following command:
104-
105-
```bash
106-
curl http://localhost:49021/healthz
107-
```
108-
109-
#### Using Toposim
110-
To test the service on a simulated cluster, first add the following line to `/etc/topograph/topograph-config.yaml` so that any topology requests are forwarded to toposim.
111-
```bash
112-
forward_service_url: dns:localhost:49025
113-
```
114-
Then run the topograph service as normal.
115-
116-
You must then start the toposim service as such, setting the path to the test model that you want to use in simulation:
117-
```bash
118-
/usr/local/bin/topograph -m /usr/local/bin/tests/models/<cluster-model>.yaml
119-
```
120-
121-
You can then verify the topology results via simulation by querying topograph using the `test` provider and engine, and specifying the test model path as a parameter to the provider. If you want to view the tree topology, then use the command:
122-
```bash
123-
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/<cluster-model>.yaml"}},"engine":{"name":"test"}}' http://localhost:49021/v1/generate)
124-
```
87+
## Supported Environments
12588

126-
And if you want to view the block topology (with specified block sizes), use the command:
127-
```bash
128-
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/<cluster-model>.yaml"}},"engine":{"name":"test", "params":{"plugin":"topology/block", "block_sizes": <block-sizes>}}}' http://localhost:49021/v1/generate)
129-
```
89+
Topograph operates with two primary concepts: `provider` and `engine`. A `provider` represents a CSP or a similar environment, while an `engine` refers to a scheduling system such as SLURM or Kubernetes.
13090

131-
You can query the results of either topology request with:
132-
```bash
133-
curl -s "http://localhost:49021/v1/topology?uid=$id"
134-
```
91+
Currently supported providers:
92+
- AWS
93+
- OCI
94+
- GCP
95+
- CoreWeave
96+
- Bare metal
13597

136-
Note the path specified in the topograph query should point to the same model as provided to toposim.
98+
For detailed information on supported engines, see:
99+
- [SLURM](./docs/slurm.md)
100+
- [Kubernetes](./docs/k8s.md)
137101

138-
#### Using the Cluster Topology Generator
102+
## Using Topograph
139103

140-
The Cluster Topology Generator offers three endpoints for interacting with the service. Below are the details of each endpoint:
104+
Topograph offers three endpoints for interacting with the service. Below are the details of each endpoint:
141105

142-
##### 1. Health Endpoint
106+
### 1. Health Endpoint
143107

144108
- **URL:** `http://<server>:<port>/healthz`
145109
- **Description:** This endpoint verifies the service status. It returns a "200 OK" HTTP response if the service is operational.
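As a rough illustration of how a client might use the health endpoint, here is a small Python sketch built on the standard library. The function name `is_healthy` and the `timeout` parameter are hypothetical; only the `/healthz` path and the "200 OK" behaviour come from the description above.

```python
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET <base_url>/healthz answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urllib's URLError and HTTPError both subclass OSError, so this
        # covers refused connections, timeouts, and non-2xx responses.
        return False
```

For example, `is_healthy("http://localhost:49021")` would check a service listening on the port used in the sample configuration.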
146110

147-
##### 2. Topology Request Endpoint
111+
### 2. Topology Request Endpoint
148112

149-
- **URL:** `http(s)://<server>:<port>/v1/generate`
113+
- **URL:** `http://<server>:<port>/v1/generate`
150114
- **Description:** This endpoint is used to request a new cluster topology.
151115
- **Payload:** The payload is a JSON object that includes the following fields:
152116
- **provider name**: (mandatory) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `cw`, `baremetal` or `test`.
@@ -162,7 +126,7 @@ The Cluster Topology Generator offers three endpoints for interacting with the s
162126
- **topology_config_path**: (mandatory) A string specifying the key for the topology config in the ConfigMap.
163127
- **topology_configmap_name**: (mandatory) A string specifying the name of the ConfigMap containing the topology config.
164128
- **topology_configmap_namespace**: (mandatory) A string specifying the namespace of the ConfigMap containing the topology config.
165-
- **nodes**: (optional) An array of regions mapping instance IDs to node names.
129+
- **nodes**: (optional) An array of regions mapping instance IDs to node names.
166130

167131
Example:
168132

@@ -205,9 +169,9 @@ The Cluster Topology Generator offers three endpoints for interacting with the s
205169

206170
- **Response:** This endpoint immediately returns a "202 Accepted" status with a unique request ID if the request is valid. If not, it returns an appropriate error code.
207171

208-
##### 3. Topology Result Endpoint
172+
### 3. Topology Result Endpoint
209173

210-
- **URL:** `http(s)://<server>:<port>/v1/topology`
174+
- **URL:** `http://<server>:<port>/v1/topology`
211175
- **Description:** This endpoint retrieves the result of a topology request.
212176
- **URL Query Parameters:**
213177
- **uid**: Specifies the request ID returned by the topology request endpoint.
@@ -223,36 +187,3 @@ id=$(curl -s -X POST -H "Content-Type: application/json" -d @payload.json http:/
223187

224188
curl -s "http://localhost:49021/v1/topology?uid=$id"
225189
```
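The two-step exchange above (submit a request, then fetch the result by ID) could also be scripted. The Python sketch below uses only the standard library; the helper names are hypothetical. It assumes, as the `curl` example suggests, that the body of the `/v1/generate` response is the request ID, and it does not retry if the result is not ready yet.

```python
import json
import urllib.parse
import urllib.request


def build_payload(provider, engine, provider_params=None, engine_params=None):
    """Assemble the JSON body for /v1/generate."""
    payload = {"provider": {"name": provider}, "engine": {"name": engine}}
    if provider_params:
        payload["provider"]["params"] = provider_params
    if engine_params:
        payload["engine"]["params"] = engine_params
    return payload


def topology_url(base_url, uid):
    """Build the result URL, escaping the request ID as a query parameter."""
    return f"{base_url}/v1/topology?{urllib.parse.urlencode({'uid': uid})}"


def request_topology(base_url, payload):
    """POST the payload and return the request ID from the response body."""
    req = urllib.request.Request(
        f"{base_url}/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8").strip()


def fetch_topology(base_url, uid):
    """GET the result of an earlier topology request."""
    with urllib.request.urlopen(topology_url(base_url, uid)) as resp:
        return resp.read().decode("utf-8")
```

A caller would chain these as `uid = request_topology(base, build_payload("test", "test"))` followed by `fetch_topology(base, uid)`.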
226-
227-
#### Automated Solution for SLURM
228-
229-
The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:
230-
231-
```bash
232-
strigger --set --node --down --up --flags=perm --program=<script>
233-
```
234-
235-
In this setup, the `<script>` would contain the curl command to call the endpoint:
236-
237-
```bash
238-
curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate
239-
```
240-
241-
We provide the [create-topology-update-script.sh](scripts/create-topology-update-script.sh) script, which performs the steps outlined above: it creates the topology update script and registers it with the strigger.
242-
243-
The script accepts the following parameters:
244-
- **provider name** (aws, oci, gcp, cw, baremetal)
245-
- **path to the generated topology update script**
246-
- **path to the topology.conf file**
247-
248-
Usage:
249-
```bash
250-
create-topology-update-script.sh -p <provider name> -s <topology update script> -c <path to topology.conf>
251-
```
252-
253-
Example:
254-
```bash
255-
create-topology-update-script.sh -p aws -s /etc/slurm/update-topology-config.sh -c /etc/slurm/topology.conf
256-
```
257-
258-
This automation ensures that your cluster topology is updated and SLURM configuration is reloaded whenever there are changes in node status, maintaining an up-to-date cluster configuration.

docs/assets/design.png

240 KB

docs/k8s.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
# Topograph with Kubernetes
2+
3+
In Kubernetes, Topograph performs two main actions:
4+
5+
- Creates a ConfigMap containing the topology information.
6+
- Applies node labels that define the node’s position within the cloud topology. For instance, if a node connects to switch S1, which connects to switch S2, and then to switch S3, Topograph will label the node with the following:
7+
- `topology.kubernetes.io/network-level-1: S1`
8+
- `topology.kubernetes.io/network-level-2: S2`
9+
- `topology.kubernetes.io/network-level-3: S3`
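The labelling scheme in the example above can be expressed as a short mapping. This Python sketch is illustrative only: the function name `network_level_labels` is hypothetical, and it assumes the switch path is given in order from the switch nearest the node upward.

```python
def network_level_labels(switch_path):
    """Map an ordered switch path to topology label keys and values."""
    prefix = "topology.kubernetes.io/network-level-"
    # Levels are numbered from 1, starting at the switch closest to the node.
    return {
        f"{prefix}{level}": switch
        for level, switch in enumerate(switch_path, start=1)
    }
```

Calling `network_level_labels(["S1", "S2", "S3"])` yields the three labels listed in the example.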
10+
11+
## Configuration and Deployment
12+
TBD
13+
14+
## Validation and Testing
15+
TBD

docs/slurm.md

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
# Topograph with SLURM
2+
3+
For the SLURM engine, topograph supports [tree](https://slurm.schedmd.com/topology.conf.html#SECTION_topology/tree) and [block](https://slurm.schedmd.com/topology.conf.html#SECTION_topology/block) topology configurations.
4+
5+
### Test Provider and Engine
6+
There is a special *provider* and *engine* named `test`, which supports both SLURM and Kubernetes. This configuration returns static results and is primarily used for testing purposes.
7+
8+
## Installation and Configuration
9+
Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.
10+
11+
The configuration file and certificates created by the installer are located in the `/etc/topograph` directory.
12+
13+
#### Service Management
14+
To enable and start the service, run the following commands:
15+
```bash
16+
systemctl enable topograph.service
17+
systemctl start topograph.service
18+
```
19+
20+
Upon starting, the service executes:
21+
```bash
22+
/usr/local/bin/topograph -c /etc/topograph/topograph-config.yaml
23+
```
24+
25+
To disable and stop the service, run the following commands:
26+
```bash
27+
systemctl stop topograph.service
28+
systemctl disable topograph.service
29+
systemctl daemon-reload
30+
```
31+
32+
#### Verifying Health
33+
To verify the service is healthy, you can use the following command:
34+
35+
```bash
36+
curl http://localhost:49021/healthz
37+
```
38+
39+
#### Using Toposim
40+
To test the service on a simulated cluster, first add the following line to `/etc/topograph/topograph-config.yaml` so that any topology requests are forwarded to toposim.
41+
```yaml
42+
forward_service_url: dns:localhost:49025
43+
```
44+
Then run the topograph service as normal.
45+
46+
Next, start the toposim service, setting the path to the test model that you want to use in the simulation:
47+
```bash
48+
/usr/local/bin/topograph -m /usr/local/bin/tests/models/<cluster-model>.yaml
49+
```
50+
51+
You can then verify the topology results via simulation by querying topograph using the `test` provider and engine, and specifying the test model path as a parameter to the provider.
52+
If you want to view the tree topology, then use the command:
53+
```bash
54+
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/<cluster-model>.yaml"}},"engine":{"name":"test"}}' http://localhost:49021/v1/generate)
55+
```
56+
57+
And if you want to view the block topology (with specified block sizes), use the command:
58+
```bash
59+
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/<cluster-model>.yaml"}},"engine":{"name":"test", "params":{"plugin":"topology/block", "block_sizes": <block-sizes>}}}' http://localhost:49021/v1/generate)
60+
```
61+
62+
You can query the results of either topology request with:
63+
```bash
64+
curl -s "http://localhost:49021/v1/topology?uid=$id"
65+
```
66+
Note that the path specified in the topograph query must point to the same model as the one provided to toposim.
67+
68+
#### Automated Solution for SLURM
69+
70+
Topograph enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:
71+
72+
```bash
73+
strigger --set --node --down --up --flags=perm --program=<script>
74+
```
75+
76+
In this setup, the `<script>` would contain the curl command to call the endpoint:
77+
78+
```bash
79+
curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate
80+
```
81+
82+
We provide the [create-topology-update-script.sh](../scripts/create-topology-update-script.sh) script, which performs the steps outlined above: it creates the topology update script and registers it with `strigger`.
83+
84+
The script accepts the following parameters:
85+
- **provider name** (aws, oci, gcp, cw, baremetal)
86+
- **path to the generated topology update script**
87+
- **path to the topology.conf file**
88+
89+
Usage:
90+
```bash
91+
create-topology-update-script.sh -p <provider name> -s <topology update script> -c <path to topology.conf>
92+
```
93+
94+
Example:
95+
```bash
96+
create-topology-update-script.sh -p aws -s /etc/slurm/update-topology-config.sh -c /etc/slurm/topology.conf
97+
```
98+
99+
This automation ensures that your cluster topology is updated and SLURM configuration is reloaded whenever there are changes in node status, maintaining an up-to-date cluster configuration.
