Add mig support in specs
cmd-ntrf committed Mar 25, 2024
1 parent 2fb42fb commit a50ccda
Showing 7 changed files with 27 additions and 12 deletions.
1 change: 1 addition & 0 deletions aws/infrastructure.tf
@@ -207,6 +207,7 @@ locals {
cpus = data.aws_ec2_instance_type.instance_type[values.prefix].default_vcpus
ram = data.aws_ec2_instance_type.instance_type[values.prefix].memory_size
gpus = try(one(data.aws_ec2_instance_type.instance_type[values.prefix].gpus).count, 0)
mig = lookup(values, "mig", null)
}
}
}
1 change: 1 addition & 0 deletions azure/infrastructure.tf
@@ -170,6 +170,7 @@ locals {
cpus = local.vmsizes[values.type].vcpus
ram = local.vmsizes[values.type].ram
gpus = local.vmsizes[values.type].gpus
mig = lookup(values, "mig", null)
}
}
}
1 change: 1 addition & 0 deletions common/configuration/main.tf
@@ -125,6 +125,7 @@ locals {
public = tls_private_key.ed25519[values.prefix].public_key_openssh
}
}
mig = lookup(values.specs, "mig", null)
}
)
}
6 changes: 6 additions & 0 deletions common/configuration/puppet.yaml
@@ -122,6 +122,12 @@ runcmd:
- "(tar xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer && ./efa_installer.sh --yes --minimal)"
- rm -fr aws-efa-installer aws-efa-installer-latest.tar.gz
%{ endif }
%{ if mig != null }
# Install nvidia-mig-manager to enable MIG without NVIDIA drivers installed
- yum -y install https://github.com/NVIDIA/mig-parted/releases/download/v0.5.5/nvidia-mig-manager-0.5.5-1.x86_64.rpm
# It does not matter which config is selected, the goal here is to enable MIG before reboot
- nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yaml -c all-1g.5gb --mode-only
%{ endif }
%{ if cloud_provider == "gcp" }
# Google Cloud user-data fact generates a warning because its size is greater than what is allowed (<4096 bytes).
# We have no use for it, so we remove startup-script, user-data and user-data-encoding when running in GCE.
28 changes: 16 additions & 12 deletions docs/README.md
@@ -546,19 +546,23 @@ Optional attributes can be defined:
2. `image`: specification of the image to use for this instance type. (default: global [`image`](#46-image) value).
Refer to section [10.12 - Create a compute node image](#1012-create-a-compute-node-image) to learn how this attribute can
be leveraged to accelerate compute node configuration.
3. `disk_size`: size in gibibytes (GiB) of the instance's root disk containing
3. `disk_type`: type of the instance's root disk (default: see the next table).
| Provider | `disk_type` | `disk_size` (GiB) |
| -------- | :---------- | ----------------: |
| Azure |`Premium_LRS`| 30 |
| AWS | `gp2` | 10 |
| GCP | `pd-ssd` | 20 |
| OpenStack| `null` | 10 |
| OVH | `null` | 10 |
4. `disk_size`: size in gibibytes (GiB) of the instance's root disk containing
the operating system and service software
(default: see the next table).
4. `disk_type`: type of the instance's root disk (default: see the next table).
Default root disk's attribute value per provider:
| Provider | `disk_type` | `disk_size` (GiB) |
| -------- | :---------- | ----------------: |
| Azure |`Premium_LRS`| 30 |
| AWS | `gp2` | 10 |
| GCP | `pd-ssd` | 20 |
| OpenStack| `null` | 10 |
| OVH | `null` | 10 |
(default: see the previous table).
5. `mig`: hash map of [NVIDIA Multi-Instance GPU (MIG)](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) short profile names and count used to partition the instances' GPU, example for an A100:
```
mig = { 1g.5gb = 2, 2g.10gb = 1, 3g.20gb = 1 }
```
This is only functional when the [GPU is supported](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus),
and with x86-64 processors (see [NVIDIA/mig-parted issue #30](https://github.com/NVIDIA/mig-parted/issues/30)).
For some cloud providers, it possible to define additional attributes.
The following sections present the available attributes per provider.
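As a sketch of how the `mig` attribute documented in this hunk might be set, the fragment below adds it to one entry of an instance map. The `instances` variable name, the `gpu-node` key, and the `type`, `tags`, and `count` values are hypothetical placeholders, not values taken from this commit. The profile names are quoted on the assumption that HCL does not accept map keys beginning with a digit as bare identifiers; the README example in the hunk writes them unquoted.

```hcl
# Hypothetical excerpt of a cluster definition; only `mig` comes from this commit.
instances = {
  gpu-node = {
    # Placeholder flavor name: replace with an instance type that has a MIG-capable GPU.
    type  = "example-a100-flavor"
    tags  = ["node"]
    count = 1

    # Partition each GPU into two 1g.5gb, one 2g.10gb and one 3g.20gb instances.
    mig = { "1g.5gb" = 2, "2g.10gb" = 1, "3g.20gb" = 1 }
  }
}
```

Because each provider's `infrastructure.tf` reads the attribute with `lookup(values, "mig", null)`, an entry that omits `mig` resolves to `null`, and the `%{ if mig != null }` block added to `puppet.yaml` is then skipped.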
1 change: 1 addition & 0 deletions gcp/infrastructure.tf
@@ -180,6 +180,7 @@ locals {
cpus = data.external.machine_type[values["prefix"]].result["vcpus"]
ram = data.external.machine_type[values["prefix"]].result["ram"]
gpus = try(data.external.machine_type[values["prefix"]].result["gpus"], lookup(values, "gpu_count", 0))
mig = lookup(values, "mig", null)
}
}
}
1 change: 1 addition & 0 deletions openstack/infrastructure.tf
@@ -137,6 +137,7 @@ locals {
parseint(lookup(data.openstack_compute_flavor_v2.flavors[values.prefix].extra_specs, "resources:VGPU", "0"), 10),
parseint(split(":", lookup(data.openstack_compute_flavor_v2.flavors[values.prefix].extra_specs, "pci_passthrough:alias", "gpu:0"))[1], 10)
])
mig = lookup(values, "mig", null)
}
}
}