Sharding support #322

Closed · wants to merge 34 commits

Commits:
0c26587
Add install of nvidia-mig-manager and make it configure profile
cmd-ntrf Oct 16, 2023
59b4858
Enable mig support in gres.conf
cmd-ntrf Oct 18, 2023
0b96f67
Rename mig_profile to mig in specs
cmd-ntrf Oct 18, 2023
1976e91
Add missing $nodes in slurm controller
cmd-ntrf Oct 18, 2023
1c3f22a
Add missing $instances in slurm controller
cmd-ntrf Oct 18, 2023
9fe8963
Add mig profile to nodes.conf
cmd-ntrf Oct 18, 2023
6845ca7
Add missing s to gpus
cmd-ntrf Oct 18, 2023
0350777
Remove comment
cmd-ntrf Oct 18, 2023
df69d55
Fix nodes.conf
cmd-ntrf Oct 18, 2023
c7f5a82
Add missing File to gres.conf on controller
cmd-ntrf Oct 18, 2023
21478cf
Replace each by map
cmd-ntrf Oct 18, 2023
29d9692
Replace to_yaml by to_json
cmd-ntrf Oct 18, 2023
8cb2c5c
Fix mig-parted apply command
cmd-ntrf Oct 18, 2023
7de8522
Fix unless of mig-parted
cmd-ntrf Oct 18, 2023
cce45a8
Consider case where there are more than one gpu card
cmd-ntrf Oct 18, 2023
61b755d
Fix gres.conf for slurm controller
cmd-ntrf Oct 18, 2023
98dc79b
Fix gres.conf with multi-profiles
cmd-ntrf Oct 18, 2023
a004515
Update yum repo source for slurm
cmd-ntrf Oct 19, 2023
5515789
Limit slurm version to 23.02
cmd-ntrf Oct 19, 2023
01d990c
Fix nvidia-mig-manager exec requirements
cmd-ntrf Oct 19, 2023
8536379
Bump mig-manager version to 0.5.5
cmd-ntrf Jan 9, 2024
3e56bf4
Disable slurmd starting at boot
cmd-ntrf Jan 30, 2024
071e581
Define the parameter allowing slurm to reboot nodes
cmd-ntrf Jan 30, 2024
e8ce2a0
Disable nvidia-mig-manager service
cmd-ntrf Feb 1, 2024
459f30c
Move mig install and config in its own class
cmd-ntrf Feb 2, 2024
c940364
Rename mig config file puppet-config.yaml
cmd-ntrf Feb 2, 2024
5562e6d
Fix require config.yaml
cmd-ntrf Feb 2, 2024
04cff2d
Fix profile::gpu::install::mig
cmd-ntrf Feb 2, 2024
7b38475
Activate hooks for mig-manager and define custom hooks
cmd-ntrf Feb 6, 2024
1f64ab0
Add nvidia-persistenced to driver services in hooks.sh
cmd-ntrf Feb 6, 2024
2da54f1
Remove apply-exit hook and notify nvidia services in Puppet
cmd-ntrf Feb 6, 2024
664cfc0
Fix small issues
Feb 7, 2024
2fae8a4
Merge pull request #319 from etiennedub/mig
cmd-ntrf Feb 7, 2024
e30363d
Update template for shard
Feb 8, 2024
2 changes: 1 addition & 1 deletion data/common.yaml
@@ -234,7 +234,7 @@ profile::freeipa::mokey::access_tags: "%{alias('profile::users::ldap::access_tag
profile::freeipa::server::id_start: 60001
profile::software_stack::min_uid: "%{alias('profile::freeipa::server::id_start')}"

-profile::slurm::base::slurm_version: '22.05'
+profile::slurm::base::slurm_version: '23.02'
profile::slurm::base::os_reserved_memory: 512
profile::slurm::controller::autoscale_version: '0.5.1'

86 changes: 85 additions & 1 deletion site/profile/manifests/gpu.pp
@@ -81,7 +81,9 @@
}
}

-class profile::gpu::install::passthrough (Array[String] $packages) {
+class profile::gpu::install::passthrough (
+  Array[String] $packages,
+) {
$os = "rhel${::facts['os']['release']['major']}"
$arch = $::facts['os']['architecture']
if versioncmp($::facts['os']['release']['major'], '8') >= 0 {
@@ -96,6 +98,11 @@
path => ['/usr/bin'],
}

+  $mig_profile = lookup("terraform.instances.${facts['networking']['hostname']}.specs.mig")
+  if $mig_profile {
+    include profile::gpu::install::mig
+  }

package { $packages:
ensure => 'installed',
require => [
@@ -130,6 +137,83 @@
}
}

class profile::gpu::install::mig (
  String $mig_manager_version = '0.5.5',
) {
  $mig_profile = lookup("terraform.instances.${facts['networking']['hostname']}.specs.mig")
  $arch = $::facts['os']['architecture']

  package { 'nvidia-mig-manager':
    ensure   => 'latest',
    provider => 'rpm',
    name     => 'nvidia-mig-manager',
    source   => "https://github.com/NVIDIA/mig-parted/releases/download/v${mig_manager_version}/nvidia-mig-manager-${mig_manager_version}-1.${arch}.rpm",
  }

  service { 'nvidia-mig-manager':
    ensure  => stopped,
    enable  => false,
    require => Package['nvidia-mig-manager'],
  }

  file { '/etc/nvidia-mig-manager/puppet-config.yaml':
    require => Package['nvidia-mig-manager'],
    content => @("EOT")
      version: v1
      mig-configs:
        default:
          - devices: all
            mig-enabled: true
            mig-devices: ${to_json($mig_profile)}
      |EOT
  }

  file_line { 'nvidia-persistenced.service':
    ensure  => present,
    path    => '/etc/nvidia-mig-manager/hooks.sh',
    after   => 'driver_services=\(',
    line    => ' nvidia-persistenced.service',
    require => Package['nvidia-mig-manager'],
  }

  file { '/etc/nvidia-mig-manager/puppet-hooks.yaml':
    require => Package['nvidia-mig-manager'],
    content => @("EOT")
      version: v1
      hooks:
        pre-apply-mode:
          - workdir: "/etc/nvidia-mig-manager"
            command: "/bin/bash"
            args: ["-x", "-c", "source hooks.sh; stop_driver_services"]
          - workdir: "/etc/nvidia-mig-manager"
            command: "/bin/sh"
            args: ["-c", "systemctl -q is-active slurmd && systemctl stop slurmd || true"]
      |EOT
  }

  exec { 'nvidia-mig-parted apply':
    unless      => 'nvidia-mig-parted assert',
    require     => [
      Package['nvidia-mig-manager'],
      File['/etc/nvidia-mig-manager/puppet-config.yaml'],
      File['/etc/nvidia-mig-manager/puppet-hooks.yaml'],
    ],
    environment => [
      'MIG_PARTED_CONFIG_FILE=/etc/nvidia-mig-manager/puppet-config.yaml',
      'MIG_PARTED_HOOKS_FILE=/etc/nvidia-mig-manager/puppet-hooks.yaml',
      'MIG_PARTED_SELECTED_CONFIG=default',
      'MIG_PARTED_SKIP_RESET=false',
    ],
    path        => ['/usr/bin'],
    notify      => [
      Service['nvidia-persistenced'],
      Service['nvidia-dcgm'],
    ],
  }

  Package <| tag == profile::gpu::install::passthrough |> -> Exec['nvidia-mig-parted apply']
}

class profile::gpu::install::vgpu (
Enum['rpm', 'bin', 'none'] $installer = 'none',
String $nvidia_ml_py_version = '11.515.75',
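For context, here is a minimal sketch of the hiera data that profile::gpu::install::mig reads through lookup('terraform.instances...'); the node name and sizes are hypothetical, but the key layout (tags, specs.gpus, specs.mig) matches what the manifests and templates in this PR consume:

# Hypothetical terraform.instances entry; node name and values are examples only.
terraform:
  instances:
    node1:
      tags: ['node']
      specs:
        cpus: 12
        ram: 46000
        gpus: 1
        mig:
          1g.5gb: 7

With that spec, the puppet-config.yaml heredoc above would render roughly as:

version: v1
mig-configs:
  default:
    - devices: all
      mig-enabled: true
      mig-devices: {"1g.5gb":7}

The Exec['nvidia-mig-parted apply'] resource can also be exercised by hand with the same environment; a sketch, to be run as root on the GPU node:

export MIG_PARTED_CONFIG_FILE=/etc/nvidia-mig-manager/puppet-config.yaml
export MIG_PARTED_HOOKS_FILE=/etc/nvidia-mig-manager/puppet-hooks.yaml
export MIG_PARTED_SELECTED_CONFIG=default
export MIG_PARTED_SKIP_RESET=false
nvidia-mig-parted assert || nvidia-mig-parted apply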
44 changes: 28 additions & 16 deletions site/profile/manifests/slurm.pp
@@ -9,7 +9,7 @@
class profile::slurm::base (
String $cluster_name,
String $munge_key,
-  Enum['20.11', '21.08', '22.05', '23.02'] $slurm_version,
+  Enum['23.02'] $slurm_version,
Integer $os_reserved_memory,
Integer $suspend_time = 3600,
Integer $resume_timeout = 3600,
@@ -143,7 +143,7 @@
require => Package['munge']
}

-  $yumrepo_prefix = "https://download.copr.fedorainfracloud.org/results/cmdntrf/Slurm${slurm_version}/"
+  $yumrepo_prefix = "https://download.copr.fedorainfracloud.org/results/cmdntrf/Slurm${slurm_version}-nvml"
yumrepo { 'slurm-copr-repo':
enabled => true,
descr => "Copr repo for Slurm${slurm_version} owned by cmdntrf",
@@ -257,18 +257,6 @@
}),
}

-  file { '/etc/slurm/gres.conf':
-    ensure  => 'present',
-    owner   => 'slurm',
-    group   => 'slurm',
-    content => epp('profile/slurm/gres.conf',
-      {
-        'nodes' => $nodes,
-      }
-    ),
-    seltype => 'etc_t'
-  }

file { '/opt/software/slurm/bin/cond_restart_slurm_services':
require => Package['slurm'],
mode => '0755',
@@ -424,6 +412,20 @@
contain profile::slurm::base
include profile::mail::server

+  $instances = lookup('terraform.instances')
+  $nodes = $instances.filter|$key, $attr| { 'node' in $attr['tags'] }
+  file { '/etc/slurm/gres.conf':
+    ensure  => 'present',
+    owner   => 'slurm',
+    group   => 'slurm',
+    content => epp('profile/slurm/gres.conf',
+      {
+        'nodes' => $nodes,
+      }
+    ),
+    seltype => 'etc_t'
+  }

file { '/usr/sbin/slurm_mail':
ensure => 'present',
source => 'puppet:///modules/profile/slurm/slurm_mail',
@@ -654,9 +656,20 @@
group => 'slurm'
}

+  if $facts['nvidia_gpu_count'] > 0 {
+    file { '/etc/slurm/gres.conf':
+      notify  => Service['slurmd'],
+      seltype => 'etc_t',
+      content => @(EOT)
+        AutoDetect=nvml
+        |EOT
+    }
+  }

Exec <| tag == profile::cvmfs |> -> Service['slurmd']
Exec <| tag == profile::freeipa |> -> Service['slurmd']
Exec <| tag == profile::gpu |> -> Service['slurmd']
+  Exec <| tag == profile::gpu::install::mig |> ~> Service['slurmd']
Exec <| tag == profile::jupyterhub |> -> Service['slurmd']
Kmod::Load <| |> -> Service['slurmd']
Mount <| |> -> Service['slurmd']
@@ -671,14 +684,13 @@

service { 'slurmd':
ensure => 'running',
-    enable => true,
+    enable => false,
subscribe => [
File['/etc/slurm/cgroup.conf'],
File['/etc/slurm/plugstack.conf'],
File['/etc/slurm/slurm.conf'],
File['/etc/slurm/slurm-addendum.conf'],
File['/etc/slurm/nodes.conf'],
-      File['/etc/slurm/gres.conf'],
],
require => [
Package['slurm-slurmd'],
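Once a MIG-enabled node reboots and slurmd re-registers, the controller should report the MIG-derived GRES; a quick check, with a hypothetical node name and output assuming the 1g.5gb x 7 profile sketched above:

scontrol show node node1 | grep -i gres
#   Gres=gpu:1g.5gb:7
sinfo -N -o '%N %G'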
12 changes: 7 additions & 5 deletions site/profile/templates/slurm/gres.conf.epp
@@ -3,8 +3,10 @@
###########################################################
AutoDetect=off
<% $nodes.each |$name, $attr| { -%>
-<% if $attr['specs']['gpus'] > 1 { -%>
-NodeName=<%= $name %> Name=gpu File=/dev/nvidia[0-<%= $attr['specs']['gpus'] - 1 %>]
-<% } elsif $attr['specs']['gpus'] == 1 { -%>
-NodeName=<%= $name %> Name=gpu File=/dev/nvidia0
-<% }} -%>
+<% if $attr['specs']['gpus'] > 0 { -%>
+<% if $attr['specs']['mig'] { -%>
+<% $attr['specs']['mig'].map|$key,$value| { -%>
+NodeName=<%= $name %> Name=gpu Type=<%= $key %> Count=<%= $value * $attr['specs']['gpus'] %> File=<%= join( range(0, $value * $attr['specs']['gpus'] - 1).map|$i| { "/dev/nvidia-mig-${key}-${i}" } , ',') %>
+<% }} else { -%>
+NodeName=<%= $name %> Name=gpu Count=<%= $attr['specs']['gpus'] %> File=<%= join( range(0, $attr['specs']['gpus']-1).map|$i| { "/dev/nvidia${i}" } , ',') %>
+<% }}} -%>
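As an illustration, for two hypothetical nodes, node1 with specs gpus=1 and mig={'1g.5gb': 7}, and node2 with gpus=2 and no mig, this template would render:

AutoDetect=off
NodeName=node1 Name=gpu Type=1g.5gb Count=7 File=/dev/nvidia-mig-1g.5gb-0,/dev/nvidia-mig-1g.5gb-1,/dev/nvidia-mig-1g.5gb-2,/dev/nvidia-mig-1g.5gb-3,/dev/nvidia-mig-1g.5gb-4,/dev/nvidia-mig-1g.5gb-5,/dev/nvidia-mig-1g.5gb-6
NodeName=node2 Name=gpu Count=2 File=/dev/nvidia0,/dev/nvidia1

The /dev/nvidia-mig-<type>-<index> device files are presumably created by the nvidia-mig-manager tooling when the profile is applied.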
4 changes: 2 additions & 2 deletions site/profile/templates/slurm/nodes.conf.epp
@@ -4,13 +4,13 @@ NodeName=DEFAULT MemSpecLimit=<%= $memlimit %> State=CLOUD
# Always online computes nodes
<% $nodes.each |$name, $attr| { -%>
<% if !('pool' in $attr['tags']) { -%>
-NodeName=<%= $name %> CPUs=<%= $attr['specs']['cpus'] %> RealMemory=<%= $attr['specs']['ram'] %> Gres=gpu:<%= $attr['specs']['gpus'] %> Weight=<%= $weights[$name] %>
+NodeName=<%= $name %> CPUs=<%= $attr['specs']['cpus'] %> RealMemory=<%= $attr['specs']['ram'] %> Gres=<%= if $attr['specs']['gpus'] > 0 { if $attr['specs']['mig'] { join($attr['specs']['mig'].map|$key,$value| { join(["gpu", $key, $value * $attr['specs']['gpus']], ':') }, ',') } else { "gpu:${attr['specs']['gpus']}" } } else { "gpu:0" } %><%= if $attr['specs']['shard'] {",shard:${attr['specs']['shard']}"} else { '' } %> Weight=<%= $weights[$name] %>
<% } -%>
<% } -%>

# On-demand pool compute nodes
<% $nodes.each |$name, $attr| { -%>
<% if 'pool' in $attr['tags'] { -%>
-NodeName=<%= $name %> CPUs=<%= $attr['specs']['cpus'] %> RealMemory=<%= $attr['specs']['ram'] %> Gres=gpu:<%= $attr['specs']['gpus'] %> Weight=<%= $weights[$name] %>
+NodeName=<%= $name %> CPUs=<%= $attr['specs']['cpus'] %> RealMemory=<%= $attr['specs']['ram'] %> Gres=<%= if $attr['specs']['gpus'] > 0 { if $attr['specs']['mig'] { join($attr['specs']['mig'].map|$key,$value| { join(["gpu", $key, $value * $attr['specs']['gpus']], ':') }, ',') } else { "gpu:${attr['specs']['gpus']}" } } else { "gpu:0" } %><%= if $attr['specs']['shard'] {",shard:${attr['specs']['shard']}"} %> Weight=<%= $weights[$name] %>
<% } -%>
<% } -%>
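Rendered output for the same two hypothetical nodes (weights illustrative), assuming node2 additionally sets specs.shard=16:

NodeName=node1 CPUs=12 RealMemory=46000 Gres=gpu:1g.5gb:7 Weight=1
NodeName=node2 CPUs=16 RealMemory=124000 Gres=gpu:2,shard:16 Weight=2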
3 changes: 2 additions & 1 deletion site/profile/templates/slurm/slurm.conf.epp
@@ -12,10 +12,11 @@ SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# NODE CONFIGURATIONS
-GresTypes=gpu
+GresTypes=gpu,shard

TreeWidth=<%= $nb_nodes %>
ReturnToService=2 # A DOWN node will become available for use upon registration with a valid configuration.
+RebootProgram=/usr/sbin/reboot
ResumeProgram=/usr/bin/slurm_resume
SuspendProgram=/usr/bin/slurm_suspend
ResumeFailProgram=/usr/bin/slurm_suspend
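With GresTypes=gpu,shard in place, jobs can request a fraction of a GPU through the shard GRES rather than a whole device; a usage sketch with illustrative values:

# Request 4 shards of a GPU (fractional use):
srun --gres=shard:4 --pty nvidia-smi
# Whole-GPU requests keep working as before:
sbatch --gres=gpu:1 job.sh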