Install Slurm
0 Overview
This guide walks through setting up Slurm on an OpenCHAMI cluster. It assumes you have already set up an OpenCHAMI cluster per Sections 1-2.6 of the OpenCHAMI Tutorial, and therefore uses the rocky user on a Rocky Linux 9 system; substitute your own user if you set up an OpenCHAMI cluster outside of the tutorial. The only other requirement is a webserver to serve a Slurm repo for the image builder, and this guide assumes Podman is available for that. The guide covers Slurm setup for a cluster with one head node and one compute node, but this is easily expanded to multiple compute nodes by updating the node list with ochami.
0.1 Prerequisites
Note
This guide assumes you have set up an OpenCHAMI cluster per Sections 1-2.6 in the OpenCHAMI tutorial.
0.2 Contents
- 0 Overview
- 1 Setup and Configure Slurm
- 1.1 Setup Slurm Build/Installation as a Local Repository
- 1.2 Configure Slurm and Slurm Services
- 1.3 Install Slurm and Setup Configuration Files
- 1.4 Make a Local Slurm Repository and Serve it with Nginx
- 1.5 Configure the Boot Script Service and Cloud-Init
- 1.6 Boot the Compute Node with the Slurm Compute Image
- 1.7 Configure and Start Slurm in the Compute Node
- 1.8 Test Munge and Slurm
1 Setup and Configure Slurm
Steps in this section occur on the head node created in the OpenCHAMI tutorial (or otherwise).
1.1 Setup Slurm Build/Installation as a Local Repository
Install version 0.5.18 of munge. Versions 0.5-0.5.17 contain a significant security vulnerability, so it is important to use version 0.5.18 instead of the 0.5.13 that is available through dnf for Rocky Linux 9. For more information, see: https://nvd.nist.gov/vuln/detail/CVE-2026-25506
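If munge is already present on the system, you can check whether it falls in the affected range before proceeding. This is a sketch; check_munge_version is a hypothetical helper that compares version strings with sort -V:

```shell
# Hypothetical helper: succeeds if the given munge version is >= 0.5.18.
# sort -V compares numeric components correctly (e.g. 0.5.9 < 0.5.18).
check_munge_version() {
  required=0.5.18
  lowest=$(printf '%s\n%s\n' "$required" "$1" | sort -V | head -n1)
  [ "$lowest" = "$required" ]
}

# Example: feed it the version reported by an existing install,
# e.g. check_munge_version "$(munge --version | sed 's/^munge-//;s/ .*//')"
check_munge_version 0.5.13 || echo "0.5.13 is in the affected range; upgrade to 0.5.18"
check_munge_version 0.5.18 && echo "0.5.18 is OK"
```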
Change into the working directory (created in Section 1.1 of the Tutorial) so that any files created are put there.
cd /opt/workdir
Grab the munge 0.5.18 release tarball from GitHub:
curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.18.tar.xz -o munge-0.5.18.tar.xz
Build the source RPM from the tarball, install the build dependencies, and build the binary packages:
sudo dnf install -y rpm-build rpmdevtools
rpmbuild -ts munge-0.5.18.tar.xz --define "_topdir /opt/workdir/rpmbuild"
sudo dnf builddep -y /opt/workdir/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm
rpmbuild -tb munge-0.5.18.tar.xz --define "_topdir /opt/workdir/rpmbuild"
Install the RPMs created by rpmbuild:
sudo rpm --install --verbose --force \
rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm \
rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm \
rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm \
rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm \
rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm \
rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm
Check that munge was installed correctly:
munge --version
The output should be:
munge-0.5.18 (2026-02-10)
Install the Slurm build prerequisites for Rocky 9:
sudo dnf -y update && \
sudo dnf clean all && \
sudo dnf -y install epel-release && \
sudo dnf -y install dnf-plugins-core && \
sudo dnf config-manager --set-enabled devel && \
sudo dnf config-manager --set-enabled crb && \
sudo dnf groupinstall -y 'Development Tools' && \
sudo dnf install -y createrepo freeipmi freeipmi-devel dbus-devel gtk2-devel hdf5 hdf5-devel http-parser-devel \
hwloc hwloc-devel jq json-c-devel libaec libconfuse libcurl-devel libevent-devel \
libyaml libyaml-devel lua-devel lua-filesystem lua-json lua-lpeg lua-posix lua-term mariadb mariadb-devel \
ncurses-devel numactl numactl-devel oniguruma openssl-devel pam-devel \
perl-DBI perl-ExtUtils-MakeMaker perl-Switch pigz python3 python3-devel readline-devel \
lsb_release rrdtool rrdtool-devel tcl tcl-devel ucx ucx-cma ucx-devel ucx-ib wget \
lz4-devel s2n-tls-devel libjwt-devel librdkafka-devel && \
sudo dnf clean all
Create a build script for Slurm 24.05.5 and PMIX 4.2.9-1:
Note
This guide installs Slurm 24.05.5 and PMIX 4.2.9-1 to ensure compatibility. Other versions can be installed instead, but make sure to check version compatibility first.
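The build script splits PMIXVERSION on the dash to recover the upstream release tag used in the source RPM URL. The expansion works like this:

```shell
# "${PMIXVERSION//-/ }" replaces every "-" with a space; unquoted, the
# result word-splits into a bash array, so element 0 is the release tag
# and element 1 is the RPM release number.
PMIXVERSION=4.2.9-1
subversions=( ${PMIXVERSION//-/ } )
pmixmajor=${subversions[0]}
echo "$pmixmajor"   # prints 4.2.9
```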
Edit as normal user: /opt/workdir/build.sh
#!/bin/bash
SLURMVERSION=${1:-24.05.5}
PMIXVERSION=${2:-4.2.9-1}
ELRELEASE=${3:-el9} # Rocky 9
subversions=( ${PMIXVERSION//-/ } )
pmixmajor=${subversions[0]}
export LC_ALL="C"
OSVERSION=$(lsb_release -r | gawk '{print $2}')
CDIR=$(pwd)
SDIR="slurm/$OSVERSION/$SLURMVERSION"
mkdir -p ${SDIR}
if [[ -e ${SDIR}/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm ]]; then
    echo "The RPM of PMIX version ${PMIXVERSION} is already available."
else
    cd slurm
    wget https://github.com/openpmix/openpmix/releases/download/v${pmixmajor}/pmix-${PMIXVERSION}.src.rpm || {
        echo "$? pmix-${PMIXVERSION}.src.rpm not downloaded"
        exit 1
    }
    rpmbuild --rebuild ./pmix-${PMIXVERSION}.src.rpm &> rpmbuild-pmix-${PMIXVERSION}.log || {
        echo "$? pmix-${PMIXVERSION}.src.rpm failed to build; review rpmbuild-pmix-${PMIXVERSION}.log"
        exit 1
    }
    cd ${CDIR}
    mv /root/rpmbuild/RPMS/x86_64/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm ${SDIR}
    dnf -y install ${SDIR}/pmix-${PMIXVERSION}.${ELRELEASE}.x86_64.rpm
fi
if compgen -G "${SDIR}/slurm-${SLURMVERSION}-*.rpm" > /dev/null; then
    echo "The RPMs of slurm ${SLURMVERSION} are already available."
else
    cd slurm
    wget https://download.schedmd.com/slurm/slurm-${SLURMVERSION}.tar.bz2 || wget http://www.schedmd.com/download/archive/slurm-${SLURMVERSION}.tar.bz2 || {
        echo "$? slurm-${SLURMVERSION}.tar.bz2 not downloaded"
        exit 1
    }
    rpmbuild -ta --with pmix --with lua --with pam --with mysql --with ucx --with slurmrestd slurm-${SLURMVERSION}.tar.bz2 &> rpmbuild-slurm-${SLURMVERSION}.log || {
        echo "$? slurm-${SLURMVERSION}.tar.bz2 failed to build; review rpmbuild-slurm-${SLURMVERSION}.log"
        exit 1
    }
    grep 'configure: WARNING:' rpmbuild-slurm-${SLURMVERSION}.log
    cd ${CDIR}
    mv /root/rpmbuild/RPMS/x86_64/slurm*-${SLURMVERSION}-*.rpm ${SDIR}
fi
Adjust permissions on the build script so that it is executable, and execute it with root privileges:
chmod 755 /opt/workdir/build.sh
sudo /opt/workdir/build.sh
Note
The following warnings are normal:
configure: WARNING: unable to locate libnvidia-ml.so and/or nvml.h
configure: WARNING: unable to locate librocm_smi64.so and/or rocm_smi.h
configure: WARNING: unable to locate libze_loader.so and/or ze_api.h
configure: WARNING: HPE Slingshot: unable to locate libcxi/libcxi.h
configure: WARNING: unable to build man page html files without man2html
Copy the Slurm packages to the desired location to create the local repository:
sudo mkdir -p /srv/repo/rocky/9/x86_64/
sudo cp -r /opt/workdir/slurm/9.7/24.05.5 /srv/repo/rocky/9/x86_64/slurm-24.05.5
Create the local repository (this will be used for installation and images later):
sudo createrepo /srv/repo/rocky/9/x86_64/slurm-24.05.5
The output should be:
Directory walk started
Directory walk done - 15 packages
Temporary output repo path: /srv/repo/rocky/9/x86_64/slurm-24.05.5/.repodata/
Preparing sqlite DBs
Pool started (with 5 workers)
Pool finished
1.2 Configure Slurm and Slurm Services
Create a user and group 'slurm' with a specified UID/GID:
SLURMID=666
sudo groupadd -g $SLURMID slurm
sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm
Note
The following warning is expected and can be ignored; the 'slurm' user is a system service account, so it can have a UID below 1000.
useradd warning: slurm's uid 666 outside of the UID_MIN 1000 and UID_MAX 60000 range.
Update the UID and GID of the 'munge' user and group to 616, update directory ownership, create a munge key, and restart the munge service:
# Update UID and GID
sudo usermod -u 616 munge
sudo groupmod -g 616 munge
# Fix user and group ownership
sudo chown munge:munge /var/log/munge/
sudo chown munge:munge /var/lib/munge/
sudo chown munge:munge /etc/munge/
# Create munge key
sudo -u munge /usr/sbin/mungekey -v
# Start munge again
sudo systemctl enable --now munge
Install MariaDB:
sudo dnf -y install mariadb-server
Tune MariaDB with the Slurm-recommended options on the head node, where MariaDB will be running:
cat <<EOF | sudo tee /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=5120M
innodb_log_file_size=512M
innodb_lock_wait_timeout=900
max_allowed_packet=16M
EOF
Note
We are assigning 5 GB to innodb_buffer_pool_size. The pool size should be 5-50% of the head node's available memory, and at least 4 GB.
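As a rough sizing aid (a sketch, assuming a Linux head node with /proc/meminfo), the 5-50% guidance can be computed; 25% is used here as an arbitrary middle-ground choice:

```shell
# Suggest an innodb_buffer_pool_size: 25% of total RAM, floored at 4096 MB.
mem_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
pool_mb=$(( mem_mb / 4 ))
if [ "$pool_mb" -lt 4096 ]; then pool_mb=4096; fi
echo "innodb_buffer_pool_size=${pool_mb}M"
```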
Enable and start the MariaDB service; as this is a single-node cluster, we aren't setting up High Availability:
sudo systemctl enable --now mariadb
Secure the MariaDB installation with a strong root password. Use pwgen to generate the password, and store it securely; you will use it to set up and configure MariaDB, and to create a database on the head node for Slurm to access:
sudo dnf -y install pwgen
export SQL_PWORD="$(pwgen 20 1)"
echo "${SQL_PWORD}" # copy output for interactive prompts and so you can store it somewhere securely
sudo mysql_secure_installation
The MariaDB setup prompts should be answered as follows:
Enter current password for root (enter for none): # enter user password (e.g. "rocky" if following tutorial)
Switch to unix_socket authentication [Y/n] Y
Change the root password? [Y/n] Y
New password: # use the password from pwgen
Re-enter new password: # use the password from pwgen
Remove anonymous users? [Y/n] n
Disallow root login remotely? [Y/n] Y
Remove test database and access to it? [Y/n] n
Reload privilege tables now? [Y/n] Y
Create the database and grant access to localhost and the head node. When prompted with "Enter password:", use the password you generated with pwgen above:
cat <<EOF | mysql -u root -p
create database slurm_acct_db;
grant all on slurm_acct_db.* to slurm@'localhost' identified by "${SQL_PWORD}";
grant all on slurm_acct_db.* to slurm@'demo.openchami.cluster' identified by "${SQL_PWORD}";
grant all on slurm_acct_db.* to slurm@'demo' identified by "${SQL_PWORD}";
exit
EOF
Install a few more required dependencies:
sudo dnf -y install jq libconfuse numactl parallel perl-DBI perl-Switch
Set up the directory structure for the Slurm database and controller daemon services:
sudo mkdir -p /var/spool/slurmctld /var/log/slurm /run/slurm
sudo chown -R slurm. /var/spool/slurmctld /var/log/slurm /run/slurm
echo "d /run/slurm 0755 slurm slurm -" | sudo tee /usr/lib/tmpfiles.d/slurm.conf
1.3 Install Slurm and Setup Configuration Files
Add the Slurm repo created earlier so packages are installed from it (this ensures we get the correct versions):
# Create local repo file
SLURMVERSION=24.05.5
cat <<EOF | sudo tee /etc/yum.repos.d/slurm-local.repo
[slurm-local]
name=Slurm ${SLURMVERSION} - Local
baseurl=file:///srv/repo/rocky/9/x86_64/slurm-${SLURMVERSION}
gpgcheck=0
enabled=1
countme=1
EOF
# Install from local repo file
sudo dnf -y install slurm slurm-contribs slurm-example-configs slurm-libpmi slurm-pam_slurm slurm-perlapi slurm-slurmctld slurm-slurmdbd pmix
Create the configuration files by copying the example files, then fix the directory and file ownership:
# Copy configuration files
sudo cp -p /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
sudo cp -p /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
# Set directory and file ownership to slurm
sudo chown -R slurm. /etc/slurm/
Modify the slurmdbd config. You will need the pwgen-generated password from the MariaDB setup earlier in this section:
DBHOST=demo
DBPASSWORD="${SQL_PWORD}"
SLURMDBHOST1=demo
sudo sed -i "s|DbdAddr.*|DbdAddr=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|DbdHost.*|DbdHost=${SLURMDBHOST1}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|PidFile.*|PidFile=/var/run/slurm/slurmdbd.pid|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|#StorageHost.*|StorageHost=${DBHOST}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|#StoragePort.*|StoragePort=3306|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|StoragePass.*|StoragePass=${DBPASSWORD}|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|SlurmUser.*|SlurmUser=slurm|g" /etc/slurm/slurmdbd.conf
sudo sed -i "s|#StorageLoc.*|StorageLoc=slurm_acct_db|g" /etc/slurm/slurmdbd.conf
The environment variable we set earlier to store the SQL password should now be unset for security:
unset SQL_PWORD
Create the Slurm config file, which will be used by slurmctld. Note that you may need to update the NodeName info depending on the configuration of your compute node.
Note
If the head node is in a VM (see Head Node: Using Virtual
Machine),
the SlurmctldHost will be head instead of demo.
Edit the Slurm config file as root: /etc/slurm/slurm.conf
Add job container config file to Slurm config directory:
SLURMTMPDIR=/lscratch
cat <<EOF | sudo tee /etc/slurm/job_container.conf
# Job /tmp on a local volume mounted on ${SLURMTMPDIR}
# /dev/shm has special handling, and instead of a bind mount is always a fresh tmpfs filesystem.
BasePath=${SLURMTMPDIR}
AutoBasePath=true
Shared=true
EOF
Configure the hosts file with addresses for both the head node and the compute node:
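The exact entries depend on your network; as a sketch using the tutorial's addresses (head node demo at 172.16.0.254, compute node de01 at 172.16.0.1), the additions might look like:

```shell
# Hypothetical /etc/hosts additions; adjust names and addresses to your cluster.
cat <<'EOF' | sudo tee -a /etc/hosts
172.16.0.254 demo.openchami.cluster demo
172.16.0.1   de01.openchami.cluster de01
EOF
```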
1.4 Make a Local Slurm Repository and Serve it with Nginx
Create configuration file to mount into Nginx container:
Edit as normal user: /opt/workdir/nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log notice;
pid /run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    #tcp_nopush on;
    keepalive_timeout 65;
    #gzip on;
    include /etc/nginx/conf.d/*.conf;
    server {
        # configuration of HTTP virtual server
        location /slurm-24.05.5 {
            # serve static files for the local Slurm repo from this path,
            # such that a request for /slurm-24.05.5/repodata/repomd.xml
            # is served /usr/share/nginx/html/slurm-24.05.5/repodata/repomd.xml
            root /usr/share/nginx/html;
        }
    }
}
Use Podman to run Nginx in a container with the local Slurm repository and the Nginx configuration file mounted into it:
podman run --name serve-slurm \
-v /opt/workdir/nginx.conf:/etc/nginx/nginx.conf \
--mount type=bind,source=/srv/repo/rocky/9/x86_64/slurm-24.05.5,target=/usr/share/nginx/html/slurm-24.05.5,readonly \
-p 8080:80 -d nginx
Check everything is working by fetching the repo metadata from the head node:
curl http://localhost:8080/slurm-24.05.5/repodata/repomd.xml
The output should be:
<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
<revision>1770960915</revision>
<data type="primary">
<checksum type="sha256">4670c00aed4cc64e542e8b76f4f59ec4dd333a2e02258ddab5b7604874915dff</checksum>
<open-checksum type="sha256">04f66940b8479413f57cf15aa66d56624aede301f064356ee667ccf4594470ef</open-checksum>
<location href="repodata/4670c00aed4cc64e542e8b76f4f59ec4dd333a2e02258ddab5b7604874915dff-primary.xml.gz"/>
<timestamp>1770960914</timestamp>
<size>5336</size>
<open-size>33064</open-size>
</data>
<data type="filelists">
<checksum type="sha256">11b43e8e70d418dbe78a8c064ca42e18d63397729ba0710323034597f681d0a4</checksum>
<open-checksum type="sha256">1f2b8e754a2db5c26557ad2a7b9c8b6a210115a4263fb153bc0445dc8210b59c</open-checksum>
<location href="repodata/11b43e8e70d418dbe78a8c064ca42e18d63397729ba0710323034597f681d0a4-filelists.xml.gz"/>
<timestamp>1770960914</timestamp>
<size>11154</size>
<open-size>68224</open-size>
</data>
<data type="other">
<checksum type="sha256">a7f25375920bf5d30d9de42a6f4aeaa8105b1150bc9bef1440e700369bcdcf53</checksum>
<open-checksum type="sha256">da1da29e2d02a626986c3647032c175e0cb768d4d643c9020e2ccc343ced93e4</open-checksum>
<location href="repodata/a7f25375920bf5d30d9de42a6f4aeaa8105b1150bc9bef1440e700369bcdcf53-other.xml.gz"/>
<timestamp>1770960914</timestamp>
<size>1229</size>
<open-size>3354</open-size>
</data>
<data type="primary_db">
<checksum type="sha256">34f7acb86f91ab845250ed939181b88acb0be454e9c42eb99cb871e3241f75e4</checksum>
<open-checksum type="sha256">038901ed7c43b991becd931370b539c29ad5c7abffefc1ce6fc20cb8e1c1b7c7</open-checksum>
<location href="repodata/34f7acb86f91ab845250ed939181b88acb0be454e9c42eb99cb871e3241f75e4-primary.sqlite.bz2"/>
<timestamp>1770960915</timestamp>
<size>12132</size>
<open-size>131072</open-size>
<database_version>10</database_version>
</data>
<data type="filelists_db">
<checksum type="sha256">1912b17f136f28e892c9591e34abc1c8ef5b466df8eed7d6d1c5adadb200c6ad</checksum>
<open-checksum type="sha256">9eb023458e4570a8c3d9407e24ee52a94befc93785e71b1f72a5d90f314762e2</open-checksum>
<location href="repodata/1912b17f136f28e892c9591e34abc1c8ef5b466df8eed7d6d1c5adadb200c6ad-filelists.sqlite.bz2"/>
<timestamp>1770960915</timestamp>
<size>15917</size>
<open-size>73728</open-size>
<database_version>10</database_version>
</data>
<data type="other_db">
<checksum type="sha256">2872ebc347c2e5fe166907ba8341dc10ef9d0419261fac253cb6bab0d1eb046f</checksum>
<open-checksum type="sha256">5db7c12e76bde1a6b5739ad5c52481633d1dd87599e86ce4d84bae8fe4504db1</open-checksum>
<location href="repodata/2872ebc347c2e5fe166907ba8341dc10ef9d0419261fac253cb6bab0d1eb046f-other.sqlite.bz2"/>
<timestamp>1770960915</timestamp>
<size>1940</size>
<open-size>24576</open-size>
<database_version>10</database_version>
</data>
</repomd>
Create the compute Slurm image config file (it uses the base image created in the tutorial as the parent layer):
Warning
When writing YAML, it’s important to be consistent with spacing. It is recommended to use spaces for all indentation instead of tabs.
When pasting, you may have to configure your editor to not apply indentation
rules (:set paste in Vim, :set nopaste to switch back).
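As a quick guard against the tab problem described above, you can scan a YAML file for literal tab characters before handing it to the image builder (the path shown is the file created in this step):

```shell
# Print any lines containing a literal tab; no output means the file is clean.
tab=$(printf '\t')
grep -n "$tab" /etc/openchami/data/images/compute-slurm-rocky9.yaml \
  || echo "no tabs found"
```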
Edit as root: /etc/openchami/data/images/compute-slurm-rocky9.yaml
options:
  layer_type: base
  name: compute-slurm
  publish_tags:
    - 'rocky9'
  pkg_manager: dnf
  gpgcheck: False
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'
  publish_s3: 'http://demo.openchami.cluster:7070'
  s3_prefix: 'compute/slurm/'
  s3_bucket: 'boot-images'
repos:
  - alias: 'Epel9'
    url: 'https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/'
    gpg: 'https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9'
  - alias: 'Slurm'
    url: 'http://localhost:8080/slurm-24.05.5'
packages:
  - boxes
  - figlet
  - git
  - nfs-utils
  - tcpdump
  - traceroute
  - vim
  - curl
  - rpm-build
  - shadow-utils
  - pwgen
  - jq
  - libconfuse
  - numactl
  - parallel
  - perl-DBI
  - slurm-24.05.5
  - pmix-4.2.9
  - slurm-contribs-24.05.5
  - slurm-devel-24.05.5
  - slurm-example-configs-24.05.5
  - slurm-libpmi-24.05.5
  - slurm-pam_slurm-24.05.5
  - slurm-perlapi-24.05.5
  - slurm-sackd-24.05.5
  - slurm-slurmctld-24.05.5
  - slurm-slurmd-24.05.5
  - slurm-slurmdbd-24.05.5
  - slurm-slurmrestd-24.05.5
  - slurm-torque-24.05.5
cmds:
  - cmd: 'curl -sL https://github.com/dun/munge/releases/download/munge-0.5.18/munge-0.5.18.tar.xz -o munge-0.5.18.tar.xz'
  - cmd: 'rpmbuild -ts munge-0.5.18.tar.xz'
  - cmd: 'dnf builddep -y /root/rpmbuild/SRPMS/munge-0.5.18-1.el9.src.rpm'
  - cmd: 'rpmbuild -tb munge-0.5.18.tar.xz'
  - cmd: 'cd /root/rpmbuild'
  - cmd: 'rpm --install --verbose --force /root/rpmbuild/RPMS/x86_64/munge-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debuginfo-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-debugsource-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-devel-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-0.5.18-1.el9.x86_64.rpm /root/rpmbuild/RPMS/x86_64/munge-libs-debuginfo-0.5.18-1.el9.x86_64.rpm'
  - cmd: 'dnf remove -y munge-libs-0.5.13-13.el9 munge-0.5.13-13.el9'
Run a Podman container to execute the image build command. The S3_ACCESS and S3_SECRET tokens were set earlier in the tutorial.
podman run \
--rm \
--device /dev/fuse \
--network host \
-e S3_ACCESS=${ROOT_ACCESS_KEY} \
-e S3_SECRET=${ROOT_SECRET_KEY} \
-v /etc/openchami/data/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml \
ghcr.io/openchami/image-build-el9:v0.1.2 \
image-build \
--config config.yaml \
--log-level DEBUG
Note
If you have already aliased the image build command per the tutorial, you can instead run:
build-image /etc/openchami/data/images/compute-slurm-rocky9.yaml
Check that the images built.
s3cmd ls -Hr s3://boot-images/ | cut -d' ' -f 4- | grep slurm
The output should be:
1615M s3://boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9
84M s3://boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.20.1.el9_7.x86_64.img
14M s3://boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64
1.5 Configure the Boot Script Service and Cloud-Init
Get a fresh access token for ochami:
export DEMO_ACCESS_TOKEN=$(sudo bash -lc 'gen_access_token')
Create the payload for the boot script service with URIs for the Slurm compute boot artefacts:
sudo mkdir -p /etc/openchami/data/boot/bss
URIS=$(s3cmd ls -Hr s3://boot-images | grep compute/slurm | awk '{print $4}' | sed 's-s3://-http://172.16.0.254:7070/-' | xargs)
URI_IMG=$(echo "$URIS" | cut -d' ' -f1)
URI_INITRAMFS=$(echo "$URIS" | cut -d' ' -f2)
URI_KERNEL=$(echo "$URIS" | cut -d' ' -f3)
cat <<EOF | sudo tee /etc/openchami/data/boot/bss/boot-compute-slurm-rocky9.yaml
---
kernel: '${URI_KERNEL}'
initrd: '${URI_INITRAMFS}'
params: 'nomodeset ro root=live:${URI_IMG} ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init'
macs:
- 52:54:00:be:ef:01
EOF
Set the BSS parameters:
ochami bss boot params set -f yaml -d @/etc/openchami/data/boot/bss/boot-compute-slurm-rocky9.yaml
Check that the BSS boot parameters were added:
ochami bss boot params get -F yaml
The output should be:
- cloud-init:
meta-data: null
phone-home:
fqdn: ""
hostname: ""
instance_id: ""
pub_key_dsa: ""
pub_key_ecdsa: ""
pub_key_rsa: ""
user-data: null
initrd: http://172.16.0.254:9000/boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.20.1.el9_7.x86_64.img
kernel: http://172.16.0.254:9000/boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.20.1.el9_7.x86_64
macs:
- 52:54:00:be:ef:01
params: nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init
Create a new directory for the cloud-init configuration:
sudo mkdir -p /etc/openchami/data/cloud-init
cd /etc/openchami/data/cloud-init
Create a new SSH key on the head node, pressing Enter at all of the prompts:
ssh-keygen -t ed25519
The newly generated key is at ~/.ssh/id_ed25519.pub; it will be used in the cloud-init meta-data configured below.
Set up the cloud-init configuration by creating ci-defaults.yaml:
cat <<EOF | sudo tee /etc/openchami/data/cloud-init/ci-defaults.yaml
---
base-url: "http://172.16.0.254:8081/cloud-init"
cluster-name: "demo"
nid-length: 2
public-keys:
- "$(cat ~/.ssh/id_ed25519.pub)"
short-name: "de"
EOF
Then, set the cloud-init defaults using the ochami CLI:
ochami cloud-init defaults set -f yaml -d @/etc/openchami/data/cloud-init/ci-defaults.yaml
Verify that these values were set with:
ochami cloud-init defaults get -F json-pretty
The output should be:
{
"base-url": "http://172.16.0.254:8081/cloud-init",
"cluster-name": "demo",
"nid-length": 2,
"public-keys": [
"<YOUR SSH KEY>"
],
"short-name": "de"
}
Configure cloud-init for the compute group:
Edit as root: /etc/openchami/data/cloud-init/ci-group-compute.yaml
- name: compute
  description: "compute config"
  file:
    encoding: plain
    content: |
      ## template: jinja
      #cloud-config
      merge_how:
        - name: list
          settings: [append]
        - name: dict
          settings: [no_replace, recurse_list]
      users:
        - name: root
          ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
      disable_root: false
Now, set this configuration for the compute group:
ochami cloud-init group set -f yaml -d @/etc/openchami/data/cloud-init/ci-group-compute.yaml
Check that it got added with:
ochami cloud-init group get config compute
The cloud-config file embedded in the YAML above should be printed:
## template: jinja
#cloud-config
merge_how:
  - name: list
    settings: [append]
  - name: dict
    settings: [no_replace, recurse_list]
users:
  - name: root
    ssh_authorized_keys: {{ ds.meta_data.instance_data.v1.public_keys }}
disable_root: false
ochami has basic per-group template rendering available that can be used to
check that the Jinja2 is rendering properly for a node. Check it for the first compute node (x1000c0s0b0n0):
ochami cloud-init group render compute x1000c0s0b0n0
Note
This feature requires that impersonation is enabled with cloud-init. Check and
make sure that the IMPERSONATION environment variable is set in
/etc/openchami/configs/openchami.env.
The SSH key that was created above should appear in the config:
#cloud-config
merge_how:
  - name: list
    settings: [append]
  - name: dict
    settings: [no_replace, recurse_list]
users:
  - name: root
    ssh_authorized_keys: ['<SSH_KEY>']
1.6 Boot the Compute Node with the Slurm Compute Image
Boot the compute1 compute node VM from the compute Slurm image:
Note
If the head node is in a VM (see Head Node: Using Virtual
Machine), make sure to run the
virt-install command on the host!
Note
If you receive the following error:
ERROR Failed to open file '/usr/share/OVMF/OVMF_VARS.fd': No such file or directory
Repeat the command, but replace OVMF_VARS.fd with OVMF_VARS_4M.fd and replace OVMF_CODE.secboot.fd with OVMF_CODE_4M.secboot.fd.
If this still fails, check /usr/share/OVMF for the actual file names, as some distros store them under variant names.
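A small sketch to locate whichever firmware naming convention your distro uses (find_ovmf is a hypothetical helper, not part of the tutorial):

```shell
# Return the first existing file among candidate firmware names.
find_ovmf() {
  dir="$1"; shift
  for name in "$@"; do
    if [ -e "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Example: check both naming conventions for the VARS template.
find_ovmf /usr/share/OVMF OVMF_VARS.fd OVMF_VARS_4M.fd \
  || echo "no OVMF VARS file found under /usr/share/OVMF"
```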
Watch it boot. First, it should PXE:
>>Start PXE over IPv4.
Station IP address is 172.16.0.1
Server IP address is 172.16.0.254
NBP filename is ipxe-x86_64.efi
NBP filesize is 1079296 Bytes
Downloading NBP file...
NBP file downloaded successfully.
BdsDxe: loading Boot0001 "UEFI PXEv4 (MAC:525400BEEF01)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(525400BEEF01,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
BdsDxe: starting Boot0001 "UEFI PXEv4 (MAC:525400BEEF01)" from PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(525400BEEF01,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)
iPXE initialising devices...
autoexec.ipxe... Not found (https://ipxe.org/2d12618e)
iPXE 1.21.1+ (ge9a2) -- Open Source Network Boot Firmware -- https://ipxe.org
Features: DNS HTTP HTTPS iSCSI TFTP VLAN SRP AoE EFI Menu
Then, we should see it get its boot script from TFTP, then BSS (the /boot/v1 URL), then download its kernel/initramfs and boot into Linux.
Configuring (net0 52:54:00:be:ef:01)...... ok
tftp://172.16.0.254:69/config.ipxe... ok
Booting from http://172.16.0.254:8081/boot/v1/bootscript?mac=52:54:00:be:ef:01
http://172.16.0.254:8081/boot/v1/bootscript... ok
http://172.16.0.254:7070/boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.24.1.el9_7.x86_64... ok
http://172.16.0.254:7070/boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.24.1.el9_7.x86_64.img... ok
During the Linux boot, output should indicate that the SquashFS image is downloaded and loaded.
[ 2.169210] dracut-initqueue[545]: % Total % Received % Xferd Average Speed Time Time Time Current
[ 2.170532] dracut-initqueue[545]: Dload Upload Total Spent Left Speed
100 1356M 100 1356M 0 0 1037M 0 0:00:01 0:00:01 --:--:-- 1038M
[ 3.627908] squashfs: version 4.0 (2009/01/31) Phillip Lougher
Once the PXE boot process is done, detach from the VM with ctrl+]. Log back into the virsh console if desired with virsh console compute1.
Tip
If the VM installation fails for any reason, it can be destroyed and undefined so that the install command can be run again.
- Shut down ("destroy") the VM:
  sudo virsh destroy compute1
- Undefine the VM:
  sudo virsh undefine --nvram compute1
- Rerun the virt-install command above.
Alternatively, if you want to reboot the compute node VM with an updated image, do the following:
sudo virsh destroy compute1
sudo virsh start --console compute1
1.7 Configure and Start Slurm in the Compute Node
Log in to the compute node as root, ignoring its host key:
Note
If using a VM head node, log in from there; otherwise, log in from the host.
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.16.0.1
Check that the munge and slurm packages were installed from the correct sources (e.g. slurm packages should be installed from the local Slurm repo):
dnf list installed | grep -e munge -e slurm
The output should be:
munge.x86_64 0.5.18-1.el9 @System
munge-debuginfo.x86_64 0.5.18-1.el9 @System
munge-debugsource.x86_64 0.5.18-1.el9 @System
munge-devel.x86_64 0.5.18-1.el9 @System
munge-libs.x86_64 0.5.18-1.el9 @System
munge-libs-debuginfo.x86_64 0.5.18-1.el9 @System
pmix.x86_64 4.2.9-1.el9 @8080_slurm-24.05.5
slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-contribs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-devel.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-example-configs.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-libpmi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-pam_slurm.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-perlapi.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-sackd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-slurmctld.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-slurmd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-slurmdbd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-slurmrestd.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
slurm-torque.x86_64 24.05.5-1.el9 @8080_slurm-24.05.5
Note
If there is a version 0.5.13 of munge currently installed and present in the output from the above command, remove it to ensure that version 0.5.18 is used.
dnf remove -y munge-libs-0.5.13-<version> munge-0.5.13-<version>
Create a slurm config file identical to that of the head node. Note that you may need to update the NodeName info depending on the configuration of your compute node:
Note
If the head node is in a VM (see Head Node: Using Virtual
Machine),
the SlurmctldHost will be head instead of demo.
Edit the Slurm config file as root: /etc/slurm/slurm.conf
Configure the hosts file with addresses for both the head node and the compute node:
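As on the head node, the exact entries depend on your network; a sketch using the tutorial's addresses (head node demo at 172.16.0.254, compute node de01 at 172.16.0.1):

```shell
# Hypothetical /etc/hosts additions; adjust names and addresses to your cluster.
# (No sudo needed here, since you are logged in as root on the compute node.)
cat <<'EOF' >> /etc/hosts
172.16.0.254 demo.openchami.cluster demo
172.16.0.1   de01.openchami.cluster de01
EOF
```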
Create the Slurm user on the compute node:
SLURMID=666
groupadd -g $SLURMID slurm
useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SLURMID -g slurm -s /sbin/nologin slurm
Update Slurm file and directory ownership:
chown -R slurm:slurm /etc/slurm/
chown -R slurm:slurm /var/lib/slurm
Note
Use find / -name "slurm" to make sure everything that needs to be changed is identified. Note that not all results need their ownership modified, such as directories under /run/, /usr/, or /var/!
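A slightly narrower search that prunes pseudo-filesystems is sketched below; find_slurm_paths is a hypothetical helper, with the root passed as a parameter so it can be tried against any directory:

```shell
# find_slurm_paths ROOT: list slurm-related paths under ROOT, pruning
# pseudo-filesystems whose entries never need an ownership change.
find_slurm_paths() {
  root=${1%/}
  find "${root:-/}" -mount \
    \( -path "$root/proc" -o -path "$root/sys" -o -path "$root/run" \) -prune \
    -o -name 'slurm*' -print 2>/dev/null
}

# On the compute node you would run:
#   find_slurm_paths /
```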
Create the directory /var/log/slurm as it doesn’t exist yet, and set ownership to Slurm:
mkdir /var/log/slurm
chown slurm:slurm /var/log/slurm
Create a job_container.conf file that matches the one on the head node:
SLURMTMPDIR=/lscratch
cat <<EOF | sudo tee /etc/slurm/job_container.conf
# Job /tmp on a local volume mounted on ${SLURMTMPDIR}
# /dev/shm has special handling, and instead of a bind mount is always a fresh tmpfs filesystem.
BasePath=${SLURMTMPDIR}
AutoBasePath=true
Shared=true
EOF
Update ownership of the job container config file:
chown slurm:slurm /etc/slurm/job_container.conf
The munge UID is 991 and the GID is 990, so change both to 616 (to match the head node's UID/GID):
usermod -u 616 munge
groupmod -g 616 munge
Note
If you get the following error:
usermod: user munge is currently used by process <PID>
Kill the process and repeat the above two commands:
kill -15 <PID>
Update munge file/directory ownership:
find / -mount -writable -type d -uid 991 -exec chown -R munge:munge \{\} \;

Copy the munge key from the head node to the compute node.
Inside the head node:
cd ~
sudo cp /etc/munge/munge.key ./
sudo chown "$(id -u):$(id -g)" munge.key
scp ./munge.key root@172.16.0.1:~/

Inside the compute node:
mv munge.key /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key

Note
In the case of an error about “Offending ECDSA key in ~/.ssh/known_hosts:3”, remove the compute node from the known hosts file and try the scp command again:
ssh-keygen -R 172.16.0.1

Alternatively, set up an ignore.conf file per Section 2.8.3 of the tutorial to prevent this issue.
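Munge only authenticates across nodes if munge.key is byte-identical everywhere, so it can be worth confirming the copy with checksums before starting services. A sketch, using a temp file as a stand-in for /etc/munge/munge.key:

```shell
# Sketch: compare munge.key checksums between nodes. A temp file stands in
# for /etc/munge/munge.key; on a real cluster, run sha256sum on each node.
KEY=$(mktemp)
head -c 1024 /dev/urandom > "$KEY"                 # stand-in key material
HEAD_SUM=$(sha256sum "$KEY" | awk '{print $1}')    # as run on the head node
COMPUTE_SUM=$(sha256sum "$KEY" | awk '{print $1}') # as run on the compute node
if [ "$HEAD_SUM" = "$COMPUTE_SUM" ]; then
  echo "munge.key checksums match"
fi
```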
Continuing inside the compute node, set up and start the services for Slurm.
Enable and start munge service:
systemctl enable munge.service
systemctl start munge.service
systemctl status munge.service

The output should be:
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
Active: active (running) since Wed 2026-02-04 00:55:55 UTC; 1 week 2 days ago
Docs: man:munged(8)
Main PID: 1451 (munged)
Tasks: 4 (limit: 24335)
Memory: 2.2M (peak: 2.5M)
CPU: 4.710s
CGroup: /system.slice/munge.service
└─1451 /usr/sbin/munged
Feb 04 00:55:55 de01 systemd[1]: Started MUNGE authentication service.

Enable and start slurmd:
systemctl enable slurmd
systemctl start slurmd
systemctl status slurmd

The output should be:
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-02-13 05:59:32 UTC; 4s ago
Main PID: 30727 (slurmd)
Tasks: 1
Memory: 1.3M (peak: 1.5M)
CPU: 16ms
CGroup: /system.slice/slurmd.service
└─30727 /usr/sbin/slurmd --systemd
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Stopped Slurm node daemon.
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: slurmd.service: Consumed 3.533s CPU time, 3.0M memory peak.
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Starting Slurm node daemon...
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPU frequency setting not configured for this node
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd version 24.05.5 started
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: slurmd started on Fri, 13 Feb 2026 05:59:32 +0000
Feb 13 05:59:32 de01.openchami.cluster systemd[1]: Started Slurm node daemon.
Feb 13 05:59:32 de01.openchami.cluster slurmd[30727]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3892 TmpDisk=778 Uptime=796812 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Disable the firewall and reset the nft ruleset on the compute node:
systemctl stop firewalld
systemctl disable firewalld
nft flush ruleset
nft list ruleset

Start the Slurm service daemons on the head node:
sudo systemctl start slurmdbd
sudo systemctl start slurmctld

Restart the Slurm daemon on the compute node to ensure changes are applied:
systemctl restart slurmd

1.8 Test Munge and Slurm
Test munge on the head node:
# Try to munge and unmunge to access the compute node
munge -n | ssh root@172.16.0.1 unmunge

The output should be:
STATUS: Success (0)
ENCODE_HOST: ??? (192.168.200.2)
ENCODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814)
DECODE_TIME: 2026-02-13 05:33:34 +0000 (1770960814)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: ??? (1000)
GID: ??? (1000)
LENGTH: 0

Note
In the case of an error about “Offending ECDSA key in ~/.ssh/known_hosts:3”, remove the compute node from the known hosts file and try the munge command again:
ssh-keygen -R 172.16.0.1

Alternatively, set up an ignore.conf file per Section 2.8.3 of the tutorial to prevent this issue.
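When automating this check rather than eyeballing the output, it is enough to test the STATUS line of unmunge's output. A sketch, with sample output (in the format shown above) held in a variable so the check itself is runnable without a munged daemon:

```shell
# Sketch: script-friendly check of unmunge's STATUS line. UNMUNGE_OUT holds
# sample output; on a real cluster it would come from:
#   munge -n | ssh root@172.16.0.1 unmunge
UNMUNGE_OUT='STATUS: Success (0)
TTL: 300'
if printf '%s\n' "$UNMUNGE_OUT" | grep -q '^STATUS: *Success'; then
  echo "munge round trip OK"
fi
```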
Test that you can submit a job from the head node.
Check that the node is present and idle:
sinfo

The output should be:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main* up infinite 1 idle de01

Create a user with a Slurm account:
sudo useradd -m -s /bin/bash testuser
sudo usermod -aG wheel testuser
sudo sacctmgr create user testuser defaultaccount=root
sudo su - testuser

Run a test job as the user 'testuser':
srun hostname

The output should be:
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01

If something goes wrong and Slurm marks your compute node as down, resume it with this command:
sudo scontrol update NodeName=de01 State=RESUME
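As a further check beyond srun, a minimal batch script exercises the same submission path. The partition name main and node de01 come from the sinfo output above; the filename hello.sbatch is arbitrary:

```shell
# Sketch: write a minimal Slurm batch script. Partition "main" matches the
# sinfo output above; the filename is arbitrary.
cat <<'EOF' > hello.sbatch
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=main
#SBATCH --output=hello.out
hostname
EOF
head -n 1 hello.sbatch
```

Submit it with `sbatch hello.sbatch`; once the job runs, hello.out should contain the compute node's hostname (de01 in this guide).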