November 13, 2024 in HPC, OpenCHAMI, LANL by Travis Cotton and Alex Lovell-Troy
This blog post is an abridged version of the training we give internal sysadmins at LANL. It guides you through the whole process of building and deploying OpenCHAMI on a set of small teaching clusters that we maintain for that purpose. For more details and example image configurations, visit our repo at github.com/OpenCHAMI/mini-bootcamp.
To get started, you'll need a head node running a RHEL-compatible distribution (the commands below use dnf) with network access to your cluster nodes' BMCs.
Install necessary packages for OpenCHAMI deployment:
dnf install -y ansible git podman jq
Edit your /etc/hosts file to include entries for your cluster. For example:
172.16.0.254 stratus.openchami.cluster
172.16.0.1 st001
#...additional entries for each node
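To confirm the entries resolve before moving on, you can spot-check one with getent (st001 is the example shortname from above):

getent hosts st001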
Install powerman and conman for node power and console management:
dnf install -y powerman conman jq
Configure Powerman: Add device and node info to /etc/powerman/powerman.conf using your shortnames.
device "ipmi0" "ipmipower" "/usr/sbin/ipmipower -D lanplus -u admin -p Password123! -h pst[001-009] -I 17 -W ipmiping |&"
node "st[001-009]" "ipmi0" "pst[001-009]"
Start Powerman:
systemctl start powerman
systemctl enable powerman
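We install conman above but don't show its configuration, so here is a minimal sketch of /etc/conman.conf to go with the powerman setup; the ipmitool.exp path, credentials, and BMC names mirror the examples above and are assumptions that may differ on your system:

server keepalive=ON
console name="st001" dev="/usr/lib/conman/exec/ipmitool.exp pst001 admin Password123!"
#...one console line per node

Then start it the same way:

systemctl start conman
systemctl enable conman

Once both daemons are up, pm -q should report a power state for every node.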
Use buildah to create a lightweight test image.
Install buildah:
dnf install -y buildah
Build the base image:
CNAME=$(buildah from scratch)
MNAME=$(buildah mount $CNAME)
dnf groupinstall -y --installroot=$MNAME --releasever=8 "Minimal Install"
Set up the kernel and dependencies:
dnf install -y --installroot=$MNAME kernel dracut-live fuse-overlayfs cloud-init
Rebuild initrd:
buildah run --tty $CNAME bash -c 'dracut --add "dmsquash-live livenet network-manager" --kver $(basename /lib/modules/*) -N -f --logfile /tmp/dracut.log 2>/dev/null'
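Before committing, a quick sanity check (not part of the original flow) confirms dracut actually produced an initramfs inside the image:

buildah run $CNAME ls -lh /boot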
Save the image:
buildah commit $CNAME test-image:v1
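Since OpenCHAMI network-boots nodes, the kernel and initramfs eventually have to be served out of the image. The deployment recipes below have their own handling for this, but as a sketch of doing it by hand, reusing the $MNAME mount from earlier and a hypothetical /srv/boot serving directory:

cp $MNAME/boot/vmlinuz-* /srv/boot/
cp $MNAME/boot/initramfs-* /srv/boot/
buildah umount $CNAME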
OpenCHAMI relies on several key microservices, including SMD for hardware state and inventory, BSS for boot scripts, and a cloud-init server for node configuration; the recipe below runs them as podman quadlets.
Clone the deployment recipes repository:
git clone https://github.com/OpenCHAMI/deployment-recipes.git
Go to the LANL podman-quadlets recipe:
cd deployment-recipes/lanl/podman-quadlets
Inventory Setup: Edit the inventory/01-ochami file to specify your hostname.
Cluster Configurations: Update inventory/group_vars/ochami/cluster.yaml with your cluster name and shortname.
SSH Key Pair: Generate an SSH key pair and add the public key to inventory/group_vars/ochami/cluster.yaml under cluster_boot_ssh_pub_key.
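For example (key type and file name are just one choice; the contents of the .pub file are what go in the YAML):

ssh-keygen -t ed25519 -f ~/.ssh/ochami-boot -N ''
cat ~/.ssh/ochami-boot.pub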
Run the Playbook: First apply only the configuration tasks (the -t configs tag limits the run to them):
ansible-playbook -l $HOSTNAME -c local -i inventory -t configs ochami_playbook.yaml
After rebooting, run the full playbook:
ansible-playbook -l $HOSTNAME -c local -i inventory ochami_playbook.yaml
Check that the expected containers are running:
# podman ps | awk '{print $NF}' | sort
Verify Services: Ensure SMD, BSS, and cloud-init are populated correctly.
ochami-cli smd --get-components
ochami-cli bss --get-bootparams
ochami-cli cloud-init --get-ci-data --name compute
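To spot-check a single node instead of scrolling the full dump, you can filter with jq (installed earlier); this assumes SMD's usual JSON shape with a top-level Components array:

ochami-cli smd --get-components | jq '.Components[] | {ID, State}'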
Boot Nodes: Start and monitor node boots using the pm and conman commands, as shown below.
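Concretely, pm -1 powers nodes on, pm -q reports their power state, and conman attaches to a node's serial console (type &. to detach); node names follow the earlier examples:

pm -1 st[001-009]
pm -q
conman st001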
Logs for Debugging: Open additional terminal windows to monitor logs for DHCP, BSS, and cloud-init.
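The pattern is one follow-mode log per window; the container names below are placeholders, since the actual names depend on the recipe (check the podman ps output above for yours):

podman logs -f dnsmasq
podman logs -f bss
podman logs -f cloud-init-server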
For more complex deployments, use the image-builder tool to build layered images.
podman build -t image-builder:test -f dockerfiles/dnf/Dockerfile_interactive .
podman run --device /dev/fuse -it --name image-builder --rm -v $PWD:/data image-builder:test 'image-build --log-level INFO --config /data/image-configs/base.yaml'
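When the build finishes, the layered image lands in local container storage, which you can confirm with:

podman images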
Use tpm-manager to handle secure data distribution to nodes.

By now, you should have a fully deployed OpenCHAMI environment, equipped with essential microservices and custom-built images, ready to scale. As a final step, consider adding further integrations like Slurm for job scheduling and network-mounted filesystems for additional storage.