As modern AI, machine learning, and high-performance computing (HPC) workloads grow more complex, the need for flexibility, scalability, and efficient resource utilization becomes paramount. NVIDIA’s Multi-Instance GPU (MIG) feature allows a single GPU to be partitioned into multiple instances, offering a highly flexible way to share GPU resources among multiple jobs. When combined with a powerful workload manager like Slurm on Azure, you can further enhance your HPC environments by efficiently allocating and scheduling compute jobs across these GPU instances.
In this blog post, we’ll dive into how to configure and leverage NVIDIA MIG with Slurm on Azure to optimize and accelerate compute-heavy workloads. We’ll explore what MIG is, why it’s beneficial for resource management, and walk through the steps to set it up in an Azure-based Slurm environment.
What is NVIDIA MIG (Multi-Instance GPU)?
NVIDIA’s MIG is a feature introduced with the NVIDIA A100 and also supported on H100 GPUs, which enables a single GPU to be split into multiple independent instances. Each of these instances acts as a smaller, isolated GPU with its own dedicated compute cores, memory, and bandwidth. This lets users allocate GPU resources efficiently to multiple smaller tasks or users, without leaving an entire GPU underutilized.
For example, instead of dedicating a full, powerful A100 GPU to a task that requires only a fraction of the GPU’s resources, you can create several MIG instances on a single GPU. This approach maximizes hardware utilization, particularly in multi-user or multi-tenant environments.
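Before partitioning a GPU, it helps to see which instance sizes the hardware offers. A minimal sketch using nvidia-smi, assuming you are on a MIG-capable GPU with MIG mode already enabled (enabling it is covered in Step 2 below):

```bash
# List the GPU instance profiles this GPU supports (memory per instance and how
# many instances of each profile fit); requires MIG mode to be enabled first
sudo nvidia-smi mig -lgip
```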
Why Use NVIDIA MIG with Slurm on Azure?
Combining NVIDIA MIG with Slurm on Azure offers several advantages for organizations running AI, machine learning, and HPC workloads:
- Improved GPU Utilization: MIG allows multiple users or jobs to share a single GPU without interfering with each other, ensuring that GPU resources are fully utilized. This is ideal for running smaller, concurrent workloads.
- Granular Resource Allocation: Slurm is a robust workload manager that can efficiently schedule jobs and allocate compute resources. With MIG, Slurm can assign specific GPU instances to each job, giving you fine-grained control over how GPU resources are allocated.
- Cost Efficiency: Instead of purchasing multiple GPUs or spinning up entire VM instances with single GPUs for smaller workloads, MIG enables you to split a single GPU, reducing infrastructure costs while still maximizing performance.
- Scalability on Azure: With Azure, you can easily scale out your GPU-accelerated Slurm clusters, adding or removing MIG-enabled GPU instances based on workload demands.
Prerequisites
Before getting started, ensure you have the following:
- An Azure subscription.
- Azure CLI installed and configured.
- A basic understanding of Slurm and NVIDIA GPU drivers.
- Access to NVIDIA A100 or H100 GPUs in Azure, which support MIG.
- CUDA installed for GPU-accelerated workloads.
Step 1: Set Up an Azure Virtual Machine with A100 or H100 GPUs
The first step is to set up an Azure VM with an NVIDIA A100 or H100 GPU that supports the MIG feature. Here’s how to provision a VM in Azure with a supported GPU:
- Log in to the Azure CLI:

  ```bash
  az login
  ```
- Create a Resource Group: Create a resource group where your VM and resources will reside:

  ```bash
  az group create --name myResourceGroup --location eastus
  ```
- Provision a VM with NVIDIA GPUs: Deploy a VM with A100 or H100 GPUs. The Standard_ND96asr_v4 SKU provides A100 GPUs that support MIG:

  ```bash
  az vm create \
    --resource-group myResourceGroup \
    --name myMIGVM \
    --image UbuntuLTS \
    --size Standard_ND96asr_v4 \
    --admin-username azureuser \
    --generate-ssh-keys
  ```
- Install NVIDIA Drivers: Once the VM is running, you need to install the NVIDIA drivers to enable GPU support:

  ```bash
  sudo apt update
  sudo apt install -y nvidia-driver-525 nvidia-utils-525
  ```
- Reboot the VM: Reboot the VM to ensure that the drivers are properly installed:

  ```bash
  sudo reboot
  ```
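Once the VM comes back up, it is worth confirming that the driver can see the GPU before moving on to MIG. A quick check (output will vary with your driver version and GPU count):

```bash
# Confirm the driver loaded and the A100/H100 GPUs are visible
nvidia-smi --query-gpu=name,driver_version --format=csv
```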
Step 2: Enable and Configure MIG on Your GPU
Now that the VM is set up and has GPU drivers installed, you can enable the MIG feature on your A100 or H100 GPU. Here’s how:
- Enable MIG Mode: Use the nvidia-smi tool to enable MIG mode on your GPU:

  ```bash
  sudo nvidia-smi -mig 1
  ```

  Note that the mode change may not take effect until the GPU is reset (for example with sudo nvidia-smi --gpu-reset -i 0) or the VM is rebooted.
- Create MIG Instances: Once MIG mode is enabled, you can create up to 7 GPU instances per GPU, depending on your workload needs. Here’s an example of creating two instances:

  ```bash
  sudo nvidia-smi mig -cgi 19,19 -C
  ```
  The 19 refers to a GPU instance profile ID; each profile defines how much memory and how many compute slices are assigned to an instance. Profile 19 corresponds to the 1g.5gb profile on an A100 40GB, so this command creates two small instances with roughly 5 GB of memory each, and the -C flag also creates the matching compute instances.

- Verify MIG Instances: To confirm that your MIG instances are created, run the following:

  ```bash
  nvidia-smi
  ```

  You should see the two newly created MIG instances listed.
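For a more targeted view, nvidia-smi can also list just the GPU instances and the MIG device identifiers that CUDA applications will see (the exact output depends on your driver version and the profiles you chose):

```bash
# List the GPU instances that were created
sudo nvidia-smi mig -lgi

# List all devices, including the MIG devices and their UUIDs
nvidia-smi -L
```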
Step 3: Install and Configure Slurm on Azure
With MIG enabled on your GPU, the next step is to install and configure Slurm to manage jobs and allocate GPU resources.
- Install Slurm: SSH into your VM and install Slurm using the following steps:

  ```bash
  sudo apt install -y slurm-wlm
  ```
- Configure Slurm for MIG: Update your Slurm configuration to account for the GPU instances. Add the following lines to your slurm.conf file:

  ```
  GresTypes=gpu
  NodeName=myMIGVM Gres=gpu:2 CPUs=48 RealMemory=180000 Sockets=1 CoresPerSocket=24 ThreadsPerCore=2
  PartitionName=gpu_partition Nodes=myMIGVM Default=YES MaxTime=INFINITE State=UP
  ```
  This tells Slurm that your VM exposes two GPU instances (as configured with MIG) and defines the CPUs and memory available for job scheduling; adjust these values to match your VM size and the number of MIG instances you created.

- Set Up cgroup for GRES: To ensure proper device isolation, enable device constraints in your cgroup.conf (the GresTypes=gpu setting belongs in slurm.conf and was already added above):

  ```
  ConstrainDevices=yes
  ```

  Slurm also needs a gres.conf that tells it which devices back each GPU instance; see the sketch after this list.
- Restart Slurm: After making these changes, restart the Slurm services (both the controller and the node daemon on a single-node setup):

  ```bash
  sudo systemctl restart slurmctld slurmd
  ```
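Here is a minimal gres.conf sketch, assuming Slurm 21.08 or newer built with NVML support so that the MIG instances can be discovered automatically (with older builds you would instead list the device files for each instance explicitly):

```
# /etc/slurm/gres.conf
# Let Slurm query NVML and discover the MIG instances on this node automatically
AutoDetect=nvml
```

With auto-detection, MIG instances are reported as typed GPUs (for example gpu:1g.5gb), which jobs can request explicitly.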
Step 4: Submit a Job to Slurm Using MIG GPUs
Now that Slurm is configured and running, you can submit a job to the MIG-enabled GPUs.
- Create a Job Script: Here’s an example job script that runs a GPU-accelerated task using one of the MIG instances (a variant that requests a specific MIG profile is sketched after this list):

  ```bash
  #!/bin/bash
  #SBATCH --job-name=gpu_test
  #SBATCH --gres=gpu:1          # Request 1 GPU instance
  #SBATCH --time=00:30:00
  #SBATCH --output=output.log

  module load cuda/11.3
  ./my_gpu_program
  ```
- Submit the Job: Submit the job to Slurm using the following command:

  ```bash
  sbatch my_job_script.sh
  ```
- Monitor the Job: You can monitor the progress of your job using the squeue command:

  ```bash
  squeue -u azureuser
  ```
- Check GPU Utilization: Once the job starts, use nvidia-smi to confirm that the job is utilizing the correct MIG instance:

  ```bash
  nvidia-smi
  ```
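If gres.conf uses NVML auto-detection as sketched in Step 3, the MIG instances appear as typed GPUs and a job can request a specific profile rather than a generic gpu:1. A hedged variant of the script above, assuming the 1g.5gb profile from Step 2 (the type name must match what your node actually reports, for example via sinfo -o "%n %G"):

```bash
#!/bin/bash
#SBATCH --job-name=gpu_typed_test
#SBATCH --gres=gpu:1g.5gb:1   # Request one 1g.5gb MIG instance specifically
#SBATCH --time=00:30:00
#SBATCH --output=typed_output.log

# Slurm sets CUDA_VISIBLE_DEVICES to the MIG device assigned to this job
echo "Assigned device(s): $CUDA_VISIBLE_DEVICES"
./my_gpu_program
```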
Step 5: Optimize and Scale MIG Workloads
With MIG and Slurm set up, you can now optimize the workload distribution across the GPU instances. For example:
- Multiple Concurrent Jobs: By splitting a GPU into multiple MIG instances, Slurm can allocate one or more jobs to each GPU instance. This enables concurrent GPU tasks to run without bottlenecking or underutilizing resources (see the sketch after this list).
- Auto-Scaling in Azure: Use Azure’s autoscaling capabilities to dynamically add more GPU VMs as workload demands increase. You can configure the Azure VM Scale Set to scale out the GPU nodes as needed.
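As a quick illustration of the first point, here is a simple sketch that submits one copy of the earlier job script per MIG instance; Slurm runs them side by side as long as free GPU instances remain (the count of 2 matches the two instances created in Step 2):

```bash
# Submit one job per MIG instance; Slurm schedules each onto a separate instance
for i in 1 2; do
  sbatch --job-name="gpu_test_$i" my_job_script.sh
done

# Watch the jobs run concurrently
squeue -u azureuser
```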
Conclusion
Combining the power of NVIDIA MIG with Slurm on Azure provides a flexible, scalable, and efficient way to manage GPU resources for HPC, AI, and machine learning workloads. MIG’s ability to partition a GPU into smaller instances allows for granular control and optimal resource usage, especially in multi-tenant or multi-job environments.