As machine learning, AI, and high-performance computing workloads continue to expand, there’s an increasing need for powerful infrastructure to handle these demanding tasks. Azure Kubernetes Service (AKS) offers an excellent platform for orchestrating and managing containerized applications at scale. However, many AI/ML workloads also require the raw compute power of GPUs (Graphics Processing Units) to accelerate training and inference tasks. By combining AKS with GPU-enabled nodes, you can supercharge your containerized workloads with high-performance compute capabilities.
In this blog post, we’ll dive into how you can use Azure Kubernetes Service with GPUs to run AI, machine learning, and other compute-heavy tasks. We’ll cover setting up AKS with GPU nodes, configuring your environment, and optimizing workloads for GPU utilization.
Why Use GPUs with AKS?
GPUs are specialized hardware accelerators that are significantly faster than CPUs for certain types of calculations, particularly those involved in machine learning, AI, and deep learning tasks. Key benefits of using GPUs in AKS include:
- Faster Computation: AI model training, image processing, and scientific simulations require large-scale parallel computations, which GPUs handle more efficiently than CPUs.
- Scalability: AKS allows you to run containerized workloads on a scalable Kubernetes cluster. By integrating GPU nodes, you can scale your AI/ML tasks across many GPU-enabled nodes, ensuring high throughput for complex tasks.
- Cost Efficiency: You can optimize GPU utilization by using Kubernetes’ autoscaling capabilities. This means only spinning up GPU nodes when needed, making GPU usage more cost-efficient.
Prerequisites
Before you get started, ensure you have the following:
- An Azure subscription.
- Azure CLI installed.
- A basic understanding of Kubernetes and Docker.
- Familiarity with Azure Kubernetes Service (AKS).
- A GPU-enabled workload, such as a TensorFlow or PyTorch application.
Step 1: Create an AKS Cluster with GPU Nodes
To use GPUs with AKS, you’ll need to create a cluster that includes GPU-enabled nodes. Azure provides a variety of N-series virtual machines (VMs) that are designed specifically for GPU workloads.
- Log in to Azure CLI: Open your terminal and log in to Azure:

```bash
az login
```
- Create a Resource Group: Create a resource group where your AKS cluster will reside:

```bash
az group create --name myResourceGroup --location eastus
```
- Create an AKS Cluster with GPU Nodes: Now, create an AKS cluster with a node pool that includes GPU-enabled VMs. You can use the Standard_NC6s_v3 or Standard_ND40rs_v2 VM sizes, which provide access to NVIDIA GPUs.

```bash
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --enable-addons monitoring \
  --generate-ssh-keys
```
This command will create a Kubernetes cluster with one node, using the Standard_NC6s_v3 VM size, which is equipped with NVIDIA Tesla V100 GPUs. Once the cluster is ready, run `az aks get-credentials --resource-group myResourceGroup --name myAKSCluster` so that kubectl points at the new cluster.
- Install the NVIDIA Device Plugin: After creating the cluster, you need to make the GPUs visible to Kubernetes. The AKS node images for N-series VMs come with the NVIDIA drivers preinstalled; what you still need is the NVIDIA device plugin, deployed as a DaemonSet on every GPU node in your cluster:
```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.0.0-beta/nvidia-device-plugin.yml
```
This will deploy the NVIDIA device plugin DaemonSet, which manages the GPUs in your AKS nodes and makes them available for workloads.
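To confirm that the plugin registered the GPUs, you can check that each GPU node now advertises an nvidia.com/gpu resource. A quick sanity check (the node name below is just a placeholder for one of your own nodes):

```bash
# List the nodes and confirm the GPU node is Ready
kubectl get nodes

# Check that the node advertises nvidia.com/gpu under Capacity and Allocatable
kubectl describe node <your-gpu-node-name> | grep nvidia.com/gpu
```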
Step 2: Deploy a GPU-Accelerated Application on AKS
Now that your AKS cluster is up and running with GPU nodes, let’s deploy a GPU-accelerated application. For this example, we’ll use a simple TensorFlow application that performs matrix multiplication to take advantage of GPU acceleration.
- Create a Docker Image: First, create a Docker image based on the GPU-enabled TensorFlow base image. Here's a basic Dockerfile you can use:

```dockerfile
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /app
COPY . /app
CMD ["python", "gpu_test.py"]
```
- gpu_test.py: Here's a Python script (gpu_test.py) that uses TensorFlow to perform matrix multiplication on the GPU:

```python
import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# Create a matrix multiplication on GPU
with tf.device('/GPU:0'):
    a = tf.random.normal([10000, 10000])
    b = tf.random.normal([10000, 10000])
    c = tf.matmul(a, b)

print("Matrix multiplication result: ", c)
```
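If you want explicit confirmation that each operation actually lands on the GPU, TensorFlow can log device placement. A small optional addition you could make to gpu_test.py (a sketch, not part of the original script):

```python
import tensorflow as tf

# Print the device (CPU or GPU) chosen for every operation
tf.debugging.set_log_device_placement(True)

a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
c = tf.matmul(a, b)  # the log should show this op placed on /device:GPU:0 when a GPU is present
```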
- Build and Push the Docker Image: Build the Docker image and push it to Azure Container Registry (ACR), or any container registry of your choice:

```bash
# Build the image
docker build -t myacr.azurecr.io/tensorflow-gpu:latest .

# Push the image to ACR
docker push myacr.azurecr.io/tensorflow-gpu:latest
```
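The push assumes you're already authenticated to the registry and that your AKS cluster is allowed to pull from it. A minimal sketch, assuming a registry named myacr (adjust the names to match your environment):

```bash
# Log in to the registry before pushing
az acr login --name myacr

# Grant the AKS cluster pull access to the registry
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --attach-acr myacr
```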
- Deploy to AKS: Now, create a Kubernetes deployment that uses GPU resources. Save the following YAML file (gpu-deployment.yaml):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: gpu-app
        image: myacr.azurecr.io/tensorflow-gpu:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Request one GPU
      nodeSelector:
        kubernetes.io/hostname: "aks-nodepool1-XXXXXXXX-0"  # Specify your GPU node
```
- Apply the Deployment: Deploy the application to your AKS cluster:

```bash
kubectl apply -f gpu-deployment.yaml
```
- Verify GPU Utilization: Once the deployment is complete, you can check the logs to see if the GPU is being used:

```bash
kubectl logs <pod-name>
```
You should see a message indicating how many GPUs are available and that TensorFlow is utilizing the GPU for computation.
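You can also verify GPU visibility directly from inside the running container with nvidia-smi, which is usually available in GPU pods because the container runtime mounts the host's driver utilities. A quick check (replace <pod-name> with the pod created by the deployment):

```bash
# Find the pod created by the deployment
kubectl get pods -l app=gpu-app

# Run nvidia-smi inside the container to list the GPUs it can see
kubectl exec <pod-name> -- nvidia-smi
```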
Step 3: Optimize and Scale Your GPU Workloads
With AKS, you can take advantage of Kubernetes’ native scaling and orchestration capabilities to optimize your GPU usage.
- Horizontal Pod Autoscaling (HPA): Set up Horizontal Pod Autoscaling to scale your GPU-enabled pods based on CPU or custom metrics:

```bash
kubectl autoscale deployment gpu-deployment --cpu-percent=50 --min=1 --max=10
```
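The same policy can also be expressed declaratively with an autoscaling/v2 HorizontalPodAutoscaler manifest, which is easier to keep in source control. A sketch equivalent to the command above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-deployment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```

Note that CPU-based scaling only works if the pods declare CPU requests, and CPU load may not reflect GPU saturation for GPU-bound jobs, so custom metrics are often a better fit.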
- Cluster Autoscaler: Enable the cluster autoscaler to automatically scale the number of GPU nodes in your AKS cluster based on the current workload:

```bash
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```
This ensures that your AKS cluster only uses GPU nodes when they are needed, optimizing both performance and cost.
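In practice, it is often preferable to keep GPU VMs in their own node pool and autoscale just that pool, so non-GPU workloads don't occupy expensive GPU nodes. A sketch of adding a dedicated, autoscaling GPU node pool (the pool name gpunp is just an example):

```bash
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```

You can also taint the pool (for example with the --node-taints flag) so that only pods that tolerate the taint are scheduled onto GPU nodes.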
Best Practices for Using GPUs with AKS
- Right-Size Your GPU Instances: Choose the correct VM size based on the workload. For example, NC-series VMs are optimized for AI and deep learning, while ND-series VMs are ideal for large-scale model training.
- Use GPU-Optimized Containers: When building Docker images for GPU workloads, ensure that you’re using GPU-optimized base images, such as TensorFlow-GPU or PyTorch-GPU, to take full advantage of the hardware.
- Monitor GPU Usage: Use Kubernetes monitoring tools like Prometheus and Grafana to track GPU usage metrics and ensure that your workloads are efficiently using the available GPU resources.
- Set Resource Limits: Always define resource requests and limits for your GPU workloads in your Kubernetes manifests. This ensures that Kubernetes schedules your workloads on GPU nodes and avoids over-provisioning; a minimal example follows this list.
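As a reference for that last point, here's a minimal container resources block combining CPU and memory requests with a GPU limit (the values are illustrative only; for nvidia.com/gpu the request and limit must be equal, so specifying the limit is sufficient):

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: 1  # GPUs can only be requested as whole devices
```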
Conclusion
By combining the powerful orchestration capabilities of Azure Kubernetes Service (AKS) with GPU-accelerated nodes, you can scale and manage AI, machine learning, and compute-intensive workloads efficiently. Whether you’re training large neural networks or processing large-scale data, using GPUs in AKS can significantly reduce processing time and costs. With proper configuration, scaling, and optimization, AKS with GPUs can unlock immense computing power for your containerized workloads.