Manage the GPU stack with the NVIDIA GPU Operator on Google Kubernetes Engine (GKE)


This page helps you decide when to use the NVIDIA GPU Operator and shows you how to enable it on GKE.

Overview

Operators are Kubernetes software extensions that allow users to create custom resources that manage applications and their components. You can use operators to automate complex tasks beyond what Kubernetes itself provides, such as deploying and upgrading applications.

The NVIDIA GPU Operator is a Kubernetes operator that provides a common infrastructure and API for deploying, configuring, and managing software components needed to provision NVIDIA GPUs in a Kubernetes cluster. The NVIDIA GPU Operator provides you with a consistent experience, simplifies GPU resource management, and streamlines the integration of GPU-accelerated workloads into Kubernetes.
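
For example, once the NVIDIA GPU Operator is installed (as described later on this page), it registers custom resource definitions such as ClusterPolicy, which captures the operator's configuration. The following optional check is a minimal sketch; the exact set of resources depends on the GPU Operator version:

  # List the CustomResourceDefinitions that the GPU Operator registers.
  kubectl get crds | grep -i nvidia

  # View the operator's ClusterPolicy custom resource.
  kubectl get clusterpolicies.nvidia.com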

Why use the NVIDIA GPU Operator?

We recommend using GKE GPU management for your GPU nodes, because GKE fully manages the GPU node lifecycle. To get started with using GKE to manage your GPU nodes, see Run GPUs in Standard node pools.

Alternatively, the NVIDIA GPU Operator might be a suitable option for you if you want a consistent experience across multiple cloud service providers, if you already use the NVIDIA GPU Operator, or if you use software that depends on the NVIDIA GPU Operator.

For more considerations when deciding between these options, refer to Manage the GPU stack through GKE or the NVIDIA GPU Operator on GKE.

Limitations

The NVIDIA GPU Operator is supported on both Container-Optimized OS (COS) and Ubuntu node images with the following limitations:

  • The NVIDIA GPU Operator is supported on GKE with GPU Operator version 24.6.0 and later.
  • The NVIDIA GPU Operator is not supported on Autopilot clusters.
  • The NVIDIA GPU Operator is not supported on Windows node images.
  • The NVIDIA GPU Operator is not managed by GKE. To upgrade the NVIDIA GPU Operator, refer to the NVIDIA documentation.
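
For reference, upgrading the operator is typically done with Helm. The following is a minimal sketch that assumes the Helm-based installation shown later on this page; RELEASE_NAME is a placeholder for your generated release name, and you should follow the NVIDIA documentation for any version-specific upgrade steps:

  # Find the release name of the installed GPU Operator chart.
  helm list -n gpu-operator

  # Refresh the chart repository and upgrade the release, keeping the existing values.
  helm repo update
  helm upgrade RELEASE_NAME nvidia/gpu-operator -n gpu-operator --reuse-values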

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Make sure you meet the requirements in Run GPUs in Standard node pools.
  • Verify that you have Helm installed in your development environment. Helm comes pre-installed on Cloud Shell.

    While there is no specific Helm version requirement, you can use the following command to verify that you have Helm installed.

    helm version
    

    If the output is similar to Command helm not found, then you can install the Helm CLI by running this command:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
    

Create and set up the GPU node pool

To create and set up the GPU node pool, follow these steps:

  1. Create a GPU node pool by following the instructions on how to Create a GPU node pool with the following modifications:

    • Set gpu-driver-version=disabled to skip automatic GPU driver installation, because it's not supported when you use the NVIDIA GPU Operator.
    • Set --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" to disable the GKE-managed GPU device plugin DaemonSet.

    Run the following command and append other flags for GPU node pool creation as needed:

    gcloud container node-pools create POOL_NAME \
      --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=disabled \
      --node-labels="gke-no-default-nvidia-gpu-device-plugin=true"
    

    Replace the following:

    • POOL_NAME: the name that you choose for the node pool.
    • GPU_TYPE: the type of GPU accelerator that you want to use. For example, nvidia-h100-80gb.
    • AMOUNT: the number of GPUs to attach to nodes in the node pool.

    For example, the following command creates a GKE node pool, a3nodepool, with H100 GPUs in the regional cluster a3-cluster. In this example, the GKE-managed GPU device plugin DaemonSet and automatic driver installation are disabled.

    gcloud container node-pools create a3nodepool \
      --region=us-central1 --cluster=a3-cluster \
      --node-locations=us-central1-a \
      --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=disabled \
      --machine-type=a3-highgpu-8g \
      --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \
      --num-nodes=1
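
    Optionally, verify the new node pool's configuration. The following check uses the example names from the previous command; adjust the pool name, cluster, and location to match your environment:

    gcloud container node-pools describe a3nodepool \
      --cluster=a3-cluster --region=us-central1

    The output includes the accelerator settings and the node labels that you set.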
    
  2. Get the authentication credentials for the cluster by running the following command:

    USE_GKE_GCLOUD_AUTH_PLUGIN=True \
    gcloud container clusters get-credentials CLUSTER_NAME [--zone COMPUTE_ZONE] [--region COMPUTE_REGION]
    

    Replace the following:

    • CLUSTER_NAME: the name of the cluster containing your node pool.
    • COMPUTE_REGION or COMPUTE_ZONE: specify the cluster's region or zone based on whether your cluster is a regional or zonal cluster, respectively.

    The output is similar to the following:

    Fetching cluster endpoint and auth data.
    kubeconfig entry generated for CLUSTER_NAME.
    
  3. (Optional) Verify that you can connect to the cluster.

    kubectl get nodes -o wide
    

    You should see a list of all your nodes running in this cluster.
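
    Optionally, you can also confirm that the GPU nodes carry the label that disables the GKE-managed GPU device plugin, which you set when creating the node pool:

    kubectl get nodes -l gke-no-default-nvidia-gpu-device-plugin=true

    The output lists the nodes in the GPU node pool that you created.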

  4. Create the namespace gpu-operator for the NVIDIA GPU Operator by running this command:

    kubectl create ns gpu-operator
    

    The output is similar to the following:

    namespace/gpu-operator created
    
  5. Create a resource quota in the gpu-operator namespace by running the following command. The resource quota allows Pods that use the system-node-critical and system-cluster-critical priority classes to be scheduled in this namespace:

    kubectl apply -n gpu-operator -f - << EOF
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-operator-quota
    spec:
      hard:
        pods: 100
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
            - system-node-critical
            - system-cluster-critical
    EOF
    

    The output is similar to the following:

    resourcequota/gpu-operator-quota created
    
  6. View the resource quota for the gpu-operator namespace:

    kubectl get -n gpu-operator resourcequota gpu-operator-quota
    

    The output is similar to the following:

    NAME                 AGE     REQUEST       LIMIT
    gpu-operator-quota   2m27s   pods: 0/100
    
  7. Manually install the drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, refer to Manually install NVIDIA GPU drivers.

    • If using COS, run the following command to deploy the installation DaemonSet and install the default GPU driver version:

      kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
      
    • If using Ubuntu, the installation DaemonSet that you deploy depends on the GPU type and on the GKE node version as described in the Ubuntu section of the instructions.
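
    Optionally, before checking the installation logs in the next step, you can confirm that the driver installer Pods are running. This check uses the same label selector as the next step:

    kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer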

  8. Verify the GPU driver version by running this command:

    kubectl logs -l k8s-app=nvidia-driver-installer  \
      -c "nvidia-driver-installer" --tail=-1 -n kube-system
    

    If GPU driver installation is successful, the output is similar to the following:

    I0716 03:17:38.863927    6293 cache.go:66] DRIVER_VERSION=535.183.01
    …
    I0716 03:17:38.863955    6293 installer.go:58] Verifying GPU driver installation
    I0716 03:17:41.534387    6293 install.go:543] Finished installing the drivers.
    

Install the NVIDIA GPU Operator

This section shows how to install the NVIDIA GPU Operator using Helm. To learn more, refer to NVIDIA's documentation on installing the NVIDIA GPU Operator.

  1. Add the NVIDIA Helm repository:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
      && helm repo update
    
  2. Install the NVIDIA GPU Operator using Helm with the following configuration options:

    • Make sure the GPU Operator version is 24.6.0 or later.
    • Configure the driver install path in the GPU Operator with hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia.
    • Set the toolkit install path toolkit.installDir=/home/kubernetes/bin/nvidia for both COS and Ubuntu. In COS, the /home directory is writable and serves as a stateful location for storing the NVIDIA runtime binaries. To learn more, refer to the COS Disks and file system overview.
    • Enable the Container Device Interface (CDI) in the GPU Operator with cdi.enabled=true and cdi.default=true, because legacy mode is unsupported. CDI is required for both COS and Ubuntu on GKE.
    • Set driver.enabled=false so that the GPU Operator doesn't install GPU drivers, because you installed the drivers manually in the previous section.

    helm install --wait --generate-name \
      -n gpu-operator \
      nvidia/gpu-operator \
      --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
      --set toolkit.installDir=/home/kubernetes/bin/nvidia \
      --set cdi.enabled=true \
      --set cdi.default=true \
      --set driver.enabled=false
    

    To learn more about these settings, refer to the Common Chart Customization Options and Common Deployment Scenarios in the NVIDIA documentation.
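
    Optionally, confirm that the Helm release was created. Because the install command uses --generate-name, Helm generates the release name:

    helm list -n gpu-operator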

  3. Verify that the NVIDIA GPU Operator is successfully installed.

    1. To check that the GPU Operator operands are running correctly, run the following command.

      kubectl get pods -n gpu-operator
      

      The output looks similar to the following:

      NAME                                                          READY    STATUS    RESTARTS   AGE
      gpu-operator-5c7cf8b4f6-bx4rg                                 1/1      Running   0          11m
      gpu-operator-node-feature-discovery-gc-79d6d968bb-g7gv9       1/1      Running   0          11m
      gpu-operator-node-feature-discovery-master-6d9f8d497c-thhlz   1/1      Running   0          11m
      gpu-operator-node-feature-discovery-worker-wn79l              1/1      Running   0          11m
      gpu-feature-discovery-fs9gw                                   1/1      Running   0          8m14s
      gpu-operator-node-feature-discovery-worker-bdqnv              1/1      Running   0          9m5s
      nvidia-container-toolkit-daemonset-vr8fv                      1/1      Running   0          8m15s
      nvidia-cuda-validator-4nljj                                   0/1      Completed 0          2m24s
      nvidia-dcgm-exporter-4mjvh                                    1/1      Running   0          8m15s
      nvidia-device-plugin-daemonset-jfbcj                          1/1      Running   0          8m15s
      nvidia-mig-manager-kzncr                                      1/1      Running   0          2m5s
      nvidia-operator-validator-fcrr6                               1/1      Running   0          8m15s
      
    2. To check that the GPU count is configured correctly in the node's 'Allocatable' field, run the following command:

      kubectl describe node GPU_NODE_NAME | grep Allocatable -A7
      

      Replace GPU_NODE_NAME with the name of the node that has GPUs.

      The output is similar to the following:

      Allocatable:
        cpu:                11900m
        ephemeral-storage:  47060071478
        hugepages-1Gi:      0
        hugepages-2Mi:      0
        memory:             80403000Ki
        nvidia.com/gpu:     1           # shows the correct count of GPUs attached to the node
        pods:               110
      
    3. To check that a GPU workload runs correctly, you can use the cuda-vectoradd tool:

      cat << EOF | kubectl create -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: cuda-vectoradd
      spec:
        restartPolicy: OnFailure
        containers:
        - name: vectoradd
          image: nvidia/samples:vectoradd-cuda11.2.1
          resources:
            limits:
              nvidia.com/gpu: 1
      EOF
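
      Wait for the Pod to finish. You can optionally check its status; when the sample completes successfully, the STATUS column shows Completed:

      kubectl get pod cuda-vectoradd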
      

      Then, run the following command:

      kubectl logs cuda-vectoradd
      

      The output is similar to the following:

      [Vector addition of 50000 elements]
      Copy input data from the host memory to the CUDA device
      CUDA kernel launch with 196 blocks of 256 threads
      Copy output data from the CUDA device to the host memory
      Test PASSED
      Done
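
      When you're done testing, you can delete the sample Pod:

      kubectl delete pod cuda-vectoradd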
      

What's next