New Content for NVIDIA T4 GPU Support
Created 2 new topics in Node management - HW Acceleration Devices:

- Configure NVIDIA GPU Operator for PCI Passthrough
- Delete the GPU Operator

Patch 4: Added NVIDIA information in Planning - Verified Comm HW
Patch 5: Acted on Greg's comment
Patch 6: Updated Index as requested in review; worked on comments from Ghada
Patches 7 and 8: Acted on Mary's comments; added 'release-caveat'; acted on Ron's comments

Story: 2008434
Task: 42220
https://review.opendev.org/c/starlingx/docs/+/785251
Signed-off-by: Adil <mohamed.adilassakkali@windriver.com>
Change-Id: I337e33e805d89621436b35c238aca800b0727e0b

Parent: 1521b4c4a9
Commit: 3053ff6e40
.. fgy1616003207054

.. _configure-nvidia-gpu-operator-for-pci-passthrough:

=================================================
Configure NVIDIA GPU Operator for PCI Passthrough
=================================================

|release-caveat|

This section provides instructions for configuring the NVIDIA GPU Operator.

.. rubric:: |context|

.. note::

   NVIDIA GPU Operator is supported only for the standard performance
   kernel profile. The low-latency performance kernel profile is not
   supported.

NVIDIA GPU Operator automates the installation, maintenance, and management
of the NVIDIA software needed to provision NVIDIA GPUs, and the provisioning
of pods that require nvidia.com/gpu resources.

NVIDIA GPU Operator is delivered as a Helm chart that installs a number of
services and pods to automate the provisioning of NVIDIA GPUs with the
required NVIDIA software components. These components include:

.. _fgy1616003207054-ul-sng-blk-z4b:

- NVIDIA drivers (to enable CUDA, a parallel computing platform)

- Kubernetes device plugin for GPUs

- NVIDIA Container Runtime

- Automatic node labelling

- DCGM (NVIDIA Data Center GPU Manager) based monitoring
.. rubric:: |prereq|

Download the **gpu-operator-v3-1.6.0.3.tgz** file at
`http://mirror.starlingx.cengn.ca/mirror/starlingx/
<http://mirror.starlingx.cengn.ca/mirror/starlingx/>`__.

Use the following steps to configure the GPU Operator container:
.. rubric:: |proc|

#. Lock the host(s).

   .. code-block:: none

      ~(keystone_admin)]$ system host-lock <hostname>

#. Configure the container runtime host path to point to the NVIDIA runtime,
   which will be installed by the GPU Operator Helm deployment.

   .. code-block:: none

      ~(keystone_admin)]$ system service-parameter-add platform container_runtime custom_container_runtime=nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime

#. Unlock the host(s). Once unlocked, the host reboots automatically.

   .. code-block:: none

      ~(keystone_admin)]$ system host-unlock <hostname>

#. Create the RuntimeClass resource definition and apply it to the system.

   .. code-block:: none

      cat > nvidia.yml << EOF
      kind: RuntimeClass
      apiVersion: node.k8s.io/v1beta1
      metadata:
        name: nvidia
      handler: nvidia
      EOF

   .. code-block:: none

      ~(keystone_admin)]$ kubectl apply -f nvidia.yml
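   On Kubernetes 1.20 and later, RuntimeClass is also available through the
   stable ``node.k8s.io/v1`` API, and the ``v1beta1`` API was removed in
   Kubernetes 1.25. If the apply fails on a newer Kubernetes version (an
   assumption about the cluster in use), the same definition can be written
   against the ``v1`` API:

   .. code-block:: yaml

      # Same RuntimeClass as above, using the stable v1 API
      # (assumes Kubernetes 1.20 or later).
      kind: RuntimeClass
      apiVersion: node.k8s.io/v1
      metadata:
        name: nvidia
      handler: nvidia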
#. Install the GPU Operator Helm charts.

   .. code-block:: none

      ~(keystone_admin)]$ helm install --name gpu-operator /path/to/gpu-operator-1.6.0.3.tgz

#. Check that the GPU Operator is deployed using the following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl get pods -A
      NAMESPACE               NAME      READY   STATUS      RESTARTS   AGE
      default                 g-node..  1/1     Running     1          7h54m
      default                 g-node..  1/1     Running     1          7h54m
      default                 gpu-ope.  1/1     Running     1          7h54m
      gpu-operator-resources  gpu-..    1/1     Running     4          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  0/1     Completed   0          7h53m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m

   The plugin validation pod is marked Completed.

#. Check that the nvidia.com/gpu resources are available using the following
   command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl describe nodes <hostname> | grep nvidia
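   If the GPU Operator provisioned the node successfully, the filtered output
   should include ``nvidia.com/gpu`` entries from the node's Capacity and
   Allocatable sections, similar to the following (illustrative only; the
   count depends on the number of GPUs on the host):

   .. code-block:: none

      nvidia.com/gpu:  1
      nvidia.com/gpu:  1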
#. Create a pod that uses the NVIDIA RuntimeClass and requests an
   nvidia.com/gpu resource. Update the nvidia-usage-example-pod.yml file to
   launch a pod that uses an NVIDIA GPU. For example:

   .. code-block:: none

      cat <<EOF > nvidia-usage-example-pod.yml
      apiVersion: v1
      kind: Pod
      metadata:
        name: nvidia-usage-example-pod
      spec:
        runtimeClassName: nvidia
        containers:
        - name: nvidia-usage-example-pod
          image: nvidia/samples:cuda10.2-vectorAdd
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      EOF
#. Create the pod using the following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl create -f nvidia-usage-example-pod.yml

#. Check that the pod has been set up correctly. The status of the NVIDIA
   device is displayed in the table.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl exec -it nvidia-usage-example-pod -- nvidia-smi
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
      | N/A   28C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+

For information on deleting the GPU Operator, see :ref:`Delete the GPU
Operator <delete-the-gpu-operator>`.
.. nsr1616019467549

.. _delete-the-gpu-operator:

=======================
Delete the GPU Operator
=======================

|release-caveat|

Use the commands in this section to delete the GPU Operator, if required.

.. rubric:: |prereq|

Ensure that all user-generated pods with access to ``nvidia.com/gpu``
resources are deleted first.

.. rubric:: |proc|

#. Remove the GPU Operator pods from the system using the following commands:

   .. code-block:: none

      ~(keystone_admin)]$ helm delete --purge gpu-operator
      ~(keystone_admin)]$ kubectl delete runtimeclasses.node.k8s.io nvidia
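   .. note::

      ``helm delete --purge`` is Helm 2 syntax. If the system is running
      Helm 3 (an assumption; check with ``helm version``), the ``--purge``
      flag no longer exists and the equivalent command is:

      .. code-block:: none

         ~(keystone_admin)]$ helm uninstall gpu-operator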
#. Remove the service parameter ``platform container_runtime
   custom_container_runtime`` from the system using the following steps:

   #. Lock the host(s).

      .. code-block:: none

         ~(keystone_admin)]$ system host-lock <hostname>

   #. List the service parameters using the following command.

      .. code-block:: none

         ~(keystone_admin)]$ system service-parameter-list

   #. Remove the service parameter using the following command.

      .. code-block:: none

         ~(keystone_admin)]$ system service-parameter-delete <service param ID>

      where ``<service param ID>`` is the ID of the service parameter, for
      example, 3c509c97-92a6-4882-a365-98f1599a8f56.

   #. Unlock the host(s).

      .. code-block:: none

         ~(keystone_admin)]$ system host-unlock <hostname>

For information on configuring the GPU Operator, see :ref:`Configure NVIDIA
GPU Operator for PCI Passthrough
<configure-nvidia-gpu-operator-for-pci-passthrough>`.
Hardware acceleration devices
-----------------------------

************************
Intel N3000 FPGA support
************************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/updating-an-intel-n3000-fpga-image
   hardware_acceleration_devices/n3000-fpga-forward-error-correction
   hardware_acceleration_devices/showing-details-for-an-fpga-device
   hardware_acceleration_devices/uploading-a-device-image
   hardware_acceleration_devices/common-device-management-tasks

******************************
Common device management tasks
******************************

.. toctree::
   :maxdepth: 2

   hardware_acceleration_devices/listing-uploaded-device-images
   hardware_acceleration_devices/listing-device-labels
   hardware_acceleration_devices/removing-a-device-image
   hardware_acceleration_devices/removing-a-device-label
   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
   hardware_acceleration_devices/displaying-the-status-of-device-images

***********************************************
vRAN Accelerator ACC100 Adapter (Mount Bryce)
***********************************************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/enabling-mount-bryce-hw-accelerator-for-hosted-vram-containerized-workloads
   hardware_acceleration_devices/set-up-pods-to-use-sriov

*******************
NVIDIA GPU Operator
*******************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough
   hardware_acceleration_devices/delete-the-gpu-operator

------------------------
Host hardware management
------------------------
||||||
|
@ -176,6 +176,10 @@ Verified and approved hardware components for use with |prod| are listed here.
|
|||||||
| Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | - ACC100 Adapter \(Mount Bryce\) - SRIOV only |
|
| Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | - ACC100 Adapter \(Mount Bryce\) - SRIOV only |
|
||||||
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| GPUs Verified for PCI Passthrough | - NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\) |
|
| GPUs Verified for PCI Passthrough | - NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\) |
|
||||||
|
| | |
|
||||||
|
| | - NVIDIA T4 TENSOR CORE GPU |
|
||||||
|
| | |
|
||||||
|
| | |
|
||||||
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| Board Management Controllers | - HPE iLO3 |
|
| Board Management Controllers | - HPE iLO3 |
|
||||||
| | |
|
| | |
|
||||||