New Content for NVIDIA T4 GPU Support
Created 2 new topics in Node management - HW Acceleration Devices:

- Configure NVIDIA GPU Operator for PCI Passthrough
- Delete the GPU Operator

Patch 4: Added NVIDIA information in Planning - Verified Comm HW
Patch 5: Acted on Greg's comment
Patch 6: Updated Index as requested in review; worked on comments from Ghada
Patches 7 and 8: Acted on Mary's comments; added 'release-caveat'; acted on Ron's comments

Story: 2008434
Task: 42220
https://review.opendev.org/c/starlingx/docs/+/785251
Signed-off-by: Adil <mohamed.adilassakkali@windriver.com>
Change-Id: I337e33e805d89621436b35c238aca800b0727e0b

Parent: 1521b4c4a9
Commit: 3053ff6e40
.. fgy1616003207054

.. _configure-nvidia-gpu-operator-for-pci-passthrough:

=================================================
Configure NVIDIA GPU Operator for PCI Passthrough
=================================================

|release-caveat|

This section provides instructions for configuring the NVIDIA GPU Operator.

.. rubric:: |context|

.. note::

   NVIDIA GPU Operator is supported only for the standard performance
   kernel profile. The low-latency performance kernel profile is not
   supported.

NVIDIA GPU Operator automates the installation, maintenance, and management
of the NVIDIA software needed to provision NVIDIA GPUs, and the provisioning
of pods that require nvidia.com/gpu resources.

NVIDIA GPU Operator is delivered as a Helm chart that installs a number of
services and pods to automate the provisioning of NVIDIA GPUs with the
required NVIDIA software components. These components include:

.. _fgy1616003207054-ul-sng-blk-z4b:

- NVIDIA drivers (to enable CUDA, a parallel computing platform)

- Kubernetes device plugin for GPUs

- NVIDIA Container Runtime

- Automatic node labelling

- DCGM (NVIDIA Data Center GPU Manager) based monitoring
.. rubric:: |prereq|

Download the **gpu-operator-v3-1.6.0.3.tgz** file at
`http://mirror.starlingx.cengn.ca/mirror/starlingx/
<http://mirror.starlingx.cengn.ca/mirror/starlingx/>`__.

Use the following steps to configure the GPU Operator container:
.. rubric:: |proc|

#. Lock the host(s).

   .. code-block:: none

      ~(keystone_admin)]$ system host-lock <hostname>

#. Configure the container runtime host path to point to the NVIDIA runtime,
   which will be installed by the GPU Operator Helm deployment.

   .. code-block:: none

      ~(keystone_admin)]$ system service-parameter-add platform container_runtime custom_container_runtime=nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime

#. Unlock the host(s). Once unlocked, the host reboots automatically.

   .. code-block:: none

      ~(keystone_admin)]$ system host-unlock <hostname>

#. Create the RuntimeClass resource definition and apply it to the system.

   .. code-block:: none

      cat > nvidia.yml << EOF
      kind: RuntimeClass
      apiVersion: node.k8s.io/v1beta1
      metadata:
        name: nvidia
      handler: nvidia
      EOF

   .. code-block:: none

      ~(keystone_admin)]$ kubectl apply -f nvidia.yml
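   On Kubernetes 1.20 and later, RuntimeClass is also available through the
   stable ``node.k8s.io/v1`` API, and the ``v1beta1`` API was removed in
   Kubernetes 1.25. If the apply fails on a newer Kubernetes version (an
   assumption about the cluster in use), the same definition can be written
   against the ``v1`` API:

   .. code-block:: yaml

      # Same RuntimeClass as above, using the stable v1 API
      # (assumes Kubernetes 1.20 or later).
      kind: RuntimeClass
      apiVersion: node.k8s.io/v1
      metadata:
        name: nvidia
      handler: nvidia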
#. Install the GPU Operator Helm charts.

   .. code-block:: none

      ~(keystone_admin)]$ helm install --name gpu-operator /path/to/gpu-operator-1.6.0.3.tgz

#. Check that the GPU Operator is deployed using the following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl get pods -A
      NAMESPACE               NAME      READY   STATUS      RESTARTS   AGE
      default                 g-node..  1/1     Running     1          7h54m
      default                 g-node..  1/1     Running     1          7h54m
      default                 gpu-ope.  1/1     Running     1          7h54m
      gpu-operator-resources  gpu-..    1/1     Running     4          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m
      gpu-operator-resources  nvidia..  0/1     Completed   0          7h53m
      gpu-operator-resources  nvidia..  1/1     Running     0          28m

   The plugin validation pod is marked Completed.

#. Check that the nvidia.com/gpu resources are available using the following
   command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl describe nodes <hostname> | grep nvidia
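   If the GPU Operator provisioned the node successfully, the filtered output
   should include ``nvidia.com/gpu`` entries from the node's Capacity and
   Allocatable sections, similar to the following (illustrative only; the
   count depends on the number of GPUs on the host):

   .. code-block:: none

      nvidia.com/gpu:  1
      nvidia.com/gpu:  1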
#. Create a pod that uses the NVIDIA RuntimeClass and requests an
   nvidia.com/gpu resource. Update the nvidia-usage-example-pod.yml file to
   launch a pod that uses an NVIDIA GPU. For example:

   .. code-block:: none

      cat <<EOF > nvidia-usage-example-pod.yml
      apiVersion: v1
      kind: Pod
      metadata:
        name: nvidia-usage-example-pod
      spec:
        runtimeClassName: nvidia
        containers:
        - name: nvidia-usage-example-pod
          image: nvidia/samples:cuda10.2-vectorAdd
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 300000; done;" ]
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      EOF
#. Create the pod using the following command.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl create -f nvidia-usage-example-pod.yml

#. Check that the pod has been set up correctly. The status of the NVIDIA
   device is displayed in the table.

   .. code-block:: none

      ~(keystone_admin)]$ kubectl exec -it nvidia-usage-example-pod -- nvidia-smi
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
      | N/A   28C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+

For information on deleting the GPU Operator, see :ref:`Delete the GPU
Operator <delete-the-gpu-operator>`.
.. nsr1616019467549

.. _delete-the-gpu-operator:

=======================
Delete the GPU Operator
=======================

|release-caveat|

Use the commands in this section to delete the GPU Operator, if required.

.. rubric:: |prereq|

Ensure that all user-generated pods with access to ``nvidia.com/gpu``
resources are deleted first.

.. rubric:: |proc|

#. Remove the GPU Operator pods from the system using the following commands:

   .. code-block:: none

      ~(keystone_admin)]$ helm delete --purge gpu-operator
      ~(keystone_admin)]$ kubectl delete runtimeclasses.node.k8s.io nvidia
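   .. note::

      ``helm delete --purge`` is Helm 2 syntax. If the system is running
      Helm 3 (an assumption; check with ``helm version``), the ``--purge``
      flag no longer exists and the equivalent command is:

      .. code-block:: none

         ~(keystone_admin)]$ helm uninstall gpu-operator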
#. Remove the service parameter ``platform container_runtime
   custom_container_runtime`` from the system using the following steps:

   #. Lock the host(s).

      .. code-block:: none

         ~(keystone_admin)]$ system host-lock <hostname>

   #. List the service parameters using the following command.

      .. code-block:: none

         ~(keystone_admin)]$ system service-parameter-list

   #. Remove the service parameter using the following command.

      .. code-block:: none

         ~(keystone_admin)]$ system service-parameter-delete <service param ID>

      where ``<service param ID>`` is the ID of the service parameter, for
      example, 3c509c97-92a6-4882-a365-98f1599a8f56.

   #. Unlock the host(s).

      .. code-block:: none

         ~(keystone_admin)]$ system host-unlock <hostname>

For information on configuring the GPU Operator, see :ref:`Configure NVIDIA
GPU Operator for PCI Passthrough
<configure-nvidia-gpu-operator-for-pci-passthrough>`.
Hardware acceleration devices
-----------------------------

************************
Intel N3000 FPGA support
************************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/updating-an-intel-n3000-fpga-image
   hardware_acceleration_devices/n3000-fpga-forward-error-correction
   hardware_acceleration_devices/showing-details-for-an-fpga-device
   hardware_acceleration_devices/uploading-a-device-image
   hardware_acceleration_devices/common-device-management-tasks

******************************
Common device management tasks
******************************

.. toctree::
   :maxdepth: 2

   hardware_acceleration_devices/listing-uploaded-device-images
   hardware_acceleration_devices/listing-device-labels
   hardware_acceleration_devices/removing-a-device-image
   hardware_acceleration_devices/removing-a-device-label
   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
   hardware_acceleration_devices/displaying-the-status-of-device-images

***********************************************
vRAN Accelerator ACC100 Adapter (Mount Bryce)
***********************************************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/enabling-mount-bryce-hw-accelerator-for-hosted-vram-containerized-workloads
   hardware_acceleration_devices/set-up-pods-to-use-sriov

*******************
NVIDIA GPU Operator
*******************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough
   hardware_acceleration_devices/delete-the-gpu-operator

------------------------
Host hardware management
------------------------
||||||
|
@ -176,6 +176,10 @@ Verified and approved hardware components for use with |prod| are listed here.
|
|||||||
| Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | - ACC100 Adapter \(Mount Bryce\) - SRIOV only |
|
| Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | - ACC100 Adapter \(Mount Bryce\) - SRIOV only |
|
||||||
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| GPUs Verified for PCI Passthrough | - NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\) |
|
| GPUs Verified for PCI Passthrough | - NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\) |
|
||||||
|
| | |
|
||||||
|
| | - NVIDIA T4 TENSOR CORE GPU |
|
||||||
|
| | |
|
||||||
|
| | |
|
||||||
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| Board Management Controllers | - HPE iLO3 |
|
| Board Management Controllers | - HPE iLO3 |
|
||||||
| | |
|
| | |
|
||||||