NVIDIA AI Infrastructure NCP-AII Exam Questions

Page: 1 / 14
Total 71 questions
Question 1

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?



Answer : D

High-Performance Linpack (HPL) is the standard benchmark for stress-testing the computational stability and thermal endurance of an AI cluster. It solves a massive dense system of linear equations, and its mathematical configuration is highly sensitive. The HPL.dat configuration file defines the Problem Size ($N$) and the Block Size ($NB$). A fundamental requirement of the HPL algorithm is that the workload must be distributed evenly across the MPI processes and GPU threads. If the total matrix size $N$ is not an exact multiple of the block size $NB$, or if the grid dimensions ($P \times Q$) do not align with the hardware topology, the solver may encounter an 'illegal value' error or a 'residual too large' failure at the very beginning of the run. This is a configuration error, not a hardware fault. Reducing the precision (Option A) would invalidate the test, as HPL must run in FP64 to be considered a standard 'burn-in.' Verifying that $N$ is divisible by $NB$ ensures the mathematical integrity of the test while allowing the hardware to be pushed to its theoretical performance limits.
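The divisibility check above can be sketched as a small pre-flight script. This is an illustrative sketch, not an official tool; the values of NB and the requested N are example figures:

```shell
# Sketch: align the HPL.dat problem size with the block size before a
# burn-in run. NB and N_REQUESTED below are illustrative values.
NB=384                              # block size (NB) from HPL.dat
N_REQUESTED=290000                  # desired problem size (N)
N=$(( (N_REQUESTED / NB) * NB ))    # round down to an exact multiple of NB
echo "use N=$N in HPL.dat (N mod NB = $(( N % NB )))"
```

Rounding N down rather than up keeps the workload inside the originally planned memory footprint while restoring the even distribution the solver requires.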


Question 2

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?



Answer : C

UEFI Secure Boot is a security standard that ensures only digitally signed code is allowed to execute during the boot process. Since NVIDIA GPU drivers include kernel modules (nvidia.ko), they must be signed by a key trusted by the system's firmware. When drivers are installed on a DGX system with Secure Boot active, the installation process generates a unique Machine Owner Key (MOK). However, the Linux kernel will not trust this key until the user manually authenticates it at the 'Shim' level before the OS loads. Upon the first reboot after installation, the system enters the 'MOK Management' blue screen. The administrator must select 'Enroll MOK' and enter the temporary password created during the driver installation. Failing to do this results in the kernel rejecting the nvidia module, leading to an 'Unable to determine the device handle for GPU' error in nvidia-smi. Disabling Secure Boot (Option A) would resolve the symptom but violates the security posture of the AI infrastructure.
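The enrollment flow can be sketched with `mokutil`, the standard MOK management utility. The key path below is an assumption for illustration; the actual path and filename depend on how the driver package was installed:

```shell
# Hedged sketch of MOK enrollment; the .der key path is an example only.
KEY=/usr/share/nvidia/nvidia-modsign-crt.der
if command -v mokutil >/dev/null 2>&1; then
    mokutil --import "$KEY"     # prompts for a one-time enrollment password
    mokutil --list-new          # confirm the key is queued for enrollment
fi
# Reboot next; at the blue MOK Management screen choose "Enroll MOK" and
# enter the password set above, then confirm the kernel accepted the key:
echo 'after reboot: dmesg | grep -iE "nvidia|X\.509"'
```

The password entered at `mokutil --import` time is only valid for the single enrollment at the next boot, which is why skipping the MOK Management screen leaves the module permanently unsigned from the kernel's perspective.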


Question 3

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?



Answer : D

Deep learning training involves iterating over a dataset many times (epochs). If a 32-node cluster pulls the same dataset from a central NFS storage server for every epoch, the network and storage fabric quickly become a bottleneck due to 'Incast' traffic. By using the high-speed NVMe drives internal to a DGX system (configured in RAID-0 for maximum performance, not redundancy), the system can implement a local cache. During the first epoch, data is pulled from the remote storage and simultaneously written to the local SSDs. For all subsequent epochs, the training framework reads the data directly from the local RAID-0 array. This significantly reduces NFS traffic and network congestion, allowing the training to proceed at the full speed of the local NVMe storage (25 GB/s or more on modern DGX systems). Option C is incorrect because RAID-0 provides no redundancy; if a drive fails, the cache is lost, but since it is just a cache, the data still exists on the primary storage. Option B refers to GPUDirect Storage, which is a separate technology from local RAID-0 caching.
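The cache volume described above is typically assembled with `mdadm`. The sketch below uses assumed device names and is gated behind a dry-run flag, since `mdadm --create` destroys existing data on the listed devices:

```shell
# Sketch (assumed device names): stripe local NVMe drives into a RAID-0
# cache volume. Destructive to the listed devices, so gated behind DRY_RUN.
DRY_RUN=1
CACHE_MOUNT=/raid              # conventional DGX data-cache mount point
if [ "$DRY_RUN" -eq 0 ]; then
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    mkfs.ext4 /dev/md0
    mount /dev/md0 "$CACHE_MOUNT"
fi
echo "cache volume: $CACHE_MOUNT (RAID-0: striped for speed, no redundancy)"
```

Because the array holds only a re-fetchable cache, losing a member drive costs one re-read of the dataset from primary storage, not any data.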


Question 4

During HPL execution on a DGX cluster, the benchmark fails with "not enough memory" errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?



Answer : A

High-Performance Linpack (HPL) is a memory-intensive benchmark that allocates a large portion of available GPU memory to store the $N \times N$ matrix. While a server may have 2TB of physical system RAM, the 'not enough memory' error usually refers to the HBM (High Bandwidth Memory) on the GPUs themselves. In a DGX H100 system, each GPU has 80GB of HBM3. If the problem size ($N$) specified in the HPL.dat file is too large, the memory required for the matrix will exceed the aggregate capacity of the GPU memory. Reducing the problem size ($N$) while maintaining the optimal block size ($NB$) ensures that the problem fits within the GPU memory limits while still pushing the computational units to their peak performance. Increasing the block size (Option C) would actually increase the memory footprint of certain internal buffers, potentially worsening the issue. Reducing $N$ is the standard procedure to stabilize the run during the initial tuning phase of an AI cluster bring-up.
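The sizing rule above (the $N \times N$ FP64 matrix must fit in aggregate HBM, with headroom for HPL's work buffers) can be sketched as a quick calculation. The GPU count, HBM capacity, fill fraction, and NB below are example figures, not prescribed values:

```shell
# Sketch: compute the largest N whose FP64 matrix fits in aggregate GPU
# memory. All figures (GPU count, HBM size, fill fraction, NB) are examples.
GPUS=8; HBM_GB=80; FILL=0.85; NB=384
MAX_N=$(awk -v g=$GPUS -v hbm=$HBM_GB -v f=$FILL -v nb=$NB 'BEGIN {
    bytes = g * hbm * 1024^3 * f   # usable HBM across all GPUs, with headroom
    n = int(sqrt(bytes / 8))       # 8 bytes per FP64 matrix element
    print int(n / nb) * nb         # round down to a multiple of NB
}')
echo "largest safe N for HPL.dat: $MAX_N"
```

Starting from a conservative fill fraction and raising it in later tuning passes is the usual way to find peak performance without tripping the out-of-memory failure again.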


Question 5

A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?



Answer : C

NVIDIA Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture (A100) and enhanced in Hopper (H100), allows a single physical GPU to be partitioned into multiple isolated instances. To manage these instances and their scheduling behavior (such as moving from a default time-sliced scheduler to a fixed-share or 'MIG' partitioned mode), the nvidia-smi utility is used. The command nvidia-smi -i 0 -mig 1 enables MIG mode on the first GPU (index 0). Once MIG is enabled, the administrator can create specific GPU instances with dedicated compute and memory resources. This 'Fixed Share' approach ensures that one tenant's workload does not impact the performance or latency of another, which is critical for deterministic AI inference and multi-tenant cloud environments. Options A and B refer to VMware ESXi-specific commands, which are not the primary method for raw hardware configuration in standard AI infrastructure, and Option D is for network adapter configuration.
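The full workflow, from enabling MIG mode to carving out an instance, can be sketched as follows. The `1g.10gb` profile name is an example (an H100-class profile); available profiles vary by GPU:

```shell
# Sketch of the MIG workflow on GPU index 0; the instance profile is an
# example and requires admin privileges plus an idle GPU to apply.
GPU=0
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -i "$GPU" -mig 1                 # enable MIG mode on GPU 0
    nvidia-smi mig -i "$GPU" -cgi 1g.10gb -C    # create GPU + compute instance
    nvidia-smi -L                               # list the resulting MIG devices
fi
echo "GPU $GPU partitioned: each instance gets a fixed share of SMs and HBM"
```

Enabling MIG mode may require the GPU to be reset (or the node rebooted) before the instance-creation step succeeds.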


Question 6

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?



Answer : D

In a large-scale Spectrum-X Ethernet fabric, 'East-West' traffic refers to the cross-rack communication between compute nodes. To validate the 'Bisectional Bandwidth' (the throughput between two halves of the cluster), administrators use NCCL tests with specific environment variables to control traffic patterns. The NCCL_TESTS_SPLIT variable is used to partition the GPUs into distinct groups for the benchmark. Setting NCCL_TESTS_SPLIT='DIV 8' is a standard configuration for multi-node testing on 8-GPU systems. It effectively divides the total number of GPUs by the node count, creating a test environment where each GPU communicates with its corresponding rank on other nodes. By combining this with -g 1 (one GPU per process) across multiple nodes, the engineer can force data to travel across the leaf-and-spine switches rather than staying within the NVLink fabric of a single node. This isolates the physical network performance from the internal GPU-to-GPU bandwidth, providing a true measurement of the fabric's ability to handle high-speed AI traffic.
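A cross-rack run of this shape might look like the sketch below. The hostnames, rank counts, and message-size sweep are illustrative, and the binary path assumes a locally built copy of the nccl-tests suite:

```shell
# Sketch: cross-rack all_reduce bandwidth test with the split variable set
# as described above. Hostnames, rank counts, and sizes are examples.
export NCCL_TESTS_SPLIT="DIV 8"     # partition GPUs into test subgroups
if command -v mpirun >/dev/null 2>&1 && [ -x ./all_reduce_perf ]; then
    mpirun -np 16 -H rack1-node1:8,rack2-node1:8 \
        ./all_reduce_perf -b 1G -e 8G -f 2 -g 1   # one GPU per process
fi
echo "split mode: $NCCL_TESTS_SPLIT"
```

Comparing the reported bus bandwidth against the fabric's theoretical bisection figure then shows whether the leaf-spine links, rather than NVLink, are the limiting factor.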


Question 7

During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?



Answer : B

The error 'GPU fell off bus' is a critical failure where the PCIe link between the GPU and the CPU/PCIe Switch has collapsed, often due to thermal stress, power instability, or physical hardware defects. To isolate the root cause during an intensive workload like NVIDIA NeMo (Large Language Model framework), the administrator must collect high-fidelity telemetry. DCGM (Data Center GPU Manager) diagnostics are designed for exactly this scenario. By running dcgmi diag -r 3 (a comprehensive hardware stress test) or monitoring health via dcgmi health --check concurrently with the workload, the system can capture the exact moment parameters like PCIe replay counts, temperature spikes, or XID errors occur. This data allows the engineer to determine if a specific H100 module is faulty or if the issue is systemic (e.g., a failing PCIe switch on the motherboard). Lowering the workload (Option C or D) might hide the symptom, but it does not diagnose the hardware's inability to handle peak power and data throughput.
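The concurrent-telemetry approach above can be sketched with dcgmi. Group 0 is assumed here to be the default all-GPU group; the kernel-log filter is an illustrative hint, not an official command:

```shell
# Sketch: run DCGM diagnostics and health watches alongside the burn-in.
# Assumes group 0 is the default all-GPU group.
if command -v dcgmi >/dev/null 2>&1; then
    dcgmi diag -r 3            # level-3 diagnostic: extended hardware stress test
    dcgmi health -g 0 -s a     # enable health watches on all subsystems
    dcgmi health -g 0 -c       # poll current health (PCIe replays, thermals)
fi
LOG_HINT='dmesg --follow | grep -i "xid\|fell off"'
echo "watch the kernel log concurrently: $LOG_HINT"
```

Correlating the timestamp of an XID or 'fell off bus' event with the DCGM field values captured at that moment is what distinguishes a failing GPU module from a systemic board-level fault.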

