NVIDIA AI Infrastructure NCP-AII Exam Questions

Page: 1 / 14
Total 71 questions
Question 1

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?



Answer : D

High-Performance Linpack (HPL) is the standard benchmark for stress-testing the computational stability and thermal endurance of an AI cluster. It solves a massive dense system of linear equations, and its mathematical configuration is highly sensitive. The HPL.dat configuration file defines the Problem Size ($N$) and the Block Size ($NB$). A fundamental requirement of the HPL algorithm is that the workload must be distributed evenly across the MPI processes and GPU threads. If the total matrix size $N$ is not an exact multiple of the block size $NB$, or if the grid dimensions ($P \times Q$) do not align with the hardware topology, the solver may encounter an 'illegal value' error or a 'residual too large' failure at the very beginning of the run. This is a configuration error, not a hardware fault. Reducing the precision (Option A) would invalidate the test, as HPL must run in FP64 to be considered a standard 'burn-in.' Verifying that $N$ is divisible by $NB$ ensures the mathematical integrity of the test while allowing the hardware to be pushed to its theoretical performance limits.
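The divisibility check above can be sketched as a small pre-flight script. This is an illustrative sketch, not an official tool; the values of NB and the requested N are example figures:

```shell
# Sketch: align the HPL.dat problem size with the block size before a
# burn-in run. NB and N_REQUESTED below are illustrative values.
NB=384                              # block size (NB) from HPL.dat
N_REQUESTED=290000                  # desired problem size (N)
N=$(( (N_REQUESTED / NB) * NB ))    # round down to an exact multiple of NB
echo "use N=$N in HPL.dat (N mod NB = $(( N % NB )))"
```

Rounding N down rather than up keeps the workload inside the originally planned memory footprint while restoring the even distribution the solver requires.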


Question 2

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?



Answer : C

UEFI Secure Boot is a security standard that ensures only digitally signed code is allowed to execute during the boot process. Since NVIDIA GPU drivers include kernel modules (nvidia.ko), they must be signed by a key trusted by the system's firmware. When drivers are installed on a DGX system with Secure Boot active, the installation process generates a unique Machine Owner Key (MOK). However, the Linux kernel will not trust this key until the user manually authenticates it at the 'Shim' level before the OS loads. Upon the first reboot after installation, the system enters the 'MOK Management' blue screen. The administrator must select 'Enroll MOK' and enter the temporary password created during the driver installation. Failing to do this results in the kernel rejecting the nvidia module, leading to an 'Unable to determine the device handle for GPU' error in nvidia-smi. Disabling Secure Boot (Option A) would resolve the symptom but violates the security posture of the AI infrastructure.
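The enrollment flow can be sketched with `mokutil`, the standard MOK management utility. The key path below is an assumption for illustration; the actual path and filename depend on how the driver package was installed:

```shell
# Hedged sketch of MOK enrollment; the .der key path is an example only.
KEY=/usr/share/nvidia/nvidia-modsign-crt.der
if command -v mokutil >/dev/null 2>&1; then
    mokutil --import "$KEY"     # prompts for a one-time enrollment password
    mokutil --list-new          # confirm the key is queued for enrollment
fi
# Reboot next; at the blue MOK Management screen choose "Enroll MOK" and
# enter the password set above, then confirm the kernel accepted the key:
echo 'after reboot: dmesg | grep -iE "nvidia|X\.509"'
```

The password entered at `mokutil --import` time is only valid for the single enrollment at the next boot, which is why skipping the MOK Management screen leaves the module permanently unsigned from the kernel's perspective.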


Question 3

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?



Answer : D

Deep learning training involves iterating over a dataset many times (epochs). If a 32-node cluster pulls the same dataset from a central NFS storage server for every epoch, the network and storage fabric quickly become a bottleneck due to 'Incast' traffic. By using the high-speed NVMe drives internal to a DGX system (configured in RAID-0 for maximum performance, not redundancy), the system can implement a local cache. During the first epoch, data is pulled from the remote storage and simultaneously written to the local SSDs. For all subsequent epochs, the training framework reads the data directly from the local RAID-0 array. This significantly reduces NFS traffic and network congestion, allowing the training to proceed at the full speed of the local NVMe storage (25 GB/s or more on modern DGX systems). Option C is incorrect because RAID-0 provides no redundancy; if a drive fails, the cache is lost, but since it is just a cache, the data still exists on the primary storage. Option B refers to GPUDirect Storage, which is a separate technology from local RAID-0 caching.
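The cache volume described above is typically assembled with `mdadm`. The sketch below uses assumed device names and is gated behind a dry-run flag, since `mdadm --create` destroys existing data on the listed devices:

```shell
# Sketch (assumed device names): stripe local NVMe drives into a RAID-0
# cache volume. Destructive to the listed devices, so gated behind DRY_RUN.
DRY_RUN=1
CACHE_MOUNT=/raid              # conventional DGX data-cache mount point
if [ "$DRY_RUN" -eq 0 ]; then
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    mkfs.ext4 /dev/md0
    mount /dev/md0 "$CACHE_MOUNT"
fi
echo "cache volume: $CACHE_MOUNT (RAID-0: striped for speed, no redundancy)"
```

Because the array holds only a re-fetchable cache, losing a member drive costs one re-read of the dataset from primary storage, not any data.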


Question 4

During HPL execution on a DGX cluster, the benchmark fails with "not enough memory" errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?



Answer : A

High-Performance Linpack (HPL) is a memory-intensive benchmark that allocates a large portion of available GPU memory to store the $N \times N$ matrix. While a server may have 2TB of physical system RAM, the 'not enough memory' error usually refers to the HBM (High Bandwidth Memory) on the GPUs themselves. In a DGX H100 system, each GPU has 80GB of HBM3. If the problem size ($N$) specified in the HPL.dat file is too large, the memory required for the matrix will exceed the aggregate capacity of the GPU memory. Reducing the problem size ($N$) while maintaining the optimal block size ($NB$) ensures that the problem fits within the GPU memory limits while still pushing the computational units to their peak performance. Increasing the block size (Option C) would actually increase the memory footprint of certain internal buffers, potentially worsening the issue. Reducing $N$ is the standard procedure to stabilize the run during the initial tuning phase of an AI cluster bring-up.
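The sizing rule above (the $N \times N$ FP64 matrix must fit in aggregate HBM, with headroom for HPL's work buffers) can be sketched as a quick calculation. The GPU count, HBM capacity, fill fraction, and NB below are example figures, not prescribed values:

```shell
# Sketch: compute the largest N whose FP64 matrix fits in aggregate GPU
# memory. All figures (GPU count, HBM size, fill fraction, NB) are examples.
GPUS=8; HBM_GB=80; FILL=0.85; NB=384
MAX_N=$(awk -v g=$GPUS -v hbm=$HBM_GB -v f=$FILL -v nb=$NB 'BEGIN {
    bytes = g * hbm * 1024^3 * f   # usable HBM across all GPUs, with headroom
    n = int(sqrt(bytes / 8))       # 8 bytes per FP64 matrix element
    print int(n / nb) * nb         # round down to a multiple of NB
}')
echo "largest safe N for HPL.dat: $MAX_N"
```

Starting from a conservative fill fraction and raising it in later tuning passes is the usual way to find peak performance without tripping the out-of-memory failure again.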


Question 5

A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?



Answer : C

NVIDIA Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture (A100) and enhanced in Hopper (H100), allows a single physical GPU to be partitioned into multiple isolated instances. To manage these instances and their scheduling behavior (such as moving from a default time-sliced scheduler to a fixed-share or 'MIG' partitioned mode), the nvidia-smi utility is used. The command nvidia-smi -i 0 -mig 1 enables MIG mode on the first GPU (index 0). Once MIG is enabled, the administrator can create specific GPU instances with dedicated compute and memory resources. This 'Fixed Share' approach ensures that one tenant's workload does not impact the performance or latency of another, which is critical for deterministic AI inference and multi-tenant cloud environments. Options A and B refer to VMware ESXi-specific commands, which are not the primary method for raw hardware configuration in standard AI infrastructure, and Option D is for network adapter configuration.
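The full workflow, from enabling MIG mode to carving out an instance, can be sketched as follows. The `1g.10gb` profile name is an example (an H100-class profile); available profiles vary by GPU:

```shell
# Sketch of the MIG workflow on GPU index 0; the instance profile is an
# example and requires admin privileges plus an idle GPU to apply.
GPU=0
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -i "$GPU" -mig 1                 # enable MIG mode on GPU 0
    nvidia-smi mig -i "$GPU" -cgi 1g.10gb -C    # create GPU + compute instance
    nvidia-smi -L                               # list the resulting MIG devices
fi
echo "GPU $GPU partitioned: each instance gets a fixed share of SMs and HBM"
```

Enabling MIG mode may require the GPU to be reset (or the node rebooted) before the instance-creation step succeeds.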


Question 6

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?



Answer : D

In a large-scale Spectrum-X Ethernet fabric, 'East-West' traffic refers to the cross-rack communication between compute nodes. To validate the 'Bisectional Bandwidth' (the throughput between two halves of the cluster), administrators use NCCL tests with specific environment variables to control traffic patterns. The NCCL_TESTS_SPLIT variable is used to partition the GPUs into distinct groups for the benchmark. Setting NCCL_TESTS_SPLIT='DIV 8' is a standard configuration for multi-node testing on 8-GPU systems. It effectively divides the total number of GPUs by the node count, creating a test environment where each GPU communicates with its corresponding rank on other nodes. By combining this with -g 1 (one GPU per process) across multiple nodes, the engineer can force data to travel across the leaf-and-spine switches rather than staying within the NVLink fabric of a single node. This isolates the physical network performance from the internal GPU-to-GPU bandwidth, providing a true measurement of the fabric's ability to handle high-speed AI traffic.
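A cross-rack run of this shape might look like the sketch below. The hostnames, rank counts, and message-size sweep are illustrative, and the binary path assumes a locally built copy of the nccl-tests suite:

```shell
# Sketch: cross-rack all_reduce bandwidth test with the split variable set
# as described above. Hostnames, rank counts, and sizes are examples.
export NCCL_TESTS_SPLIT="DIV 8"     # partition GPUs into test subgroups
if command -v mpirun >/dev/null 2>&1 && [ -x ./all_reduce_perf ]; then
    mpirun -np 16 -H rack1-node1:8,rack2-node1:8 \
        ./all_reduce_perf -b 1G -e 8G -f 2 -g 1   # one GPU per process
fi
echo "split mode: $NCCL_TESTS_SPLIT"
```

Comparing the reported bus bandwidth against the fabric's theoretical bisection figure then shows whether the leaf-spine links, rather than NVLink, are the limiting factor.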


Question 7

During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?



Answer : B

The error 'GPU fell off bus' is a critical failure where the PCIe link between the GPU and the CPU/PCIe Switch has collapsed, often due to thermal stress, power instability, or physical hardware defects. To isolate the root cause during an intensive workload like NVIDIA NeMo (Large Language Model framework), the administrator must collect high-fidelity telemetry. DCGM (Data Center GPU Manager) diagnostics are designed for exactly this scenario. By running dcgmi diag -r 3 (a comprehensive hardware stress test) or monitoring health via dcgmi health --check concurrently with the workload, the system can capture the exact moment parameters like PCIe replay counts, temperature spikes, or XID errors occur. This data allows the engineer to determine if a specific H100 module is faulty or if the issue is systemic (e.g., a failing PCIe switch on the motherboard). Lowering the workload (Option C or D) might hide the symptom, but it does not diagnose the hardware's inability to handle peak power and data throughput.
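The concurrent-telemetry approach above can be sketched with dcgmi. Group 0 is assumed here to be the default all-GPU group; the kernel-log filter is an illustrative hint, not an official command:

```shell
# Sketch: run DCGM diagnostics and health watches alongside the burn-in.
# Assumes group 0 is the default all-GPU group.
if command -v dcgmi >/dev/null 2>&1; then
    dcgmi diag -r 3            # level-3 diagnostic: extended hardware stress test
    dcgmi health -g 0 -s a     # enable health watches on all subsystems
    dcgmi health -g 0 -c       # poll current health (PCIe replays, thermals)
fi
LOG_HINT='dmesg --follow | grep -i "xid\|fell off"'
echo "watch the kernel log concurrently: $LOG_HINT"
```

Correlating the timestamp of an XID or 'fell off bus' event with the DCGM field values captured at that moment is what distinguishes a failing GPU module from a systemic board-level fault.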

