NVIDIA AI Infrastructure NCP-AII Exam Questions

Page: 1 / 14
Total 71 questions
Question 1

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?



Answer : C

In large-scale InfiniBand fabrics, such as those in NVIDIA DGX SuperPODs, maintaining an exact cabling topology is mandatory for the Adaptive Routing and Fat-Tree algorithms to function correctly. A 'Wrong-neighbor' error occurs when the Unified Fabric Manager (UFM) detects that a cable is connected to a port other than the one specified in the master topology map (often a .csv or .topology file). UFM uses LLDP (Link Layer Discovery Protocol) or Subnet Management packets to identify the GUIDs on both ends of a link. The most efficient remediation is to cross-reference the live LLDP data provided by UFM with the intended design. This lets the engineer determine whether the error is a physical mis-cabling (swapped ports) or a logical error in the topology file. Rebooting switches (Option A) will not fix a physical patch error, and disabling FEC (Option D) would cause severe uncorrected bit errors on 400G (NDR) links without addressing the underlying cabling mismatch. Correcting the physical patch or updating the topology file brings the fabric back in line with its 'Ground Truth'.
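The cross-referencing step described above can be sketched in Python. This is only an illustration: the dictionary layout (a port mapped to its neighbor port) and the names `classify_links` and `swapped_pairs` are assumptions made for this example, not the UFM export format or any UFM API.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a port is (device name or GUID, port number).
Port = Tuple[str, int]


def classify_links(planned: Dict[Port, Port],
                   live: Dict[Port, Port]) -> Dict[str, List[Port]]:
    """Compare the planned neighbor of every port with the discovered one."""
    report: Dict[str, List[Port]] = {"ok": [], "wrong_neighbor": [], "missing": []}
    for port, expected in planned.items():
        actual = live.get(port)
        if actual is None:
            report["missing"].append(port)          # no link discovered at all
        elif actual == expected:
            report["ok"].append(port)
        else:
            report["wrong_neighbor"].append(port)   # cabled to a different port
    return report


def swapped_pairs(planned: Dict[Port, Port],
                  live: Dict[Port, Port]) -> List[Tuple[Port, Port]]:
    """Find pairs of ports whose neighbors are exchanged (two crossed cables)."""
    wrong = classify_links(planned, live)["wrong_neighbor"]
    pairs = []
    for i, a in enumerate(wrong):
        for b in wrong[i + 1:]:
            if live[a] == planned[b] and live[b] == planned[a]:
                pairs.append((a, b))
    return pairs
```

A wrong-neighbor port that appears in a swapped pair points to a physical patch error. One that does not may indicate a mistake in the topology file itself.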


Question 2

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager (BCM) cluster. Which two of the following actions are essential for a successful OS installation on the cluster's head node? (Pick the 2 correct responses below)



Answer : B, D

Setting up the head node is the foundational step in building an NVIDIA Base Command Manager cluster. Option B is essential because BCM is typically deployed via a specialized 'installer ISO' that contains the customized OS (RHEL or Ubuntu base) and the BCM (formerly Bright Cluster Manager) software stack. Verifying the checksum ensures that no corruption occurred during download, preventing hard-to-diagnose installation failures. Option D is equally critical because AI clusters rely heavily on synchronized clocks for log aggregation, time-sensitive authentication (for example, Kerberos tickets in LDAP/Active Directory environments), and performance monitoring across multiple nodes. If NTP (Network Time Protocol) is not configured during the initial wizard, the resulting time drift can cause the cmdaemon to fail synchronization between the head node and the compute nodes. Modern NVIDIA DGX systems require UEFI boot; therefore, Legacy mode (Option C) is incorrect. PXE configuration (Option A) is a post-installation task that is managed by the head node once the BCM software is already running.
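The checksum verification mentioned above can be sketched as follows. The hash algorithm is an assumption here (SHA-256 is used for illustration; use whichever algorithm the published checksum actually specifies), and the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a large file in chunks so the whole ISO is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as iso:
        for chunk in iter(lambda: iso.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path: str, published_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case and padding."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

If `iso_matches` returns `False`, the image should be downloaded again before it is written to installation media.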


Question 3

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?



Answer : D

Updating an NVIDIA DGX system (like the H100) is a multi-layered process because the system contains numerous programmable logic devices, including CPLDs, FPGAs, and the EROT (External Root of Trust) modules. Many of these low-level hardware components cannot be updated via a simple operating system reboot. NVIDIA's official firmware update procedure requires a specific sequence to 'commit' the new images to the hardware. First, the update utility (like nvfwupd) writes the images to the flash memory. To activate them, a 'Cold Power Cycle' (removing and restoring power) is necessary to force the hardware to reload from the newly written flash blocks. Furthermore, because the BMC (Baseboard Management Controller) orchestrates the power-on sequence and monitors the EROT, it must be reset (Option D) to synchronize its state with the new component versions. Finally, an 'AC Power Cycle' ensures that even the standby-power components, such as the power delivery controllers and CPLDs, undergo a full hardware reset. Skipping these steps can result in 'Incomplete' or 'Mismatched' firmware versions, where the OS reports one version while the hardware continues to run old, potentially buggy code in the background.
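A post-update check for the 'Mismatched' condition described above might look like the following sketch. The component names and version strings in the usage note are made up for illustration, and `firmware_mismatches` is a hypothetical helper, not part of nvfwupd.

```python
from typing import Dict, Optional, Tuple


def firmware_mismatches(
    expected: Dict[str, str],
    running: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each component to (expected, running) when the two versions differ.

    A component that is absent from `running` is reported with None.
    """
    return {
        name: (want, running.get(name))
        for name, want in expected.items()
        if running.get(name) != want
    }
```

For example, `firmware_mismatches({"BMC": "2.0", "CPLD": "1.5"}, {"BMC": "2.0", "CPLD": "1.4"})` reports only the CPLD entry, which would suggest that the power-cycle steps have not yet taken effect for that component.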


Question 4

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?



Answer : B

On an NVIDIA DGX system, the Baseboard Management Controller (BMC) is an independent processor that runs even if the main CPU and Operating System fail to load. If a server reboots and the administrator can access the BMC web interface or IPMI console, but the OS (Ubuntu/DGX OS) does not load, the most likely cause is a boot disk failure. The DGX H100 uses NVMe drives in a RAID-1 configuration for the OS boot volume. If both drives in the mirror fail, or if the boot partition becomes corrupted, the system will hang at the BIOS or UEFI prompt, unable to find a bootable device. While failed power supplies (Option D) or network links (Option A) can cause issues, they would typically prevent the BMC from being reachable at all or prevent remote network traffic respectively. A GPU failure (Option C) would not stop the OS from booting; the system would simply boot with a degraded GPU count. Therefore, checking the storage health via the BMC 'Storage' logs is the correct diagnostic step.


Question 5

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?



Answer : D

Deep learning training involves iterating over a dataset many times (epochs). If a 32-node cluster pulls the same dataset from a central NFS storage server for every epoch, the network and storage fabric quickly become a bottleneck due to 'Incast' traffic. By using the high-speed NVMe drives internal to a DGX system (configured in RAID-0 for maximum performance, not redundancy), the system can implement a local cache. During the first epoch, data is pulled from the remote storage and simultaneously written to the local SSDs. For all subsequent epochs, the training framework reads the data directly from the local RAID-0 array. This significantly reduces NFS traffic and network congestion, allowing the training to proceed at the full speed of the local NVMe storage (25 GB/s or more on modern DGX systems). Option C is incorrect because RAID-0 provides no redundancy; if a drive fails, the cache is lost, but since it is just a cache, the data still exists on the primary storage. Option B refers to GPUDirect Storage, which is a separate technology from local RAID-0 caching.
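The first-epoch-fills, later-epochs-read pattern described above is a read-through cache. A minimal sketch, assuming the remote dataset and the local RAID-0 array are both mounted as ordinary directories (the function name and the `.partial` suffix are choices made for this example, not how any DGX caching layer is implemented):

```python
import shutil
from pathlib import Path


def cached_path(remote_file: Path, remote_root: Path, cache_root: Path) -> Path:
    """Return a local copy of remote_file, copying it on first access only."""
    local_file = cache_root / remote_file.relative_to(remote_root)
    if not local_file.exists():
        local_file.parent.mkdir(parents=True, exist_ok=True)
        # Copy to a temporary name first, then rename, so a reader never
        # sees a half-written file under the final name.
        partial = local_file.with_name(local_file.name + ".partial")
        shutil.copyfile(remote_file, partial)
        partial.replace(local_file)
    return local_file
```

A data loader would call `cached_path` for each file it opens. The first epoch pays the NFS cost once per file, and every later epoch reads from local storage. Because the cache holds only copies, losing the RAID-0 array costs a re-download, not data.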


Question 6

After running a 24-hour stress test on a DGX node, the administrator should verify which two key metrics to ensure system stability?



Answer : B

A 24-hour stress test (using tools like HPL or the NCCL tests) is designed to push the thermal and electrical limits of a DGX system. To verify a 'Pass,' the administrator must ensure that the hardware maintained its performance targets without degradation. Consistent GPU utilization >95% confirms that the workload successfully saturated the compute cores for the entire duration. Crucially, the absence of thermal throttling events (verified via nvidia-smi -q -d PERFORMANCE) confirms that the system's cooling solution (fans and heatsinks) is adequate for the environment; if throttling occurred, the GPUs would have slowed down to protect themselves, indicating a potential cooling failure or environmental heat issue. While power consumption (Option D) and CPU usage (Option A) are worth recording, they are not the primary indicators of 'Stability' under extreme AI training loads. System stability is defined by the ability to run at peak speeds indefinitely without hardware-level interventions or slowdowns.
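The pass criteria above (sustained utilization above 95% and no thermal throttling) can be expressed as a simple check over collected samples. The `GpuSample` structure is hypothetical. How the samples are gathered, for example by periodically parsing nvidia-smi output, is outside this sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GpuSample:
    """One monitoring sample for one GPU (hypothetical structure)."""
    gpu_index: int
    utilization_pct: float
    thermal_throttled: bool


def stress_test_passed(samples: Iterable[GpuSample],
                       min_utilization: float = 95.0) -> bool:
    """Pass only if every sample is above the threshold and unthrottled."""
    collected = list(samples)
    if not collected:
        return False  # no evidence is not a pass
    return all(
        s.utilization_pct > min_utilization and not s.thermal_throttled
        for s in collected
    )
```

A single throttled sample fails the run, which matches the idea that any hardware-level slowdown during the test counts against stability.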


Question 7

An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU's BMC. Which Redfish API command provides this information?



Answer : D

Modern NVIDIA BlueField DPUs include an integrated Baseboard Management Controller (BMC) that supports the industry-standard Redfish API for out-of-band management. While CLI tools like mlxconfig (Option A) or mstflint (Option C) can be used from the host OS to check the NIC firmware, they cannot easily query the BMC-specific components like the ARM Trusted Firmware (ATF), the Board Support Package (BSP), or the UEFI bootloader of the DPU. The Redfish standard specifies a common URI for hardware inventory. The FirmwareInventory endpoint (Option D) is the correct RESTful path to retrieve a comprehensive JSON object containing the versioning details for all firmware-controllable components on the DPU. This is the preferred method for automated data center management systems (like NVIDIA Base Command Manager) to verify that DPUs are at the correct 'Golden Image' version during the staging phase. Note that 'FirmwareList' (Option B) is not a standard Redfish URI for this specific data.
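As an illustration of the FirmwareInventory query, the sketch below builds the standard Redfish collection URL and extracts member links and versions from JSON responses. It deliberately omits the HTTP request and authentication. The host name and member IDs in the usage note are placeholders, not actual BlueField-3 component names.

```python
import json
from typing import List, Optional, Tuple

# Standard Redfish location of the firmware inventory collection.
FIRMWARE_INVENTORY_PATH = "/redfish/v1/UpdateService/FirmwareInventory"


def firmware_inventory_url(bmc_host: str) -> str:
    """Build the collection URL for a given BMC address."""
    return f"https://{bmc_host}{FIRMWARE_INVENTORY_PATH}"


def member_paths(collection_json: str) -> List[str]:
    """Extract the @odata.id link of every member in a Redfish collection."""
    collection = json.loads(collection_json)
    return [member["@odata.id"] for member in collection.get("Members", [])]


def firmware_version(item_json: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (Id, Version) from a single inventory resource."""
    item = json.loads(item_json)
    return item.get("Id"), item.get("Version")
```

A client would GET `firmware_inventory_url("dpu-bmc.example")`, pass the body to `member_paths`, then GET each returned path and read its version with `firmware_version`.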

