NVIDIA AI Operations NCP-AIO Exam Practice Test

Page: 1 / 14
Total 66 questions
Question 1

You have successfully pulled a TensorFlow container from NGC and now need to run it on your stand-alone GPU-enabled server.

Which command should you use to ensure that the container has access to all available GPUs?



Answer : D

Comprehensive and Detailed Explanation From Exact Extract:

When running a GPU-enabled container directly on a server with Docker, the flag --gpus all is required to allow the container access to all GPUs on the host system. This ensures that the TensorFlow container can utilize GPU resources fully. The other options either do not specify GPU access correctly or are Kubernetes-specific commands.
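As a sketch, the invocation would look like the following (the image tag is illustrative; substitute whatever tag you pulled from NGC):

```shell
# Run the NGC TensorFlow container with access to all GPUs on the host.
# The tag 24.03-tf2-py3 is an example -- use the tag you actually pulled.
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:24.03-tf2-py3

# Quick sanity check: run nvidia-smi inside the container to confirm
# that every host GPU is visible.
docker run --gpus all --rm nvcr.io/nvidia/tensorflow:24.03-tf2-py3 nvidia-smi
```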


Question 2

A system administrator is looking to set up virtual machines in an HGX environment with NVIDIA Fabric Manager.

What three (3) tasks will Fabric Manager accomplish? (Choose three.)



Answer : A, C, D

Comprehensive and Detailed Explanation From Exact Extract:

NVIDIA Fabric Manager is responsible for managing the fabric interconnect in HGX systems, including:

Configuring routing among NVSwitch ports (A) to optimize communication paths.

Coordinating with the NVSwitch driver to train NVSwitch-to-NVSwitch NVLink interconnects (C) for high-speed link setup.

Coordinating with the GPU driver to initialize and train NVSwitch-to-GPU NVLink interconnects (D), ensuring optimal connectivity between GPUs and switches.

Installing the GPU operator and vGPU driver is typically handled separately and not part of Fabric Manager's core tasks.
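On HGX systems, Fabric Manager runs as a systemd service that must be active before CUDA workloads can use the NVSwitch fabric. A quick check, assuming the standard service name from the NVIDIA Fabric Manager package:

```shell
# Fabric Manager ships as a systemd service on HGX systems; it must be
# running for NVLink/NVSwitch fabric training to complete.
systemctl status nvidia-fabricmanager

# Inspect the fabric state reported by the driver (recent drivers expose
# a "Fabric" section in the full query output):
nvidia-smi -q | grep -A 2 -i fabric
```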


Question 3

A system administrator wants to run these two commands in Base Command Manager.

main showprofile

device status apc01

What command should the system administrator use from the management node system shell?



Answer : A

Comprehensive and Detailed Explanation From Exact Extract:

The Base Command Manager command shell (cmsh) accepts the -c flag to execute multiple commands sequentially. Running cmsh -c "main showprofile; device status apc01" executes main showprofile followed by device status apc01 in a single invocation, allowing scripted or batch execution from the management node shell.
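A minimal sketch of the invocation from the management node shell:

```shell
# Run both cmsh commands in one non-interactive invocation;
# the semicolon separates sequential commands inside -c.
cmsh -c "main showprofile; device status apc01"
```

This is equivalent to starting cmsh interactively, entering main mode, running showprofile, and then querying the status of device apc01.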


Question 4

A system administrator notices that jobs are failing intermittently on Base Command Manager due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly.

How should they troubleshoot this issue?



Answer : B

Comprehensive and Detailed Explanation From Exact Extract:

Misconfiguration related to MIG mode can cause Slurm to improperly allocate GPUs, leading to job failures. The administrator should verify whether MIG has been enabled on the GPUs and ensure that Slurm's configuration matches the hardware setup. If MIG is enabled, Slurm must be configured to recognize and schedule MIG partitions correctly to avoid resource conflicts.


Question 5

You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.

To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:AI Administrator CLI for environments where automation or scripting is required?



Answer : C

Comprehensive and Detailed Explanation From Exact Extract:

When automating tasks with the Run:AI Administrator CLI, it is essential to ensure that the Kubernetes configuration file (kubeconfig) is correctly set up with cluster administrative rights. This enables the CLI to interact programmatically with the Kubernetes API for managing nodes, resources, and workloads efficiently. Without proper administrative permissions in the kubeconfig, automated operations will fail due to insufficient rights.

Manual GPU allocation is typically handled by scheduling policies rather than CLI manual assignments. The CLI does not replace kubectl commands entirely, and installation on Windows is not a critical requirement.

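Before scripting against the cluster, it is worth confirming that the active kubeconfig context really carries cluster-administrative rights. A quick check using standard kubectl commands:

```shell
# Show which context the CLI will use (reads the same kubeconfig):
kubectl config current-context

# Ask the API server whether the current identity can do anything,
# anywhere -- cluster-admin should answer "yes":
kubectl auth can-i '*' '*' --all-namespaces
```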


Question 6

You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.

What is a key requirement for enabling cloudbursting across multiple cloud providers?



Answer : C

Comprehensive and Detailed Explanation From Exact Extract:

When configuring BCM for cloudbursting across multiple cloud providers such as AWS and Azure, it is necessary to configure separate credentials for each cloud provider within BCM. This allows BCM to authenticate and manage resources appropriately in each distinct cloud environment. BCM does not automatically replicate or detect credentials, nor can a single credential set typically work across providers.


Question 7

What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?



Answer : D

Comprehensive and Detailed Explanation From Exact Extract:

Slow GPU-to-GPU communication in distributed systems often relates to the configuration of communication libraries such as NCCL (NVIDIA Collective Communications Library) or NVSHMEM. Ensuring these libraries are properly configured and optimized is critical for efficient GPU communication. Limiting GPUs or increasing RAM does not directly improve communication speed, and disabling InfiniBand would degrade performance.
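A first diagnostic step is to turn on NCCL's own logging and inspect which transports are actually in use. The environment variables below are standard NCCL settings:

```shell
# Enable NCCL diagnostic logging to see which transports (NVLink, PCIe,
# InfiniBand/RDMA sockets) are selected at initialization:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Check for misconfigurations already in the environment -- for example,
# NCCL_IB_DISABLE=1 would force traffic off InfiniBand entirely:
env | grep NCCL
```

Rerunning the workload with these variables set will print the chosen communication paths, making it easy to spot a fabric that is silently falling back to a slower transport.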

