A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?
Answer: A
Comprehensive and Detailed Explanation From Exact Extract:
In Run:ai, roles are assigned according to the level of permissions a user needs. The L1 Researcher role is designed for users who need access to GPU resources to run jobs and experiments but should not have administrative rights over cluster settings or other users. It lets researchers consume resources without being able to change cluster configuration or manage accounts. Roles such as Department Administrator, Application Administrator, and Research Manager carry broader privileges, including user and settings management, which are not appropriate for the new researcher's requirements.
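As a minimal sketch of what this separation looks like in practice, an L1 Researcher can submit workloads through the Run:ai CLI but has no commands for managing users or cluster settings. The job name, container image, and project below are illustrative assumptions, and the flag spellings follow the classic Run:ai CLI:

    # Submit a 1-GPU training job to the researcher's assigned project (names are hypothetical)
    runai submit train-job -i nvcr.io/nvidia/pytorch:24.01-py3 -g 1 -p team-a

Administrative actions such as creating users or editing cluster settings are performed through the Run:ai administration UI by roles with broader scope, which the L1 Researcher cannot access.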
You are using BCM to configure an active-passive high availability (HA) cluster for a firewall system. To ensure seamless failover, what is one best practice related to session synchronization between the active and passive nodes?
Answer: B
Comprehensive and Detailed Explanation From Exact Extract:
A best practice for active-passive HA clusters, such as the firewall systems managed via BCM, is to use a dedicated heartbeat network to synchronize session-state data between the active and passive nodes. This real-time synchronization allows the passive node to take over seamlessly if the active node fails, maintaining session continuity and minimizing downtime. Configuring different zone names or mismatched firewall models introduces incompatibility, and manual synchronization is prone to errors and delays.
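BCM drives HA setup through its own tooling (e.g., the cmha-setup wizard in Base Command Manager), but the heartbeat pattern itself can be illustrated generically. The following keepalived VRRP block is a sketch of a heartbeat link between an active and a passive Linux firewall node; the interface name, router ID, and virtual IP are assumptions, and session-state replication (e.g., via conntrackd) would run alongside it:

    vrrp_instance FW_PAIR {
        state MASTER             # set to BACKUP on the passive node
        interface eth1           # dedicated heartbeat network
        virtual_router_id 51
        priority 100             # use a lower priority on the passive node
        advert_int 1             # heartbeat interval in seconds
        virtual_ipaddress {
            10.0.0.1/24          # floating service address
        }
    }

With heartbeats flowing over a dedicated link, the passive node detects active-node failure within a few advertisement intervals and takes over the floating address with its synchronized session table already in place.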
If a Magnum IO-enabled application experiences delays during the ETL phase, what troubleshooting step should be taken?
Answer: D
Comprehensive and Detailed Explanation From Exact Extract:
Ensuring that GPUDirect Storage (GDS) is properly configured allows the application to transfer data directly from storage into GPU memory, bypassing a bounce buffer in CPU system memory and reducing latency and overhead during the ETL (Extract, Transform, Load) phase. This direct path optimizes data movement, preventing delays and improving performance for Magnum IO-enabled applications.
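A reasonable first check, assuming a standard CUDA installation, is to run the gdscheck utility that ships with GPUDirect Storage and to confirm the nvidia-fs kernel module is loaded; the tool's path and name vary by CUDA version:

    # Print GDS platform support and configuration status
    /usr/local/cuda/gds/tools/gdscheck.py -p

    # Confirm the nvidia-fs kernel module required by GDS is present
    lsmod | grep nvidia_fs

If gdscheck reports the filesystem or driver stack as unsupported, I/O silently falls back to a CPU bounce-buffer compatibility path, which is a common cause of ETL-phase slowdowns.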
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
Answer: A
Comprehensive and Detailed Explanation From Exact Extract:
In Slurm GPU resource management, the gres.conf file defines the available GPUs (generic resources) per node, while slurm.conf configures the cluster-wide GPU scheduling policies. To prevent jobs from using GPUs reserved for other purposes (e.g., display rendering GPUs), administrators must ensure that only the GPUs intended for compute workloads are listed in these configuration files.
Properly configuring gres.conf allows Slurm to recognize and expose only the GPUs intended for compute jobs.
slurm.conf must declare matching Gres entries for each node so that unlisted GPUs are never scheduled.
Manual GPU assignment with nvidia-smi is neither scalable nor integrated with Slurm scheduling.
Reinstalling drivers or increasing GPU requests does not address resource exclusion.
Thus, the correct approach is to verify and configure the GPU listings accurately in gres.conf and slurm.conf, as sketched below, so that job allocations are restricted to the intended GPUs.
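As a sketch of what that configuration might look like, assume node01 has four A100 compute GPUs at /dev/nvidia0 through /dev/nvidia3 and a display GPU at /dev/nvidia4 that should stay invisible to Slurm (node name, device files, and counts here are illustrative):

    # gres.conf on node01: list only the compute GPUs; the display GPU is simply omitted
    NodeName=node01 Name=gpu Type=a100 File=/dev/nvidia[0-3]

    # slurm.conf: enable GPU scheduling and declare the matching resources
    GresTypes=gpu
    NodeName=node01 CPUs=64 Gres=gpu:a100:4 State=UNKNOWN

Because Slurm only schedules generic resources it has been told about, the omitted device file never appears in any job's allocation.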
Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?
Answer: A
Comprehensive and Detailed Explanation From Exact Extract:
In Kubernetes architecture, the control plane is composed of several core components including the kube-apiserver, etcd (the cluster's key-value store), kube-scheduler, and kube-controller-manager. These manage the overall cluster state, scheduling, and orchestration of workloads. The worker nodes are responsible for running the actual containers and include the kubelet (agent that communicates with the control plane) and kube-proxy (handles network routing for services). Other options incorrectly assign these components or roles.
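A quick way to see this split on a live cluster, assuming it was provisioned with kubeadm so the control-plane components run as static pods, is:

    # Control-plane components appear as pods in the kube-system namespace
    kubectl get pods -n kube-system
    # expect kube-apiserver-*, etcd-*, kube-scheduler-*, kube-controller-manager-*,
    # plus a kube-proxy pod for every node

    # The kubelet runs as a host service on each node, not as a pod
    systemctl status kubelet

The scheduler decides where pods run; the kubelet on each worker node then actually starts and monitors the containers.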
You are configuring networking for a new AI cluster in your data center. The cluster will handle large-scale distributed training jobs that require fast communication between servers.
What type of networking architecture can maximize performance for these AI workloads?
Answer: D
Comprehensive and Detailed Explanation From Exact Extract:
For large-scale AI workloads such as distributed training of large language models, the networking infrastructure must deliver extremely low latency and very high throughput to keep GPUs and compute nodes efficiently synchronized. NVIDIA highlights that InfiniBand networking is essential in AI data centers because it provides ultra-low latency, high bandwidth, adaptive routing, congestion control, and noise isolation, all features critical for high-performance AI training clusters.
InfiniBand acts not just as a network but as a computing fabric, integrating compute and communication tightly. Microsoft Azure, a leading cloud provider, uses thousands of miles of InfiniBand cabling to meet the demands of their AI workloads, demonstrating its importance. While Ethernet-based solutions like NVIDIA's Spectrum-X are emerging and optimized for AI, InfiniBand remains the premier choice for AI supercomputing networks.
Therefore, for maximizing performance in a new AI cluster focused on distributed training, InfiniBand networking (option D) is the recommended architecture. Other Ethernet-based approaches provide scalability and bandwidth but cannot match InfiniBand's specialized low-latency and high-throughput performance for AI.
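When validating such a fabric, a common per-node sanity check, assuming the standard infiniband-diags tools are installed, is:

    # Show HCA port state and link rate; every compute-facing port should be Active/LinkUp
    ibstat
    # key fields to verify in the output:
    #   State: Active
    #   Physical state: LinkUp
    #   Rate: 400        (e.g., NDR; the expected rate depends on the installed generation)

A port stuck in Down or Initializing state, or linked at a lower-than-expected rate, will bottleneck collective operations across the whole training job.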
What two (2) platforms should be used with Fabric Manager? (Choose two.)
Answer: A, D
Comprehensive and Detailed Explanation From Exact Extract:
NVIDIA Fabric Manager is designed to manage and optimize fabric resources such as NVLink and NVSwitch on enterprise-class platforms like HGX and DGX systems, which contain the required fabric hardware. L40S-certified and GeForce platforms either lack these NVLink/NVSwitch fabric components or do not require Fabric Manager.
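On a qualifying HGX or DGX system, Fabric Manager typically runs as a systemd service whose effect is visible in the NVLink status; the service name below matches the standard NVIDIA package, though exact packaging can vary by distribution:

    # Check that Fabric Manager is running (required before CUDA workloads can use the NVSwitch fabric)
    systemctl status nvidia-fabricmanager

    # Verify the NVLink connectivity that Fabric Manager has brought up
    nvidia-smi nvlink --status

If the service is stopped on an NVSwitch-based system, multi-GPU jobs commonly fail at CUDA initialization, which is a quick way to confirm the dependency.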