A Slurm user needs to display real-time information about the running processes and resource usage of a Slurm job.
Which command should be used?
Answer : C
Comprehensive and Detailed Explanation From Exact Extract:
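The Slurm command sstat reports status information for running jobs and job steps, including resource usage such as CPU time, memory, and I/O. GPU usage appears only where the site's accounting configuration tracks it. Using sstat -j <jobid> or sstat -j <jobid.step> allows monitoring of active job resource consumption.
smap was a curses-based viewer of job, partition, and node layout in older Slurm releases and has been removed from current ones; it never reported per-job resource usage.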
scontrol show job gives job configuration and status but not real-time resource usage.
sinfo displays node and partition information, not job-specific resource stats.
Therefore, sstat is the correct command for real-time job process and resource monitoring.
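As a rough illustration of the sstat usage described above, the following Python sketch builds a machine-readable sstat query and parses its pipe-delimited output. The field list, the helper names, and the sample output line are assumptions chosen for illustration; the fields available at a given site can be listed with sstat --helpformat.

```python
import shlex

# Fields chosen for illustration; `sstat --helpformat` lists what is available.
FIELDS = ["JobID", "AveCPU", "MaxRSS", "MaxVMSize"]


def build_sstat_command(job_id, fields=FIELDS):
    """Return the argv list for a machine-readable sstat query."""
    return ["sstat", "-j", str(job_id), "--format=" + ",".join(fields),
            "--parsable2", "--noheader"]


def parse_sstat_output(text, fields=FIELDS):
    """Turn pipe-delimited sstat rows into one dict per job step."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(dict(zip(fields, line.strip().split("|"))))
    return rows


# Hypothetical output for one job step; real values will differ.
sample = "12345.0|00:01:30|2048K|4096K\n"
print(shlex.join(build_sstat_command("12345.0")))
print(parse_sstat_output(sample))
```

On a cluster, the argv list could be passed to subprocess.run with capture_output=True and text=True, and the resulting stdout handed to parse_sstat_output.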
A Slurm user is experiencing a frequent issue where a Slurm job is getting stuck in the "PENDING" state and unable to progress to the "RUNNING" state.
Which Slurm command can help the user identify the reason for the job's pending status?
Answer : B
Comprehensive and Detailed Explanation From Exact Extract:
The Slurm command scontrol show job <jobid> provides detailed information about a specific job, including its current status and, crucially, the reason why a job might be pending. This command shows job details such as resource requirements, dependencies, and any issues blocking the job from running.
sinfo -R lists the reasons nodes are down, drained, or failing; it does not report job-specific reasons.
sacct -j shows accounting data for jobs but typically does not explain pending causes.
squeue -u lists jobs by user and shows only a short reason code in its NODELIST(REASON) column, without the full job record needed to interpret it.
Hence, scontrol show job <jobid> is the appropriate command to diagnose why a Slurm job remains in the pending state.
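To make the diagnosis concrete, here is a minimal Python sketch that pulls the JobState and Reason fields out of scontrol show job <jobid> output. The helper name and the abbreviated sample record are hypothetical; a real record contains many more fields.

```python
import re


def pending_reason(scontrol_output):
    """Extract (state, reason) from `scontrol show job <jobid>` text."""
    state = re.search(r"\bJobState=(\S+)", scontrol_output)
    reason = re.search(r"\bReason=(\S+)", scontrol_output)
    return (state.group(1) if state else None,
            reason.group(1) if reason else None)


# Abbreviated, hypothetical output; a real record has many more fields.
sample = """JobId=4242 JobName=train
   UserId=alice(1001) GroupId=users(100)
   JobState=PENDING Reason=Resources Dependency=(null)
"""
print(pending_reason(sample))
```

Common reason codes include Resources (waiting for free nodes), Priority (other jobs are ahead in the queue), and Dependency (a prerequisite job has not finished).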
What two (2) platforms should be used with Fabric Manager? (Choose two.)
Answer : A, D
Comprehensive and Detailed Explanation From Exact Extract:
NVIDIA Fabric Manager is designed to manage and optimize fabric resources like NVLink and NVSwitch in enterprise-class platforms such as HGX and DGX systems. These platforms have the necessary hardware fabric components. L40S Certified systems and GeForce series cards have no NVSwitch fabric to manage, so Fabric Manager does not apply to them.
An administrator wants to check if the BlueMan service can access the DPU.
How can this be done?
Answer : B
Comprehensive and Detailed Explanation From Exact Extract:
The DOCA Telemetry Service (DTS) is used to monitor and verify the status and accessibility of services like BlueMan on NVIDIA DPUs. It provides telemetry data and health monitoring specific to the DPU and its services. System logs or dump files may provide indirect information but DTS is the targeted tool for this check.
You are using BCM for configuring an active-passive high availability (HA) cluster for a firewall system. To ensure seamless failover, what is one best practice related to session synchronization between the active and passive nodes?
Answer : B
Comprehensive and Detailed Explanation From Exact Extract:
A best practice for active-passive HA clusters, such as for firewall systems managed via BCM, is to use a heartbeat network to synchronize session state data between active and passive nodes. This real-time synchronization allows the passive node to take over seamlessly in case the active node fails, maintaining session continuity and minimizing downtime. Configuring different zone names or firewall models can cause incompatibility, and manual synchronization is prone to errors and delays.
After completing the installation of a Kubernetes cluster on your NVIDIA DGX systems using BCM, how can you verify that all worker nodes are properly registered and ready?
Answer : A
Comprehensive and Detailed Explanation From Exact Extract:
The standard method to verify that worker nodes are correctly registered and ready in a Kubernetes cluster is to run kubectl get nodes. This command lists all nodes and their statuses. A status of "Ready" indicates that a node is properly connected and available to schedule workloads. Checking pods or connecting to each node manually over SSH is neither a direct nor a reliable way to verify node readiness.
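For scripted checks, the same information is available as JSON. The sketch below, which assumes the output of kubectl get nodes -o json has been captured as a string, returns the names of nodes whose Ready condition is not "True". The helper name, node names, and trimmed sample document are hypothetical.

```python
import json


def not_ready_nodes(nodes_json):
    """Return names of nodes whose Ready condition is not "True"."""
    bad = []
    for node in json.loads(nodes_json)["items"]:
        conditions = node["status"].get("conditions", [])
        ready = any(c["type"] == "Ready" and c["status"] == "True"
                    for c in conditions)
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


# Trimmed, hypothetical output of `kubectl get nodes -o json`.
sample = json.dumps({"items": [
    {"metadata": {"name": "dgx-01"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "dgx-02"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]})
print(not_ready_nodes(sample))
```

An empty list means every registered node reports Ready; a node missing from the list entirely would have to be caught by comparing against the expected node names.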
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
Answer : A
Comprehensive and Detailed Explanation From Exact Extract:
In Slurm GPU resource management, the gres.conf file defines the available GPUs (generic resources) per node, while slurm.conf configures the cluster-wide GPU scheduling policies. To prevent jobs from using GPUs reserved for other purposes (e.g., display rendering GPUs), administrators must ensure that only the GPUs intended for compute workloads are listed in these configuration files.
Properly configuring gres.conf allows Slurm to recognize and expose only those GPUs meant for jobs.
slurm.conf must agree with gres.conf: GresTypes has to include gpu, and each node's Gres= count has to match the GPUs listed for that node in gres.conf.
Manual GPU assignment using nvidia-smi is not scalable or integrated with Slurm scheduling.
Reinstalling drivers or increasing GPU requests does not solve resource exclusion.
Thus, the correct approach is to verify and configure GPU listings accurately in gres.conf and slurm.conf to restrict job allocations to intended GPUs.
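One way to picture the result is the short Python sketch below, which prints matching gres.conf and slurm.conf entries for a node where one GPU is left out on purpose. The node name, GPU type, device paths, and helper names are all hypothetical, and the sketch assumes GPUs are listed explicitly rather than discovered through AutoDetect. A real slurm.conf node line also carries CPU and memory settings that are omitted here.

```python
def gres_conf_lines(node, gpu_type, compute_devices):
    """One gres.conf line per GPU device file that jobs may use."""
    return [f"NodeName={node} Name=gpu Type={gpu_type} File={dev}"
            for dev in compute_devices]


def slurm_conf_node_line(node, gpu_type, compute_devices):
    """Matching slurm.conf node entry; the count must agree with gres.conf."""
    return f"NodeName={node} Gres=gpu:{gpu_type}:{len(compute_devices)}"


# Hypothetical node: /dev/nvidia0 drives a display and is deliberately omitted.
devices = ["/dev/nvidia1", "/dev/nvidia2", "/dev/nvidia3"]

print("# gres.conf")
print("\n".join(gres_conf_lines("node01", "a100", devices)))
print("# slurm.conf")
print("GresTypes=gpu")
print(slurm_conf_node_line("node01", "a100", devices))
```

Because /dev/nvidia0 never appears in gres.conf, Slurm has no record of it and cannot hand it to a job; the Gres count of 3 in slurm.conf matches the three listed devices.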