Aiman Ismail

GPU Infrastructure Troubleshooting

This is a note — quick thoughts, possibly AI-assisted. Not a fully fleshed article.

gpuinfrastructurenetworkingncclinfinibandtroubleshooting

Runbooks and diagnostic procedures for GPU cluster performance issues.

Node-Level: IOMMU, ACS, PCIe

IOMMU

IOMMU sits between PCIe devices and system memory, translating device-virtual addresses to physical addresses. Every DMA transaction goes through this translation.

Normal:    GPU --[DMA]--> Physical Memory
With IOMMU: GPU --[DMA]--> IOMMU --[translate IOVA->PA]--> Physical Memory

Impact: 2x-10x degradation in AllReduce bandwidth due to IOTLB thrashing, ACS routing penalty, and GPUDirect RDMA overhead.

How to Check

1. IOMMU groups count (most reliable)

ls /sys/kernel/iommu_groups/ | wc -l
  • 0 = disabled (good)
  • > 0 = active (bad for GPU perf)

2. IOMMU domain type

cat /sys/kernel/iommu_groups/*/type 2>/dev/null | sort | uniq -c
  • DMA-FQ or DMA = full translation (worst)
  • identity = passthrough (less bad)

3. Kernel command line

cat /proc/cmdline | tr ' ' '\n' | grep -i iommu

4. Per-GPU IOMMU status

for gpu in $(lspci -d 10de: -D | awk '{print $1}'); do
    group=$(readlink /sys/bus/pci/devices/$gpu/iommu_group 2>/dev/null | xargs basename 2>/dev/null)
    if [ -n "$group" ]; then
        dtype=$(cat /sys/kernel/iommu_groups/$group/type 2>/dev/null)
        echo "$gpu -> IOMMU group $group ($dtype) $(lspci -s $gpu)"
    fi
done

Fix: Disable IOMMU

Option 1: Kernel boot parameter (preferred)

# Intel:
GRUB_CMDLINE_LINUX="intel_iommu=off"
# AMD:
GRUB_CMDLINE_LINUX="amd_iommu=off"
sudo update-grub && sudo reboot

Option 2: Passthrough mode (if IOMMU must stay on)

# Intel:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# AMD:
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"

Option 3: BIOS — Disable VT-d (Intel) or AMD-Vi (AMD). Most thorough but requires physical/IPMI access.

Real-world case: H200 cluster, PXE-booted

Nodes were PXE-booted. Initial image shipped without intel_iommu=on iommu=pt — NCCL all-reduce busbw was capped at ~98-101 GB/s on large messages (target: ~180 GB/s algbw / ~337 GB/s busbw for 16-GPU).

After the PXE boot image was updated with the following params:

intel_iommu=on iommu=pt iomem=relaxed pci=bfsort,realloc=off rd.driver.blacklist=nouveau

busbw jumped from ~103 GB/s → ~484 GB/s — exceeding the original target. iommu=pt (passthrough mode) was the key change enabling GPUDirect RDMA to bypass the CPU.

Note for PXE-booted clusters: kernel params are set by the provisioning server, not /etc/default/grub. To check boot history, use sudo journalctl -b -N -k | grep "Command line:" across boots (-b -1, -b -2, etc.).

ACS (Access Control Services)

ACS controls whether P2P traffic passes through a PCIe switch directly or must route up to the root complex first.

Normal:   GPU0 --[PCIe switch]--> GPU1         (direct)
With ACS: GPU0 --[up to root complex]--> GPU1  (detour)

ACS is often auto-enabled by the kernel when IOMMU is active. IOMMU has a double penalty: translation overhead + ACS forcing suboptimal routing.

How to Check

echo "ACS devices total: $(sudo lspci -vvv 2>/dev/null | grep -c 'ACSCtl:')"
echo "ACS with SrcValid+: $(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'SrcValid+')"
echo "ACS with ReqRedir+: $(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'ReqRedir+')"
echo "ACS all bits off: $(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-')"

Key bits: ReqRedir+ is the perf killer — forces all P2P through root complex.

Quick fleet check:

echo "$(hostname): acs_redir=$(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'ReqRedir+')"

Fix: Disable ACS

for BDF in $(lspci -d '*:*' | awk '{print $1}'); do
    sudo setpci -s $BDF ECAP_ACS+6.w=0000 2>/dev/null
done

PPCIe (Protected PCIe)

Part of NVIDIA's Confidential Computing stack. Encrypts data on the PCIe bus between GPU/NVSwitch and host memory. Some cloud providers ship nodes with this enabled by default.

  • Intra-node: Minimal impact (NVLink bypasses PCIe)
  • Inter-node: Overhead on PCIe segments in the GPU -> NVSwitch -> PCIe -> NIC path

Fix: Disable PPCIe

Uses NVIDIA gpu-admin-tools:

python3 ~/gpu-admin-tools/nvidia_gpu_tools.py --devices gpus --set-ppcie-mode=off --reset-after-ppcie-mode-switch
python3 ~/gpu-admin-tools/nvidia_gpu_tools.py --devices nvswitches --set-ppcie-mode=off --reset-after-ppcie-mode-switch
reboot

Fleet Diagnostics

# With ansible
ansible compute -i inventory.ini -m shell -a \
  'echo "$(hostname): iommu_groups=$(ls /sys/kernel/iommu_groups/ 2>/dev/null | wc -l) acs_redir=$(sudo lspci -vvv 2>/dev/null | grep ACSCtl: | grep -c ReqRedir+)"'

# With pssh
pssh -h hosts.txt -i \
  'echo "$(hostname): iommu_groups=$(ls /sys/kernel/iommu_groups/ 2>/dev/null | wc -l) acs_redir=$(sudo lspci -vvv 2>/dev/null | grep ACSCtl: | grep -c ReqRedir+)"'

Post-fix Verification

dmesg | grep -i iommu
ls /sys/kernel/iommu_groups/
nvidia-smi topo -p2p r
nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 8

Recommended Settings for H100/H200 Clusters

Setting Value How
IOMMU (BIOS) Disabled BIOS > VT-d/AMD-Vi > Disabled
IOMMU (kernel) intel_iommu=off or amd_iommu=off /etc/default/grub
PCIe ACS Disabled on GPU bridges setpci or BIOS
PPCIe mode Disabled (off) nvidia_gpu_tools.py --set-ppcie-mode=off on GPUs + NVSwitches
nvidia-peermem Loaded modprobe nvidia-peermem
Persistence mode Enabled nvidia-smi -pm 1

Node-Level: Pre-Job Health Checks

Systematic verification before launching workloads. Can be scripted into Slurm prolog or run manually after node maintenance.

Quick Checklist

# 1. Driver and modules loaded
nvidia-smi > /dev/null && echo "driver: ok" || echo "driver: MISSING"
lsmod | grep -q nvidia_peermem && echo "peermem: ok" || echo "peermem: MISSING"
lsmod | grep -q gdrdrv && echo "gdrdrv: ok" || echo "gdrdrv: MISSING (optional)"

# 2. Persistence mode on
nvidia-smi -q -d PERFORMANCE | grep -i "Persistence" | head -1

# 3. Fabric Manager running (HGX/NVSwitch systems)
systemctl is-active nvidia-fabricmanager

# 4. GPU count matches expected
echo "GPUs: $(nvidia-smi -L | wc -l)"

# 5. PCIe link speed (should be Gen5 x16 for H100/H200)
nvidia-smi -q -d PERFORMANCE | grep -E "Link (Gen|Width)"

# 6. NVLink topology healthy
nvidia-smi nvlink -s

# 7. Recent XID errors (last boot)
dmesg | grep -i "Xid" | tail -5 || echo "clean"

# 8. ECC errors and page retirement
nvidia-smi -q -d ECC | grep -E "Uncorrected|Corrected" | grep -v ": 0"
nvidia-smi -q -d PAGE_RETIREMENT | grep -E "Pending|Cause"

# 9. DCGM quick diagnostic
dcgmi diag -r 2

DCGM Diagnostic Levels

dcgmi diag runs progressively deeper GPU health checks. GPUs must be idle (no running workloads).

Level Command Duration Tests
1 (quick) dcgmi diag -r 1 ~seconds Deployment checks: driver loaded, NVML working, persistence mode, GPU count, permissions
2 (medium) dcgmi diag -r 2 ~2 min Level 1 + PCIe bandwidth, targeted stress, targeted power. Catches degraded PCIe links and GPUs that fail under load
3 (long) dcgmi diag -r 3 ~15 min Level 2 + full GPU memory scan (memtest), diagnostic plugin. Catches intermittent memory errors

Level 2 is the sweet spot for pre-job checks — fast enough to run in a Slurm prolog, thorough enough to catch most hardware issues before they waste job runtime. Level 3 is for post-maintenance validation or investigating suspected hardware issues.

Example Slurm prolog snippet:

#!/bin/bash
dcgmi diag -r 2 > /tmp/dcgm-prolog-$(hostname).log 2>&1
if [ $? -ne 0 ]; then
    scontrol update nodename=$(hostname) state=drain reason="dcgmi diag failed"
    exit 1
fi

Fleet-wide pre-job check

ansible compute -i inventory.ini --become -m shell -a \
  'nvidia-smi > /dev/null && echo "driver: ok" || echo "driver: MISSING"; \
   echo "GPUs: $(nvidia-smi -L | wc -l)"; \
   echo "peermem: $(lsmod | grep -q nvidia_peermem && echo ok || echo MISSING)"; \
   echo "fabmgr: $(systemctl is-active nvidia-fabricmanager)"; \
   echo "XIDs: $(dmesg | grep -ci Xid)"'

Node-Level: Proactive Monitoring

Key metrics to collect continuously (via DCGM, Prometheus dcgm-exporter, or custom scripts).

Metric Source Why
GPU temperature nvidia-smi -q -d TEMPERATURE Thermal throttling degrades perf silently
Memory temperature DCGM DCGM_FI_DEV_MEMORY_TEMP HBM throttles independently from GPU die
Power draw nvidia-smi -q -d POWER Hitting power cap = clock throttling
Clock speeds (SM, Mem) nvidia-smi -q -d CLOCK Lower than max = throttling active
ECC error counts nvidia-smi -q -d ECC Rising corrected errors = failing memory
NVLink error counters nvidia-smi nvlink -e Link degradation before failure
PCIe replay count DCGM DCGM_FI_DEV_PCIE_REPLAY_COUNTER Rising = signal integrity issue
XID error count DCGM DCGM_EXP_XID_ERRORS_COUNT Alert on any XID >= 48

Fabric-Level: InfiniBand Diagnostics

IB PHY Error Monitoring

Metrics

  • ethtool counters: rx_err_lane_*_phy (per-lane physical layer errors on IB interfaces)
  • Exposed via Grafana Alloy's ethtool collector as node_net_ethtool_received_err_lane_*_phy
  • Query error rate with Prometheus rate() over these counters
  • Manual CLI check: ethtool -S <ib_interface> | grep rx_err_lane

Heuristics

IB uses FEC (Forward Error Correction) to silently correct physical layer errors. At high rates, FEC gets overwhelmed and link-level retransmissions occur, adding latency to NCCL collectives.

Tier Threshold Interpretation
Critical >1M err/s Likely causing retransmissions
High 300K-1M err/s Possible tail latency
Moderate 100K-300K err/s FEC handles it
Normal <100K err/s Baseline

Thresholds are heuristic based on:

  • Statistical outliers vs cluster median
  • Correlation with observed issues (NCCL hangs, XID errors on high-error nodes)
  • Natural gaps in the data distribution

Better signal: actual retransmissions

port_rcv_remote_physical_errors from perfquery counts cases where FEC failed to correct — these are actual retransmissions and a more direct indicator of link degradation than raw PHY error rates.

# Check IB PHY error counters
ethtool -S <ib_interface> | grep rx_err_lane

# Check retransmission counters (more direct signal)
perfquery -x <lid> | grep RcvRemotePhysErrors

ib_write_bw vs ibdiagnet

These test fundamentally different things. Understanding the difference prevents being fooled by "all links look healthy" when collective performance is degraded.

ib_write_bw ibdiagnet
What it does RDMA write throughput between 2 nodes Fabric-wide topology discovery + hardware counter analysis
Plane Data plane — sends actual data Management plane — reads switch/HCA registers via SMP/GMP
Scope One HCA pair at a time Every port on every switch and HCA in the fabric
Detects Link throughput (GB/s) BER, port errors, cabling mistakes, firmware mismatches, topology issues
Limitation High-BER link still shows full line rate (retransmissions transparent on single stream) Doesn't measure actual data throughput

Key insight: A link with BER 6e-08 (6 orders of magnitude above threshold) still shows 396 Gb/s on ib_write_bw because the hardware retransmits corrupted packets transparently for a single stream. Under collective load with thousands of concurrent flows, every retransmission cascades into congestion. Always run ibdiagnet even if point-to-point bandwidth looks healthy.

ibdiagnet

Fabric-wide diagnostic tool from Mellanox/NVIDIA. Discovers the entire IB fabric topology via Subnet Manager queries, reads hardware counters from every switch and HCA, and compares against known-good baselines.

Running ibdiagnet

ibdiagnet -i mlx5_0 -r --ft --ber_test --rail_validation -sc \
  --pm_pause_time 60 -P all=1 --extended_speeds all \
  --pm_per_lane --get_phy_info --get_cable_info --get_p_info \
  --cable_info_disconnected
Flag What it checks
-i mlx5_0 Start fabric discovery from this HCA port
-r Include routing information
--ber_test Bit Error Rate testing — catches degraded links before they fail
--rail_validation Verify rail-optimized cabling matches expected topology
-sc Subnet configuration checks
--pm_pause_time 60 Collect performance counters over 60s window to detect error accumulation
-P all=1 Reset all port counters first for clean delta measurement
--extended_speeds all Validate all NDR-speed links
--pm_per_lane Per-lane stats — catches single-lane degradation within a multi-lane link
--get_phy_info Physical layer info (signal quality, eye opening)
--get_cable_info Cable type, vendor, length, temperature

What ibdiagnet Catches That Point-to-Point Tests Miss

1. Bit Error Rate (BER) on links

Links with high BER still pass throughput tests because the hardware retransmits transparently. ibdiagnet reads the BER counter directly from the HCA firmware.

  • Threshold: 1e-14 (acceptable)
  • Example bad value: 6e-08 (6 orders of magnitude above threshold)
  • Fix: reseat or replace the cable

2. Port counter errors

Cumulative counters like port_rcv_remote_physical_errors that increment over time. The --pm_pause_time 60 flag with -P all=1 resets counters first, then measures the delta — catching actively degrading ports.

3. Rail cabling mismatches

In a rail-optimized fat-tree, GPU N on every node should connect to the same leaf switch (rail N). --rail_validation compares actual wiring against expected topology. Mismatches break NCCL's topology-aware algorithms (PXN, rail-local routing).

4. Firmware inconsistencies

Detects mixed firmware versions across the fabric. Mismatched firmware between HCAs or switches can cause subtle interop issues under load.

5. Uneven leaf switch downlinks

Reports actual vs expected host port counts per leaf, revealing asymmetric fabric capacity.

Output Location

ibdiagnet writes results to /var/tmp/ibdiagnet2/ by default:

  • ibdiagnet2.log — full diagnostic log
  • ibdiagnet2.db_csv — topology database
  • ibdiagnet2.pm — performance counter data

mlxlink

Per-port diagnostic tool that reads directly from the HCA firmware. Catches signal integrity issues that ibdiagnet may miss.

mlxlink vs ibdiagnet

mlxlink ibdiagnet
Scope Single HCA port Entire fabric
Error counters Cumulative since last reset/reboot Delta over --pm_pause_time window
Sensitivity Shows pre-FEC physical errors (more sensitive) Checks post-FEC symbol BER against threshold
Diagnosis Provides Recommendation (e.g., "Bad signal integrity") Reports pass/fail against thresholds
Counter reset mlxlink --pc (clear port counters) -P all=1 (reset before measurement)

Key insight: A port accumulating ~600 pre-FEC errors/second will be caught by mlxlink but may pass ibdiagnet's BER check — FEC corrects the errors at the physical layer, so the post-FEC symbol BER stays below ibdiagnet's 1e-14 threshold in the 60s window. Always run mlxlink alongside ibdiagnet for signal quality checks.

When to use which

  • ibdiagnet: first pass, fabric-wide health check. Catches topology issues, rail mismatches, firmware mismatches, and ports with high post-FEC BER
  • mlxlink: targeted follow-up on suspect nodes, or sweep all nodes to catch pre-FEC degradation that ibdiagnet misses
  • Both together: after datacenter fixes, to verify repairs. ibdiagnet for fabric-level, mlxlink for per-port signal quality

Running mlxlink across a cluster

ansible compute -i inventory.ini -m shell -a \
  "lspci -nn | grep ConnectX-7 | awk '{print \$1}' | while read bus_id; do
    echo \$bus_id; mlxlink -d \${bus_id} -c 2>/dev/null | grep 'Effective';
  done" --become

Interpreting Effective Physical Errors

These are cumulative pre-FEC errors since last counter reset. What matters is whether they're actively increasing, not the absolute number.

BER Severity Action
0 or 15E-255 Clean No action
1E-17 to 1E-15 Normal Low-level noise, monitor
1E-15 to 1E-13 Elevated Monitor, check if actively increasing
1E-13 to 1E-10 Bad Verify if stale or active; schedule cable check
>= 1E-10 Critical Likely active; verify and escalate for cable replacement

To verify if errors are active vs stale:

# Clear counters
mlxlink -d <bus_id> --pc

# Wait some time, then check
mlxlink -d <bus_id> -c | grep "Effective Physical Errors"

If errors resume immediately after clearing, the port has an active signal integrity issue. If they stay at 0, the errors were historical (e.g., from before a cable fix) and counters just hadn't been reset.

mlxlink troubleshooting flags

Flag Purpose
-c Cable/link info + effective BER
-e Extended error counters
-c -e Both — full picture
--pc Clear port counters (reset to 0)
--port_type PCIE Check PCIe link instead of IB

Diagnostic output to look for

Troubleshooting Info
--------------------
Status Opcode   : 15
Group Opcode    : PHY FW
Recommendation  : Bad signal integrity

When mlxlink shows Bad signal integrity, the firmware itself has diagnosed a hardware problem — cable or transceiver needs replacement.


Troubleshooting Checklist: Cross-Switch Performance Degradation

When NCCL collective performance degrades at scale but point-to-point looks fine:

1. Confirm the problem is cross-switch

Run the same NCCL test on nodes within a single switch unit vs across switch units. If same-SU gets full bandwidth but cross-SU degrades, the problem is in the spine/fabric.

# Same-SU baseline (should get full bandwidth)
mpirun --hostfile /tmp/hostfile-same-su ... all_reduce_perf -b 1G -e 8G -f 2 -g 1

# Cross-SU test (if this degrades, spine is the problem)
mpirun --hostfile /tmp/hostfile-cross-su ... all_reduce_perf -b 1G -e 8G -f 2 -g 1

2. Run pairwise NCCL tests to find outlier nodes

Test each suspect node paired with a known-good node. Outliers with significantly lower bandwidth need further investigation.

3. Run ib_write_bw per-HCA to verify link-level health

# Server side
ib_write_bw -d mlx5_0 --report_gbits -D 10

# Client side
ib_write_bw -d mlx5_0 --report_gbits -D 10 <server_ip>

Repeat for all 8 IB HCAs (mlx5_0 through mlx5_11). All should hit line rate (~396 Gb/s for NDR).

4. Run mlxlink sweep across all nodes

Catches pre-FEC signal degradation that ibdiagnet may miss. Run this before ibdiagnet — it's faster and per-node.

5. Run ibdiagnet for fabric-wide diagnostics

This catches topology, routing, and switch-level issues that per-node mlxlink can't see. Use the full command above.

Check the output for:

  • BER above threshold (1e-14) on any link
  • Port counter errors incrementing during the test window
  • Rail validation failures (nodes on wrong rails)
  • Firmware version mismatches across HCAs and switches
  • Uneven downlink counts on leaf switches

6. Check for SHARP availability

sharp_hello

If it fails with "Failed to connect to Aggregation Manager", SHARP is not enabled. SHARP does in-network reductions on switches, which can bypass spine congestion for collective operations.

7. Try NCCL algorithm tuning as a workaround

NCCL_ALGO=Tree mpirun ... all_reduce_perf -b 1G -e 8G -f 2 -g 1

Tree algorithm often works better on congested fabrics (~24% improvement observed in cross-SU tests).

8. Escalate to provider with evidence

Collect artifacts before escalating:

# GPU/driver bug report
sudo nvidia-bug-report.sh

# Fabric Manager logs (HGX/NVSwitch systems)
journalctl -u nvidia-fabricmanager --since "24 hours ago" > fabmgr.log

# DCGM diagnostics
dcgmi diag -r 3 2>&1 | tee dcgm-diag-r3.log

# ibdiagnet output
ls /var/tmp/ibdiagnet2/

Send to provider:

  • nvidia-bug-report.log.gz — driver state, GPU info, dmesg, XID history
  • Fabric Manager logs (if NVSwitch system)
  • ibdiagnet output files (/var/tmp/ibdiagnet2/)
  • Same-SU vs cross-SU NCCL results showing the gap
  • Specific BER values, port errors, rail mismatches
  • DCGM diag results if GPU hardware is suspect