GPU Infrastructure Troubleshooting
This is a note — quick thoughts, possibly AI-assisted. Not a fully fleshed article.
Runbooks and diagnostic procedures for GPU cluster performance issues.
Node-Level: IOMMU, ACS, PCIe
IOMMU
Sits between PCIe devices and system memory, translating device-virtual to physical addresses. Every DMA transaction goes through translation.
Normal: GPU --[DMA]--> Physical Memory
With IOMMU: GPU --[DMA]--> IOMMU --[translate IOVA->PA]--> Physical MemoryImpact: 2x-10x degradation in AllReduce bandwidth (IOTLB thrashing, ACS routing penalty, GPUDirect RDMA overhead).
How to Check
1. IOMMU groups count (most reliable)
ls /sys/kernel/iommu_groups/ | wc -l
# 0 = disabled (good), > 0 = active (bad for GPU perf)2. IOMMU domain type
cat /sys/kernel/iommu_groups/*/type 2>/dev/null | sort | uniq -c
# DMA-FQ or DMA = full translation (worst), identity = passthrough (less bad)3. Kernel command line
cat /proc/cmdline | tr ' ' '\n' | grep -i iommu4. Per-GPU IOMMU status
for gpu in $(lspci -d 10de: -D | awk '{print $1}'); do
group=$(readlink /sys/bus/pci/devices/$gpu/iommu_group 2>/dev/null | xargs basename 2>/dev/null)
if [ -n "$group" ]; then
dtype=$(cat /sys/kernel/iommu_groups/$group/type 2>/dev/null)
echo "$gpu -> IOMMU group $group ($dtype) $(lspci -s $gpu)"
fi
doneFix: Disable IOMMU
Option 1: Kernel boot parameter (preferred)
# Intel:
GRUB_CMDLINE_LINUX="intel_iommu=off"
# AMD:
GRUB_CMDLINE_LINUX="amd_iommu=off"
sudo update-grub && sudo rebootOption 2: Passthrough mode (if IOMMU must stay on)
# Intel:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# AMD:
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"Option 3: BIOS — Disable VT-d (Intel) or AMD-Vi (AMD). Most thorough but requires physical/IPMI access.
Real-world case: H200 cluster, PXE-booted
- PXE-booted nodes missing
intel_iommu=on iommu=pt - NCCL all-reduce busbw capped at ~98-101 GB/s (target: ~180 GB/s algbw / ~337 GB/s busbw for 16-GPU)
- After adding
intel_iommu=on iommu=pt iomem=relaxed pci=bfsort,realloc=off rd.driver.blacklist=nouveau: - busbw jumped ~103 GB/s -> ~484 GB/s
iommu=pt(passthrough mode) was the key change enabling GPUDirect RDMA to bypass CPU
PXE-booted clusters: kernel params set by provisioning server, not
/etc/default/grub. Check boot history:sudo journalctl -b -N -k | grep "Command line:"across boots.
ACS (Access Control Services)
Controls whether P2P traffic passes through PCIe switch directly or routes up to root complex.
Normal: GPU0 --[PCIe switch]--> GPU1 (direct)
With ACS: GPU0 --[up to root complex]--> GPU1 (detour)Often auto-enabled by kernel when IOMMU is active — IOMMU has double penalty: translation overhead + ACS forcing suboptimal routing.
How to Check
echo "ACS devices total: $(sudo lspci -vvv 2>/dev/null | grep -c 'ACSCtl:')"
echo "ACS with SrcValid+: $(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'SrcValid+')"
echo "ACS with ReqRedir+: $(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'ReqRedir+')"
echo "ACS all bits off: $(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-')"Key: ReqRedir+ is the perf killer — forces all P2P through root complex.
Quick fleet check:
echo "$(hostname): acs_redir=$(sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' | grep -c 'ReqRedir+')"Fix: Disable ACS
for BDF in $(lspci -d '*:*' | awk '{print $1}'); do
sudo setpci -s $BDF ECAP_ACS+6.w=0000 2>/dev/null
donePPCIe (Protected PCIe)
Part of NVIDIA's Confidential Computing stack. Encrypts data on PCIe bus. Some cloud providers enable by default.
- Intra-node: Minimal impact (NVLink bypasses PCIe)
- Inter-node: Overhead on PCIe segments in
GPU -> NVSwitch -> PCIe -> NICpath
Fix: Disable PPCIe
Uses NVIDIA gpu-admin-tools:
python3 ~/gpu-admin-tools/nvidia_gpu_tools.py --devices gpus --set-ppcie-mode=off --reset-after-ppcie-mode-switch
python3 ~/gpu-admin-tools/nvidia_gpu_tools.py --devices nvswitches --set-ppcie-mode=off --reset-after-ppcie-mode-switch
rebootFleet Diagnostics
# Ansible
ansible compute -i inventory.ini -m shell -a \
'echo "$(hostname): iommu_groups=$(ls /sys/kernel/iommu_groups/ 2>/dev/null | wc -l) acs_redir=$(sudo lspci -vvv 2>/dev/null | grep ACSCtl: | grep -c ReqRedir+)"'
# pssh
pssh -h hosts.txt -i \
'echo "$(hostname): iommu_groups=$(ls /sys/kernel/iommu_groups/ 2>/dev/null | wc -l) acs_redir=$(sudo lspci -vvv 2>/dev/null | grep ACSCtl: | grep -c ReqRedir+)"'Post-fix Verification
dmesg | grep -i iommu
ls /sys/kernel/iommu_groups/
nvidia-smi topo -p2p r
nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 8Recommended Settings for H100/H200 Clusters
| Setting | Value | How |
|---|---|---|
| IOMMU (BIOS) | Disabled | BIOS > VT-d/AMD-Vi > Disabled |
| IOMMU (kernel) | intel_iommu=off or amd_iommu=off |
/etc/default/grub |
| PCIe ACS | Disabled on GPU bridges | setpci or BIOS |
| PPCIe mode | Disabled (off) | nvidia_gpu_tools.py --set-ppcie-mode=off on GPUs + NVSwitches |
| nvidia-peermem | Loaded | modprobe nvidia-peermem |
| Persistence mode | Enabled | nvidia-smi -pm 1 |
Node-Level: Pre-Job Health Checks
Systematic verification before launching workloads. Can be scripted into Slurm prolog.
Quick Checklist
# 1. Driver and modules
nvidia-smi > /dev/null && echo "driver: ok" || echo "driver: MISSING"
lsmod | grep -q nvidia_peermem && echo "peermem: ok" || echo "peermem: MISSING"
lsmod | grep -q gdrdrv && echo "gdrdrv: ok" || echo "gdrdrv: MISSING (optional)"
# 2. Persistence mode
nvidia-smi -q -d PERFORMANCE | grep -i "Persistence" | head -1
# 3. Fabric Manager (HGX/NVSwitch)
systemctl is-active nvidia-fabricmanager
# 4. GPU count
echo "GPUs: $(nvidia-smi -L | wc -l)"
# 5. PCIe link speed (Gen5 x16 for H100/H200)
nvidia-smi -q -d PERFORMANCE | grep -E "Link (Gen|Width)"
# 6. NVLink topology
nvidia-smi nvlink -s
# 7. Recent XID errors
dmesg | grep -i "Xid" | tail -5 || echo "clean"
# 8. ECC errors and page retirement
nvidia-smi -q -d ECC | grep -E "Uncorrected|Corrected" | grep -v ": 0"
nvidia-smi -q -d PAGE_RETIREMENT | grep -E "Pending|Cause"
# 9. DCGM diagnostic
dcgmi diag -r 2DCGM Diagnostic Levels
GPUs must be idle.
| Level | Command | Duration | Tests |
|---|---|---|---|
| 1 (quick) | dcgmi diag -r 1 |
~seconds | Deployment: driver, NVML, persistence mode, GPU count, permissions |
| 2 (medium) | dcgmi diag -r 2 |
~2 min | Level 1 + PCIe bandwidth, targeted stress/power. Catches degraded links. |
| 3 (long) | dcgmi diag -r 3 |
~15 min | Level 2 + full memory scan, diagnostic plugin. Catches intermittent memory errors. |
Level 2 = sweet spot for pre-job checks. Fast enough for Slurm prolog, thorough enough for most hardware issues.
Example Slurm prolog:
#!/bin/bash
dcgmi diag -r 2 > /tmp/dcgm-prolog-$(hostname).log 2>&1
if [ $? -ne 0 ]; then
scontrol update nodename=$(hostname) state=drain reason="dcgmi diag failed"
exit 1
fiFleet-wide pre-job check
ansible compute -i inventory.ini --become -m shell -a \
'nvidia-smi > /dev/null && echo "driver: ok" || echo "driver: MISSING"; \
echo "GPUs: $(nvidia-smi -L | wc -l)"; \
echo "peermem: $(lsmod | grep -q nvidia_peermem && echo ok || echo MISSING)"; \
echo "fabmgr: $(systemctl is-active nvidia-fabricmanager)"; \
echo "XIDs: $(dmesg | grep -ci Xid)"'Node-Level: Proactive Monitoring
Key metrics to collect continuously (DCGM, dcgm-exporter, or custom scripts):
| Metric | Source | Why |
|---|---|---|
| GPU temperature | nvidia-smi -q -d TEMPERATURE |
Thermal throttling degrades perf silently |
| Memory temperature | DCGM DCGM_FI_DEV_MEMORY_TEMP |
HBM throttles independently |
| Power draw | nvidia-smi -q -d POWER |
Hitting power cap = clock throttling |
| Clock speeds | nvidia-smi -q -d CLOCK |
Lower than max = throttling active |
| ECC error counts | nvidia-smi -q -d ECC |
Rising corrected errors = failing memory |
| NVLink errors | nvidia-smi nvlink -e |
Link degradation before failure |
| PCIe replay count | DCGM DCGM_FI_DEV_PCIE_REPLAY_COUNTER |
Rising = signal integrity issue |
| XID error count | DCGM DCGM_EXP_XID_ERRORS_COUNT |
Alert on any XID >= 48 |
Fabric-Level: InfiniBand Diagnostics
IB PHY Error Monitoring
Metrics:
rx_err_lane_*_phyvia ethtool (per-lane physical layer errors)- Exposed via Grafana Alloy ethtool collector as
node_net_ethtool_received_err_lane_*_phy - Manual:
ethtool -S <ib_interface> | grep rx_err_lane
Heuristics (IB uses FEC to correct errors; at high rates, FEC overwhelmed -> retransmissions -> latency):
| Tier | Threshold | Interpretation |
|---|---|---|
| Critical | >1M err/s | Likely causing retransmissions |
| High | 300K-1M err/s | Possible tail latency |
| Moderate | 100K-300K err/s | FEC handles it |
| Normal | <100K err/s | Baseline |
Better signal: port_rcv_remote_physical_errors from perfquery — counts cases where FEC failed to correct (actual retransmissions).
ethtool -S <ib_interface> | grep rx_err_lane
perfquery -x <lid> | grep RcvRemotePhysErrorsib_write_bw vs ibdiagnet
| ib_write_bw | ibdiagnet | |
|---|---|---|
| What | RDMA write throughput between 2 nodes | Fabric-wide topology + hardware counter analysis |
| Plane | Data plane | Management plane (SMP/GMP) |
| Scope | One HCA pair | Every port on every switch/HCA |
| Detects | Link throughput (GB/s) | BER, port errors, cabling, firmware mismatches, topology |
| Limitation | High-BER link still shows full line rate (retransmissions transparent on single stream) | Doesn't measure actual throughput |
Key insight: A link with BER 6e-08 still shows 396 Gb/s on ib_write_bw because hardware retransmits transparently for a single stream. Under collective load, every retransmission cascades. Always run ibdiagnet even if point-to-point looks healthy.
ibdiagnet
Fabric-wide diagnostic. Discovers IB fabric topology via SM queries, reads hardware counters, compares against baselines.
ibdiagnet -i mlx5_0 -r --ft --ber_test --rail_validation -sc \
--pm_pause_time 60 -P all=1 --extended_speeds all \
--pm_per_lane --get_phy_info --get_cable_info --get_p_info \
--cable_info_disconnected| Flag | What |
|---|---|
-i mlx5_0 |
Start discovery from this HCA |
-r |
Include routing info |
--ber_test |
BER testing — catches degraded links pre-failure |
--rail_validation |
Verify rail-optimized cabling matches expected topology |
-sc |
Subnet config checks |
--pm_pause_time 60 |
Collect perf counters over 60s window |
-P all=1 |
Reset all port counters first |
--extended_speeds all |
Validate all NDR links |
--pm_per_lane |
Per-lane stats — catches single-lane degradation |
--get_phy_info |
Signal quality, eye opening |
--get_cable_info |
Cable type, vendor, length, temperature |
What ibdiagnet catches that point-to-point misses
- BER on links — high BER passes throughput tests (hardware retransmits transparently). Threshold: 1e-14. Bad: 6e-08. Fix: reseat/replace cable.
- Port counter errors — cumulative counters;
--pm_pause_time 60+-P all=1resets then measures delta - Rail cabling mismatches —
--rail_validationcompares actual wiring vs expected. Mismatches break NCCL topology-aware algorithms (PXN, rail-local routing). - Firmware inconsistencies — mixed versions cause subtle interop issues under load
- Uneven leaf switch downlinks — asymmetric fabric capacity
Output: /var/tmp/ibdiagnet2/ — ibdiagnet2.log, ibdiagnet2.db_csv, ibdiagnet2.pm
mlxlink
Per-port diagnostic reading directly from HCA firmware. Catches signal integrity issues ibdiagnet may miss.
| mlxlink | ibdiagnet | |
|---|---|---|
| Scope | Single HCA port | Entire fabric |
| Counters | Cumulative since last reset | Delta over --pm_pause_time |
| Sensitivity | Pre-FEC physical errors (more sensitive) | Post-FEC symbol BER |
| Diagnosis | Provides Recommendation (e.g., "Bad signal integrity") |
Pass/fail against thresholds |
| Reset | mlxlink --pc |
-P all=1 |
Key insight: ~600 pre-FEC errors/sec caught by mlxlink but may pass ibdiagnet's BER check (FEC corrects at physical layer). Always run mlxlink alongside ibdiagnet.
When to use:
- ibdiagnet — first pass, fabric-wide. Catches topology, rails, firmware, high post-FEC BER.
- mlxlink — targeted follow-up, or sweep all nodes for pre-FEC degradation
- Both — after datacenter fixes to verify repairs
Running mlxlink across cluster
ansible compute -i inventory.ini -m shell -a \
"lspci -nn | grep ConnectX-7 | awk '{print \$1}' | while read bus_id; do
echo \$bus_id; mlxlink -d \${bus_id} -c 2>/dev/null | grep 'Effective';
done" --becomeInterpreting Effective Physical Errors
Cumulative pre-FEC errors since last counter reset. What matters: actively increasing, not absolute number.
| BER | Severity | Action |
|---|---|---|
| 0 or 15E-255 | Clean | None |
| 1E-17 to 1E-15 | Normal | Low-level noise, monitor |
| 1E-15 to 1E-13 | Elevated | Monitor, check if increasing |
| 1E-13 to 1E-10 | Bad | Verify stale vs active; schedule cable check |
| >= 1E-10 | Critical | Likely active; escalate for cable replacement |
Verify active vs stale:
mlxlink -d <bus_id> --pc # clear counters
# wait, then:
mlxlink -d <bus_id> -c | grep "Effective Physical Errors"Errors resume immediately = active issue. Stay at 0 = historical.
mlxlink flags
| Flag | Purpose |
|---|---|
-c |
Cable/link info + effective BER |
-e |
Extended error counters |
-c -e |
Both |
--pc |
Clear port counters |
--port_type PCIE |
Check PCIe link |
When mlxlink shows Bad signal integrity, firmware has diagnosed a hardware problem — cable/transceiver needs replacement.
Checklist: Cross-Switch Performance Degradation
When NCCL collective performance degrades at scale but point-to-point looks fine:
1. Confirm problem is cross-switch
# Same-SU baseline (should get full bandwidth)
mpirun --hostfile /tmp/hostfile-same-su ... all_reduce_perf -b 1G -e 8G -f 2 -g 1
# Cross-SU test (if degrades, spine is the problem)
mpirun --hostfile /tmp/hostfile-cross-su ... all_reduce_perf -b 1G -e 8G -f 2 -g 12. Pairwise NCCL tests — each suspect node paired with known-good node. Outliers with lower bandwidth need investigation.
3. Per-HCA ib_write_bw
# Server: ib_write_bw -d mlx5_0 --report_gbits -D 10
# Client: ib_write_bw -d mlx5_0 --report_gbits -D 10 <server_ip>Repeat for all 8 IB HCAs (mlx5_0 through mlx5_11). All should hit ~396 Gb/s (NDR).
4. mlxlink sweep — catches pre-FEC signal degradation. Faster than ibdiagnet, per-node.
5. ibdiagnet — fabric-wide diagnostics. Check output for:
- BER above 1e-14
- Port counter errors incrementing
- Rail validation failures
- Firmware version mismatches
- Uneven downlink counts on leaf switches
6. Check SHARP
sharp_hello
# Fails with "Failed to connect to Aggregation Manager" = not enabledSHARP does in-network reductions, can bypass spine congestion.
7. NCCL algorithm tuning (workaround)
NCCL_ALGO=Tree mpirun ... all_reduce_perf -b 1G -e 8G -f 2 -g 1Tree often works better on congested fabrics (~24% improvement observed).
8. Escalate with evidence
sudo nvidia-bug-report.sh
journalctl -u nvidia-fabricmanager --since "24 hours ago" > fabmgr.log
dcgmi diag -r 3 2>&1 | tee dcgm-diag-r3.log
ls /var/tmp/ibdiagnet2/Send: nvidia-bug-report, fabmgr logs, ibdiagnet output, same-SU vs cross-SU results, BER values, port errors, rail mismatches, DCGM diag.