Supercomputing Skills to Learn
This is a note — quick thoughts, possibly AI-assisted. Not a fully fleshed article.
Tags: infrastructure, distributed-systems, ml, learning
Skills and knowledge areas worth investing in for GPU supercomputing and large-scale ML infrastructure work.
Core Systems
- Linux internals — process management, memory, filesystems, kernel tuning for high-throughput workloads
- Networking — TCP/IP, RDMA, InfiniBand, NVLink; understanding how interconnects affect distributed training bandwidth
- Infrastructure as Code — Terraform, Ansible, or similar for repeatable cluster provisioning
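To make the interconnect point concrete, here is a rough back-of-the-envelope sketch. It assumes a ring all-reduce, where each rank sends roughly 2(N-1)/N times the buffer size, and it ignores latency, protocol overhead, and overlap with compute. The function name, the 7B-parameter model, and the two link speeds are illustrative assumptions, not measurements of any particular fabric.

```python
def ring_allreduce_seconds(size_bytes: float, n_ranks: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce (latency ignored)."""
    if n_ranks < 1 or bandwidth_bytes_per_s <= 0:
        raise ValueError("need at least one rank and a positive bandwidth")
    if n_ranks == 1:
        return 0.0  # nothing to exchange
    # Each rank sends (and receives) 2 * (N - 1) / N of the buffer in total:
    # one pass for reduce-scatter, one pass for all-gather.
    bytes_per_rank = 2 * (n_ranks - 1) / n_ranks * size_bytes
    return bytes_per_rank / bandwidth_bytes_per_s


# Hypothetical workload: 7e9 parameters with 2-byte gradients.
GRADIENT_BYTES = 2 * 7e9

# Hypothetical per-rank link speeds in bytes per second.
LINKS = {"slow link (25 GB/s)": 25e9, "fast link (400 GB/s)": 400e9}

for name, bandwidth in LINKS.items():
    seconds = ring_allreduce_seconds(GRADIENT_BYTES, 64, bandwidth)
    print(f"{name}: {seconds:.2f} s per gradient all-reduce")
```

Even this crude model shows why the per-step communication cost scales almost directly with link bandwidth once the rank count is large.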
Cluster Orchestration
- Kubernetes — workload scheduling, resource management, custom operators; topology-aware placement for GPU pods
- Slurm — how HPC schedulers work, job queues, multi-tenant fairness policies
- Cluster provisioning — imaging nodes, capacity planning, lifecycle management at scale
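As a toy illustration of topology-aware placement, the sketch below picks nodes for a job while trying to keep them under a single switch. It is not how Kubernetes or Slurm implement placement; the inventory format (node name mapped to switch name) and the function `place_job` are hypothetical, and the strategy is a plain greedy best fit.

```python
from collections import defaultdict


def place_job(free_nodes: dict[str, str], nodes_needed: int) -> list[str]:
    """Choose nodes for a job, preferring nodes that share a switch.

    free_nodes maps node name -> switch name (hypothetical inventory format).
    """
    if nodes_needed <= 0:
        raise ValueError("nodes_needed must be positive")

    # Group free nodes by the switch they hang off.
    by_switch: dict[str, list[str]] = defaultdict(list)
    for node, switch in sorted(free_nodes.items()):
        by_switch[switch].append(node)

    # Best fit: the smallest group that still holds the whole job,
    # which leaves larger groups intact for larger jobs.
    fitting = [nodes for nodes in by_switch.values() if len(nodes) >= nodes_needed]
    if fitting:
        return min(fitting, key=len)[:nodes_needed]

    # Otherwise spill across switches, largest groups first,
    # so the job touches as few switches as possible.
    chosen: list[str] = []
    for nodes in sorted(by_switch.values(), key=len, reverse=True):
        chosen.extend(nodes[: nodes_needed - len(chosen)])
        if len(chosen) == nodes_needed:
            return chosen
    raise ValueError("not enough free nodes")


if __name__ == "__main__":
    inventory = {"n1": "sw-a", "n2": "sw-a", "n3": "sw-a", "n4": "sw-b", "n5": "sw-b"}
    print(place_job(inventory, 2))
    print(place_job(inventory, 4))
```

Real schedulers weigh many more constraints (GPU type, NUMA layout, fairness, preemption), so treat this only as a way to reason about why placement affects communication cost.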
GPU and Distributed Training
- CUDA — GPU memory model, kernel execution, profiling; understanding where bottlenecks come from
- NCCL — collective communication primitives (AllReduce, AllGather); how they map to ring/tree topologies
- Distributed training — data parallelism, tensor parallelism, pipeline parallelism; how frameworks implement them
- PyTorch distributed — torch.distributed, FSDP, DeepSpeed; practical experience running multi-node jobs
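One way to build intuition for how AllReduce maps onto a ring is to simulate it in plain Python. The sketch below is a teaching model, not NCCL's implementation: it assumes a sum reduction, one buffer element per rank (so each element acts as one chunk), and lockstep communication. It runs the two classic phases, reduce-scatter followed by all-gather.

```python
def ring_allreduce(buffers: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce over a ring of len(buffers) ranks.

    Each rank's buffer must hold exactly one element (chunk) per rank.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each buffer needs one element per rank")
    data = [list(b) for b in buffers]  # work on copies

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n
    # to rank (r + 1) % n, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends "simultaneously".
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) % n
    # to rank (r + 1) % n, which overwrites its stale copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


if __name__ == "__main__":
    print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Each rank only ever talks to its ring neighbour, and each phase takes N-1 steps, which is where the 2(N-1)/N data-volume factor for ring all-reduce comes from.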
Storage
- Parallel and shared filesystems — Lustre, GPFS, or NFS at scale; tuning for checkpoint writes and dataset reads
- Object storage — S3-compatible systems for dataset and artifact management
- Retention and tiering — how to manage large volumes of checkpoints without blowing up storage costs
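A checkpoint retention policy can be quite small in code. The sketch below assumes one common-sense rule: keep the newest few checkpoints for fast restarts, plus sparse "milestone" checkpoints at a fixed step interval. The function name and default values are hypothetical, and a real tool would delete or move files to a colder tier instead of returning a list.

```python
def checkpoints_to_delete(steps: list[int], keep_last: int = 3,
                          keep_every: int = 10_000) -> list[int]:
    """Return the checkpoint steps a simple retention policy would delete.

    Keeps the newest `keep_last` checkpoints and every checkpoint whose
    step number is a multiple of `keep_every`.
    """
    if keep_every <= 0:
        raise ValueError("keep_every must be positive")
    ordered = sorted(steps)
    # Guard keep_last == 0: ordered[-0:] would select the whole list.
    recent = set(ordered[-keep_last:]) if keep_last > 0 else set()
    return [s for s in ordered if s not in recent and s % keep_every != 0]


if __name__ == "__main__":
    saved = list(range(1_000, 13_000, 1_000))  # checkpoints at steps 1000..12000
    print(checkpoints_to_delete(saved, keep_last=3, keep_every=5_000))
```

Returning the deletion list rather than deleting directly makes the policy easy to dry-run before pointing it at real storage.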
Observability
- GPU metrics — DCGM, nvidia-smi; distinguishing compute-bound from memory-bound bottlenecks, which a high utilization reading alone does not reveal
- Distributed tracing — correlating slowdowns across nodes in a training run
- Alerting and SLOs — defining and acting on reliability targets for training infrastructure
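In synchronous training every step waits for the slowest rank, so one simple way to start correlating slowdowns across nodes is to compare per-node step times for the same step. The sketch below assumes you already collect those timings; the function name, the node names, and the 1.2x tolerance are illustrative choices, not a standard.

```python
from statistics import median


def find_stragglers(step_seconds: dict[str, float],
                    tolerance: float = 1.2) -> list[str]:
    """Return nodes whose step time exceeds `tolerance` times the median."""
    if not step_seconds:
        return []
    typical = median(step_seconds.values())
    return sorted(node for node, seconds in step_seconds.items()
                  if seconds > tolerance * typical)


if __name__ == "__main__":
    # Hypothetical timings for one training step, in seconds.
    timings = {"node-0": 1.00, "node-1": 1.02, "node-2": 0.98, "node-3": 1.60}
    print(find_stragglers(timings))
```

Using the median rather than the mean keeps a single very slow node from dragging the baseline up and hiding itself.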
Language Depth
- Python — async I/O, C extensions, profiling; writing performance-sensitive control-plane code
- Rust — memory safety, async with Tokio, FFI to C; useful for low-latency infrastructure tooling
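For a flavour of async control-plane code in Python, the sketch below fans out health probes to many nodes while capping how many run at once. `probe_node` is a hypothetical stand-in that sleeps instead of doing real network I/O, and the "-bad" suffix convention exists only to make the example deterministic.

```python
import asyncio
import random


async def probe_node(node: str) -> tuple[str, bool]:
    """Hypothetical health check: sleeps briefly instead of doing real I/O."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return node, not node.endswith("-bad")


async def probe_all(nodes: list[str], max_in_flight: int = 8) -> dict[str, bool]:
    """Probe all nodes concurrently, with at most `max_in_flight` at a time."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(node: str) -> tuple[str, bool]:
        async with limit:  # wait here if too many probes are running
            return await probe_node(node)

    # gather preserves input order and runs the coroutines concurrently.
    results = await asyncio.gather(*(bounded(n) for n in nodes))
    return dict(results)


if __name__ == "__main__":
    health = asyncio.run(probe_all(["gpu-0", "gpu-1", "gpu-2-bad"], max_in_flight=2))
    print(health)
```

The semaphore is the important design choice: without a cap, probing thousands of nodes at once can overwhelm the control plane or the nodes being checked.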
What to Prioritize
Start with Kubernetes and distributed training fundamentals: most of the areas above intersect there. Then go deeper on CUDA and NCCL once you have enough context to see why the low-level primitives matter.