# strace for infra troubleshooting
This is a note — quick thoughts, possibly AI-assisted. Not a fully fleshed article.
strace traces every syscall a process makes. Since almost everything a program does — reading files, spawning threads, talking to the network, allocating memory — goes through the kernel, strace gives you a ground-truth view of what's actually happening at runtime, independent of what the code or docs say.
Useful for:
- Diagnosing slow startup or imports (filesystem latency, lock contention)
- Figuring out which files a program opens (config resolution, library search paths)
- Debugging "works on my machine" failures (missing files, wrong permissions, unexpected paths)
- Understanding network behaviour without source code
- Confirming whether a program is actually sleeping, spinning, or blocked on I/O
## Basic usage

```shell
# Trace all syscalls, print to stderr
strace <command>

# Follow threads/child processes
strace -f <command>

# Aggregate summary: counts + total time per syscall (no per-call output)
strace -f -c <command>

# Filter to specific syscalls only
strace -e trace=openat,read <command>

# Attach to a running process
strace -p <pid>
```

For profiling imports specifically, run the binary directly, not through a wrapper like `uv run`; otherwise you profile the launcher waiting for the child, not the import itself.
## Case study: slow filesystem (Python import)

### Reading the output
`strace -c` prints a table sorted by time:

```
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- --------
 98.71  376.123456        1481    254012           futex
  0.91    3.470000          12    289450           openat
  0.38    1.450000           5    290000           read
```

Common mistake: `futex` dominating (>90%) looks like a locking/threading problem, not a filesystem problem. It often IS a filesystem problem, just indirectly.
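The summary format is stable enough to parse for automated before/after comparisons. A minimal sketch, assuming the column layout shown above (recent strace versions; check your own output):

```python
# Parse the summary table printed by `strace -c` into per-syscall stats.
# Assumes the standard columns: % time, seconds, usecs/call, calls,
# errors (may be blank), syscall name.

SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- --------
 98.71  376.123456        1481    254012           futex
  0.91    3.470000          12    289450           openat
  0.38    1.450000           5    290000           read
"""

def parse_strace_summary(text):
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a float percentage; skip header/separator rows.
        if not parts or not parts[0].replace(".", "", 1).isdigit():
            continue
        stats[parts[-1]] = {
            "percent": float(parts[0]),
            "seconds": float(parts[1]),
            "calls": int(parts[3]),
        }
    return stats

stats = parse_strace_summary(SAMPLE)
print(stats["futex"]["seconds"])  # 376.123456
```

Diffing two parsed summaries (slow filesystem vs. local copy) makes the comparison in the next sections mechanical.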
### Import lock amplification

Python's import system uses a per-module lock; contention on it surfaces as `futex` calls. When many threads import the same module:

- Thread A acquires the import lock and reads the file from disk, holding the lock while reading
- All other threads (B, C, D...) block on `futex` waiting for the lock
- Thread A finishes → releases → next thread acquires → reads → the rest block again
- This cascades across every module in the import graph
`strace -f -c` sums time across all threads. So 254 threads each waiting ~1.5s on a lock show up as ~381s of aggregated `futex` time, even though wall time is much shorter. A slow filesystem makes Thread A hold the lock longer, which amplifies the wait for every other thread.

The filesystem is the root cause, but it appears as `futex` time.
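The amplification is easy to reproduce without a slow filesystem. A self-contained sketch, with `time.sleep` in a throwaway module standing in for the slow disk read:

```python
# Simulate import lock amplification: one synthetic module whose import
# is slow, imported concurrently by many threads. Per-thread wait time is
# summed the same way `strace -f -c` aggregates futex time across threads.
import os, sys, tempfile, threading, time

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "slowmod.py"), "w") as f:
    f.write("import time; time.sleep(0.5)\n")  # stands in for a slow disk read
sys.path.insert(0, tmp)

waits = []

def worker():
    t0 = time.monotonic()
    import slowmod  # noqa: F401  -- all threads contend on the module's import lock
    waits.append(time.monotonic() - t0)

threads = [threading.Thread(target=worker) for _ in range(8)]
wall0 = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
wall = time.monotonic() - wall0

print(f"wall time: {wall:.2f}s, summed thread wait: {sum(waits):.2f}s")
# Summed wait is roughly (thread count) x (wall time): the futex
# inflation strace -c shows.
```

Only one thread executes the module body; the others just wait for the lock, yet every one of those waits is counted in the aggregate.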
### Distinguishing causes

| strace output | Likely cause |
|---|---|
| `futex` >> everything, filesystem syscalls tiny | Import lock amplification from slow reads, OR genuine Python/CUDA init |
| `openat`/`stat` high count + high time | Direct filesystem bottleneck (no page cache, high per-call latency) |
| `read` high time | Slow file data reads: storage or page cache issue |
| `mmap`/`mprotect` high | Shared object loading, not filesystem latency |
To tell import lock amplification apart from genuine Python/CUDA overhead:
```shell
# Copy venv to local fast storage, rerun
cp -r /slow-fs/.venv /tmp/test-venv
strace -f -c /tmp/test-venv/bin/python3 -c "import slow.module"
```

If wall time drops dramatically → filesystem was the cause. If wall time stays the same → genuine Python/CUDA init overhead.
## Syscall crash course
Knowing what each syscall does lets you read strace output without guessing.
### Filesystem

| Syscall | What it does | High count/time means |
|---|---|---|
| `openat` | Open a file | Many small files being opened (e.g. Python `.py` and `.pyc` files per import) |
| `read` | Read bytes from a file descriptor | File data reads are slow: storage or cache miss |
| `write` | Write bytes to a file descriptor | Write throughput bottleneck |
| `close` | Close a file descriptor | Usually cheap; high count confirms many opens |
| `stat` / `fstat` / `newfstatat` | Get file metadata (size, mtime, permissions) | Filesystem doing many metadata lookups; common in Python import path resolution |
| `getdents64` | Read directory entries | Directory listing; slow on large or remote directories |
| `lseek` | Move file offset | Repositioning within a file |
### Memory

| Syscall | What it does | High count/time means |
|---|---|---|
| `mmap` | Map file or anonymous memory | Library (`.so`) loading, memory allocation |
| `munmap` | Unmap memory | Cleanup of mappings |
| `mprotect` | Change memory region permissions | Part of shared library loading: set segment permissions after mapping |
| `brk` | Extend/shrink heap | Heap allocation via malloc |
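As a quick illustration of what `mmap` covers beyond library loading, Python's `mmap` module wraps the same syscall. A minimal sketch of an anonymous mapping, the kind allocators use for large allocations:

```python
import mmap

# -1 means anonymous (not file-backed); this is one mmap syscall for a
# 4 KiB page, the same call strace shows during .so loading (there it is
# file-backed instead).
m = mmap.mmap(-1, 4096)
m.write(b"hello")
m.seek(0)
data = m.read(5)
print(data)  # b'hello'
m.close()    # munmap
```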
### Threading / synchronisation

| Syscall | What it does | High count/time means |
|---|---|---|
| `futex` | Fast userspace mutex (kernel arbitrates contention) | Lock contention: many threads waiting. Often caused by a slow operation holding a lock, not the lock itself |
| `clone` | Create thread or process | High thread spawn rate |
### Process

| Syscall | What it does | High count/time means |
|---|---|---|
| `execve` | Execute a program | Process launch; slow if filesystem is slow |
| `wait4` / `waitid` | Wait for child process to exit | Parent blocked on child |
| `exit_group` | Exit all threads in process | Normal shutdown |
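A small sketch of the parent/child pair: `os.waitpid` sits in `wait4` until the child's `exit_group`. POSIX-only, and the exit code 7 is an arbitrary choice for the example:

```python
import os

pid = os.fork()
if pid == 0:
    os._exit(7)                  # child: exit_group(7)
else:
    _, status = os.waitpid(pid, 0)   # parent: blocks in wait4
    print(os.waitstatus_to_exitcode(status))  # 7
```

Under strace this is the classic "parent blocked on child" shape: all the parent's time lands in `wait4` while the child does the work. This is exactly why tracing `uv run` (see Basic usage) profiles the wrong process.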
### Networking / IPC

| Syscall | What it does | High count/time means |
|---|---|---|
| `socket` | Create a socket | Network or Unix socket setup |
| `connect` | Connect to an address | Outbound connection; high time = connection latency or refusal |
| `recvfrom` / `recvmsg` | Receive data | Network read blocked or slow |
| `sendto` / `sendmsg` | Send data | Network write |
| `epoll_wait` / `poll` / `select` | Wait for I/O events | Event loop idle time; high time is normal for servers waiting on clients |
| `ioctl` | Device control | Driver-level operations; common in GPU/CUDA and network device paths |
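To see why high `epoll_wait` time is usually benign, a small sketch using Python's `selectors` module (epoll-backed on Linux): the first wait burns 0.2s doing nothing, which is exactly the pattern a healthy idle server shows.

```python
import selectors, socket, time

r, w = socket.socketpair()
sel = selectors.DefaultSelector()       # epoll on Linux
sel.register(r, selectors.EVENT_READ)

t0 = time.monotonic()
timed_out = sel.select(timeout=0.2)     # shows up as epoll_wait in strace
elapsed = time.monotonic() - t0
print(timed_out)                        # []  (nothing readable: pure idle time)

w.send(b"x")
ready = sel.select(timeout=0.2)         # returns immediately: data is waiting
print(len(ready))                       # 1
```

The first call accounts for nearly all the elapsed time yet indicates no problem at all; the second shows the same syscall returning instantly once there is work.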
### Timing

| Syscall | What it does | High count/time means |
|---|---|---|
| `nanosleep` / `clock_nanosleep` | Sleep for a duration | Deliberate sleeps or backoff loops in the program |
| `clock_gettime` | Read a clock | Profiling or timeout checks inside the program; very cheap, and often handled in the vDSO without entering the kernel at all |
## Quick pattern guide

- Slow startup, many `openat` + `stat` → filesystem latency on metadata ops. Common with network filesystems and Python imports.
- `futex` dominates, filesystem syscalls are fast → lock contention amplified by slow I/O (see import lock amplification above), or genuine thread synchronisation.
- `mmap` + `mprotect` high time → shared library loading from slow storage.
- `epoll_wait` dominates → program is idle waiting for network; usually expected.
- `nanosleep` unexpectedly high → check for retry loops or deliberate throttling in the code.
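These heuristics can be sketched as a rough first-guess classifier over an `strace -c` summary. The thresholds are illustrative guesses, not anything strace defines:

```python
# Classify an `strace -c` summary, given as {syscall: seconds of time},
# into a likely cause. Thresholds are illustrative, not calibrated.
def likely_cause(times):
    total = sum(times.values()) or 1.0

    def frac(*names):
        return sum(times.get(n, 0.0) for n in names) / total

    if frac("futex") > 0.9:
        return "lock contention (check whether slow I/O is holding a lock)"
    if frac("openat", "stat", "newfstatat") > 0.5:
        return "filesystem metadata latency"
    if frac("mmap", "mprotect") > 0.5:
        return "shared library loading"
    if frac("epoll_wait", "poll", "select") > 0.5:
        return "idle event loop (usually expected)"
    if frac("nanosleep", "clock_nanosleep") > 0.5:
        return "deliberate sleeps or throttling"
    return "no dominant pattern"

# The case study's numbers land in the first bucket:
print(likely_cause({"futex": 376.1, "openat": 3.5, "read": 1.5}))
# lock contention (check whether slow I/O is holding a lock)
```

A real version would also look at call counts and usecs/call, but even this crude cut separates the patterns in the list above.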
## Caveats

- `strace -f` adds overhead; wall times will be longer than uninstrumented runs. Use for relative comparisons only.
- CUDA init (`libcuda.so` loading) contributes a fixed 3–8s baseline regardless of filesystem.
- Lock amplification scales with thread count. Single-threaded imports won't show this pattern.