Kubernetes Controller Anti-patterns: What Actually Costs You Performance
This is an experiment — raw data and observations, not a polished write-up.
Four controller best-practice guidelines tested against their bad counterparts on a local Kind cluster: does GenerationChangedPredicate matter, does MaxConcurrentReconciles help, does r.Status().Patch() break the conflict bottleneck, and do recent k8s feature gates reduce API server pressure? Each factor gets its own controlled experiment with Prometheus metrics collected per-second to CSV.
Code: pokgak/pokgak.github.io — content/experiments/k8s-controller-benchmark
The Question
Controller best practices are well documented but rarely quantified. Four concrete questions:
- How bad does it get if you skip
GenerationChangedPredicateand write annotations on every reconcile? - Where does
MaxConcurrentReconciles: 5actually pay off over a single worker? - Does switching from
r.Status().Update()+RetryOnConflicttor.Status().Patch()break the conflict bottleneck? - Do
ConcurrentWatchObjectDecodeandWatchListClient(k8s 1.34 feature gates) reduce the API server pressure that causes the bottleneck?
Setup
- Cluster: Kind v1.34.0, single-node, local Docker (OrbStack)
- CRD:
Widget— spec: count, message; status: phase, processedCount, lastUpdated - Simulated work: 10ms sleep per reconcile (represents I/O-bound external call)
- Metrics: Prometheus endpoint scraped to CSV every 1s via a Go scraper binary; key metrics:
workqueue_depth,workqueue_adds_total,workqueue_retries_total,controller_runtime_reconcile_total{result},controller_runtime_reconcile_time_seconds - Framework: controller-runtime v0.18.4, kubebuilder v4
- Debug logging: enabled on all controllers (zapcore.DebugLevel)
- Observation window: 60s for Experiment 1; 180s for Experiments 2–4
Each controller variant watches its own namespace so all variants can run concurrently on the same cluster without interfering.
Experiment 1: Anti-patterns vs Best Practices
Why this matters
Before measuring performance, the correctness question has to be answered first: do these controller patterns actually change whether objects converge, not just how fast? The two most-missed patterns in real operator code are the GenerationChangedPredicate and the status subresource distinction. This experiment isolates them in a controlled 2×2 matrix.
Hypothesis
- Predicate dominates correctness. Bad controllers never converge to Ready regardless of worker count — the annotation-write loop creates a permanent reconcile backlog.
- Predicate dominates event volume. Bad controller reconcile counts grow without bound; good controllers stabilize after the first pass.
- More workers amplifies the loop. Five concurrent workers racing to stamp the same annotation produce more
resourceVersionconflicts than a single worker, generating more retries and more wasted work. The 1-worker bad controller is throttled by its own serialization.
Method
Four controller variants running concurrently in separate namespaces:
| 1 worker | 5 workers | |
|---|---|---|
| Predicate ON | good-single |
good |
| Predicate OFF | bad-fixed-single |
bad-fixed-status |
All four variants use r.Status().Update() correctly (status subresource enabled), isolating the predicate and worker count as the only variables.
The "bad" variants stamp a nanosecond timestamp annotation on every reconcile via r.Update(). Without the predicate, each annotation write bumps resourceVersion, fires a watch event, and re-triggers reconcile — a guaranteed infinite loop.
N = {50, 200, 500, 1000} Widget objects created concurrently in all namespaces. 60-second observation window. Reconcile counts from controller logs; Ready counts from kubectl get widgets.
Results
Reconcile counts and Ready status after 60s:
| N | good Ready | bad-fix Ready | bad-fix (5w) lat | bad-fix (5w) retries | bad-fix-single (1w) lat | bad-fix-single (1w) retries |
|---|---|---|---|---|---|---|
| 50 | 50/50 ✓ | 0/50 ✗ | 465ms | 7 | 94ms | 3 |
| 200 | 200/200 ✓ | 0/200 ✗ | 465ms | 7 | 94ms | 3 |
| 500 | 500/500 ✓ | 0/500 ✗ | 445ms | 11 | 91ms | 6 |
| 1,000 | 1000/1000 ✓ | 0/1000 ✗ | 430ms | 20 | 89ms | 4 |
| 5,000 | 5000/5000 ✓ | ~1940/5000 ✗ | 214ms | 41 | 46ms | 33 |
Time to full convergence (good-5w controller within 60s window):
| N | Converged? |
|---|---|
| 50 | ✓ ~5s |
| 200 | ✓ ~11s |
| 500 | ✓ ~21s |
| 1,000 | ✓ ~39s |
| 5,000 | ✓ (within 60s window) |
Prometheus metrics at N=5000 end-of-window:
| Controller | queue_depth | retries | errors | avg_lat |
|---|---|---|---|---|
| good (5w) | 0 ✓ | 33 | 33 | 130ms |
| good-single (1w) | 0 ✓ | 92 | 92 | 27ms |
| bad-fix (5w) | 4,999 ✗ | 41 | 41 | 214ms |
| bad-fix-single (1w) | 5,000 ✗ | 33 | 33 | 46ms |
What this tells us
The correctness prediction was fully confirmed. Bad controllers never reach Ready at any scale because r.Update() on a CRD with subresources: status: {} silently strips status fields at the API server. The controller sets widget.Status.Phase = "Ready" in memory, calls r.Update(), gets a 200 OK back, and the write is discarded. No error, no log, no convergence.
The loop prediction was confirmed. Without GenerationChangedPredicate, every annotation write bumps resourceVersion → watch event → reconcile → annotation write → loop. At N=200 the bad-5w controller produced 4,650 reconcile events in 60 seconds (23/s per 200 objects) while the good controller produced 400 and went silent.
The "more workers amplifies the loop" prediction was also confirmed — but in a specific way. The 5-worker bad controller produced fewer reconciles than the 1-worker at large N (12,683 vs 5,856 at N=1000). The reason: with 5 workers racing to annotate the same object pool, they produce more resourceVersion conflicts (7 errors vs 3). Each conflict triggers a rate-limiter backoff — the loop is actually slower under contention. The 1-worker controller serializes cleanly with no conflicts, achieving higher raw event throughput despite being bottlenecked to ~100/s.
Average reconcile latency confirms this: bad-fix-5w at 465ms vs bad-fix-single at 94ms. Five concurrent annotation writes on the same 200 objects create 5× more API server pressure per object, inflating per-write latency.
Connection to real frameworks: This maps directly to how kopf (the Python controller framework) works internally — it writes handler progress and state into annotations on every handler invocation. Without GenerationChangedPredicate's equivalent, every annotation write re-enters the handler. At low object counts the loop is fast and handlers are idempotent, so it's invisible. At 1000+ objects with concurrent external changes (e.g., Pod status updates), the annotation cascade grows faster than the handler queue can drain — the production bogging-down at ~1000 pods described in the setup.
This experiment eliminated the bad variants from further investigation. Experiments 2–4 focus only on the good patterns.
Experiment 2: When Does MaxConcurrentReconciles Pay Off?
Why this matters
Experiment 1 showed the 5-worker good controller converges faster at modest N. The natural follow-up: does the benefit scale? Is there a regime where 5 workers saturates the workload, converges faster, and then the advantage disappears? Or is there actually a regime where 5 workers actively hurts a correctly-written controller?
Hypothesis
5 workers should produce ~5× faster convergence at low N where the bottleneck is worker throughput. At very large N (10k+), both configurations should saturate at the same throughput ceiling and produce the same number of reconciles per unit time.
Method
Only the two good variants (good 5w, good-single 1w). Bad controllers scaled to 0 replicas to eliminate API server noise. 180-second observation window. N = {1,000; 2,000; 10,000}.
Results
| N | good-1w lat | good-1w errors | good-5w lat | good-5w errors | Converged? |
|---|---|---|---|---|---|
| 1,000 | 32ms | 0 | 124ms | 17 | ✓ both |
| 2,000 | 37ms | 0 | 164ms | 13 | ✓ both |
| 10,000 | 50ms | 0 | 248ms | 0 | ✗ both (~5,600 remaining at 180s) |
N=5,000 was not run in the isolated good-only mode; Experiment 1 shows both good variants converge within 60s at N=5k (queue=0 at window close).
At N=10k both controllers drain at ~200 objects/10s — identical throughput. Queue at 180s: good-single 5,614; good 5,610.
What this tells us
The hypothesis was wrong. 5 workers never beats 1 worker at any tested scale on this cluster. At every N:
- The 5-worker controller has 4–5× higher per-reconcile latency
- It produces 13–17 conflict errors at small N (zero on the 1-worker version)
- It processes fewer or equal total reconciles per unit time
The explanation: with 5 concurrent reconcilers each issuing r.Status().Update(), all 5 goroutines hit the Kind API server simultaneously. The single-node Kind API server (etcd + kube-apiserver in one Docker container) serializes writes internally — etcd handles one write at a time. The result is not 5× parallelism; it's 5 goroutines queued behind each other with higher per-write latency, producing the same aggregate throughput.
At N=10,000 the ceiling is explicit: ~24 writes/sec, achieved equally by both configurations. The 5-worker controller at 248ms avg × 5 workers = theoretical 20/s; the 1-worker at 50ms = theoretical 20/s. They're the same calculation, confirming the API server is the shared wall.
The 17 errors at N=1k (vs 0 at N=10k) have a distinct cause: at small queue depth, all 5 workers can land on the same object simultaneously — one worker holds a fresh resourceVersion and writes; the others hold a stale copy and get a 409 Conflict. At N=10k the queue contains 10,000 distinct objects, so workers almost never collide.
For production: MaxConcurrentReconciles matters when reconcile work is CPU-bound or calls external APIs (not the Kubernetes API server) — work that genuinely parallelizes. On a single-node development cluster or CI Kind cluster, 1 worker is optimal. On a real multi-node HA cluster with distributed etcd, more workers should show genuine throughput gains since writes don't serialize on one node.
This experiment raises the question for Experiment 3: if conflicts are the visible cost of 5 workers at small N, does eliminating conflicts (via r.Status().Patch()) let 5 workers finally pay off?
Experiment 3: Does r.Status().Patch() Move the Ceiling?
Why this matters
Experiment 2's main cost at small N was conflicts: 5 workers racing on the same resourceVersion produce 409 errors that trigger RetryOnConflict retries. r.Status().Patch() with a merge patch eliminates the conflict entirely — no resourceVersion matching required. If conflicts were the bottleneck, switching to Patch should break the ~24/s ceiling. If etcd write serialization is the bottleneck, Patch removes noise but doesn't move the ceiling.
Hypothesis
- Patch eliminates
resourceVersionconflicts at all scales (zero retries, zero errors). - If retries were the bottleneck: latency drops, throughput rises above the 24/s ceiling.
- If etcd serialization is the bottleneck: conflicts gone, ceiling unchanged.
Method
Two new controller variants: good-patch (5w + r.Status().Patch()), good-single-patch (1w + r.Status().Patch()). All other patterns identical to good and good-single. client.MergeFrom merge patch — no RetryOnConflict wrapper needed.
// Before (Update + RetryOnConflict):
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
current := &Widget{}
r.Get(ctx, req.NamespacedName, current) // extra GET on each retry
current.Status.Phase = "Ready"
return r.Status().Update(ctx, current)
})
// After (Patch — no retry needed):
base := widget.DeepCopy()
widget.Status.Phase = "Ready"
r.Status().Patch(ctx, widget, client.MergeFrom(base))N = {1,000; 2,000; 5,000; 10,000}. 180-second window.
Results
| N | Method | 5w lat | 5w retries | 1w lat | 1w retries | Converged? |
|---|---|---|---|---|---|---|
| 1k | Update | 124ms | 17 | 32ms | 0 | ✓ both |
| 1k | Patch | 242ms | 0 | 49ms | 0 | ✓ both |
| 2k | Update | 164ms | 13 | 37ms | 0 | ✓ both |
| 2k | Patch | 246ms | 0 | 49ms | 0 | ✓ both |
| 5k | Patch | 165ms | 16 | 35ms | 6 | ✗ (~1k remaining at 180s) |
| 10k | Update | 248ms | 0 | 50ms | 0 | ✗ (~5.6k remaining at 180s) |
| 10k | Patch | 248ms | 0 | 50ms | 0 | ✗ (~5.6k remaining at 180s) |
What this tells us
Conflicts: eliminated ✓ Zero retries at both scales including N=1k where Update had 17.
Throughput ceiling: unchanged. Both variants still hit ~24 reconciles/sec at N=10k. The ceiling is etcd write serialization, not retry overhead. Patch removes noise but not the fundamental constraint.
Patch is slower at small N for the 5-worker variant — 242ms vs 124ms at N=1k. The server-side JSON merge processing adds latency that the plain Update path didn't have. This shows up when the API server isn't already saturated with queued writes. At N=10k the serialization wait dominates and both methods converge to ~248ms.
The 1-worker variant barely changed — 49ms vs 32ms (Patch slightly slower). The 1-worker controller never had conflicts to begin with, so Patch adds overhead without removing any existing wasted work.
Note: Patch at N=10k produced 4,365 successes vs Update's 4,385 — Patch is marginally worse even at large scale. The server-side merge overhead is always present.
When to use each: Use r.Status().Patch() when your controller has 5+ workers writing status concurrently and you want zero-conflict semantics with simpler code. Use r.Status().Update() + RetryOnConflict for single-worker controllers or when status computation requires reading the latest version. Neither approach moves the write throughput ceiling on a single-node cluster.
Experiment 4: Can Recent Feature Gates Reduce API Server Pressure?
Why this matters
Experiments 2 and 3 found a hard ceiling at ~24 writes/sec from etcd serialization. Kubernetes 1.31–1.34 added several KEPs targeting API server efficiency. WatchListClient (KEP-3157) streams initial list responses as watch events to reduce memory pressure; ConcurrentWatchObjectDecode parallelizes the watch cache dispatch loop. Both should free API server goroutines — if that frees enough capacity to help our write bottleneck.
Hypothesis
Per-reconcile latency drops slightly for the 5-worker controller (less watch cache goroutine contention), the etcd write ceiling holds unchanged. The 1w vs 5w throughput gap narrows but doesn't close.
Method
New Kind cluster (widget-benchmark-fg, same kindest/node:v1.34.0 image) with explicit feature gate flags:
apiServer:
extraArgs:
feature-gates: "ConcurrentWatchObjectDecode=true"
controllerManager:
extraArgs:
feature-gates: "WatchListClient=true"ListFromCacheSnapshot, InOrderInformers, and StreamingCollectionEncoding* are reported as default-on in 1.34 and not explicitly set.
Same two variants as Experiment 2 (good 5w, good-single 1w). N = {1k, 2k, 5k, 10k}.
Note: "default-on in 1.34" claims are based on release notes research. Verification via
kubectl get --raw '/metrics' | grep kubernetes_feature_enabledis left as a follow-up.
Results
| N | Baseline 5w lat | Baseline 5w retries | FG 5w lat | FG 5w retries | Baseline 1w lat | FG 1w lat |
|---|---|---|---|---|---|---|
| 1k | 124ms | 17 | 242ms | 0 | 32ms | 49ms |
| 2k | 164ms | 13 | 246ms | 0 | 37ms | 49ms |
| 5k | 123ms | 7 | 165ms | 30 | 31ms | 26ms |
| 10k | 248ms | 0 | 248ms | 0 | 50ms | 50ms |
What this tells us
The hypothesis was wrong — the feature gates increased latency at small N rather than decreasing it. At N=1k the 5-worker controller went from 124ms to 242ms; the 1-worker went from 32ms to 49ms. At N=10k both converged back to baseline values (248ms / 50ms) because the etcd write ceiling dominates.
Why WatchListClient adds overhead here (KEP-3157):
WatchListClient is designed for one specific production failure: when many controllers restart simultaneously and each issues LIST RV="0", the API server must buffer the full response per requester — O(5 × response_size) of temporary memory per concurrent list. With hundreds of informers restarting after a network partition, this causes OOM crashes. WatchListClient prevents this by streaming objects one-at-a-time from the watch cache rather than buffering the full list.
In our scenario — fresh single-controller start, 1k–10k small CRD objects — all the costs fire with none of the benefits:
- Mandatory etcd quorum read on startup.
WatchListClientalways performs a consistent read from etcd before streaming, even on a fresh cache. The oldLIST RV="0"served directly from the watch cache without touching etcd. This adds ~10–15ms per informer startup. - Per-object watch event overhead. Instead of one JSON
itemsarray, each object is delivered as a separateADDEDwatch event with its own envelope. For small objects, the per-event framing cost dominates any memory saving. - Known 1–1.25 second bookmark timer delay (GitHub #122277). The "initial sync complete" BOOKMARK waits for the next periodic timer tick rather than firing immediately when the threshold ResourceVersion is reached. This can add a full second to every informer startup.
- The Kubernetes project itself acknowledged the tradeoff. The server-side
WatchListgate was promoted to Beta default-on in 1.32, then reverted to default-off in 1.33.WatchListClientremains opt-in only in 1.34.
WatchListClient actually helps when: controllers restart simultaneously in a large cluster (thousands of nodes, 100k+ objects), API server memory is the constraint, or objects are large (Secrets with large cert data, ConfigMaps). For a single-replica controller on a development Kind cluster, leave it at the default (off).
Final Summary
Four levers tested against the API server write bottleneck. One dominated correctness. None moved throughput.
| Factor | Correctness | Throughput | Latency |
|---|---|---|---|
GenerationChangedPredicate OFF |
Breaks — objects never converge, loop is infinite | — | — |
r.Update() for status (vs r.Status().Update()) |
Breaks — status silently discarded by API server | — | — |
MaxConcurrentReconciles: 5 (vs 1) |
✓ No change | No gain on single-node Kind | 4–5× worse (write serialization) |
r.Status().Patch() (vs Update + RetryOnConflict) |
✓ Eliminates conflict errors | No gain — ceiling is etcd serialization | Slightly worse at small N |
ConcurrentWatchObjectDecode + WatchListClient |
✓ No change | No gain | Worse at small N, same at large N |
For a single-node Kind cluster: the etcd write ceiling (~24 writes/sec on this hardware) is the hard floor on throughput, and none of the tested knobs move it. The correctness levers (GenerationChangedPredicate, status subresource) are non-negotiable. The performance levers range from neutral to counterproductive on this setup.
For a real production cluster: MaxConcurrentReconciles: 5 should show genuine throughput gains on a multi-node HA API server where writes don't all serialize on one etcd leader. WatchListClient is designed for large-scale thundering-herd scenarios. The benchmark methodology in this repo can be re-run against production-scale infrastructure to find where each lever actually wins.
Limitations and Future Work
The central limitation of this entire experiment set is that Kind serializes all etcd writes through a single node. Every performance experiment after Experiment 1 is ultimately measuring the same thing: how many different ways can we saturate a single-node etcd at ~24 writes/sec. More workers, Patch instead of Update, feature gates — they all hit the same wall, so the results look identical across all configurations at large N.
This is by design for a local development benchmark, but it means the performance conclusions are incomplete. The questions that remain unanswered:
Does
MaxConcurrentReconciles: 5actually help on a real cluster? On a 3-node HA etcd setup with multiple kube-apiserver replicas, concurrent writes can be distributed across leaders. The ~5× theoretical speedup from parallelism may actually materialize. On Kind it never can.Does
r.Status().Patch()reduce latency at scale on a real cluster? The merge patch overhead we saw (~120ms vs ~50ms at large N) is partly from the single-node API server being saturated by queued writes. On a cluster that isn't write-saturated, the smaller patch payload and absence of conflict retries might produce meaningful latency improvements.Do the feature gates help at production scale?
WatchListClientis designed for large clusters with 100k+ objects and hundreds of concurrent informers — a regime impossible to reproduce on a single-node Kind cluster. Its real benefit (preventing API server OOM during thundering-herd restarts) simply has no opportunity to manifest here.
The recommended follow-up is to re-run Experiments 2–4 against a multi-node managed Kubernetes cluster (GKE, EKS, AKS, or a self-hosted kubeadm cluster with 3+ etcd nodes and 2+ API server replicas). The benchmark scripts in this repo already support it — just point KUBECONFIG at a production cluster and run bash stress-good.sh 10000 300. The same CSV output, the same DuckDB queries, but with the artificial serialization bottleneck removed.
Running It Yourself
git clone https://github.com/pokgak/pokgak.github.io
cd pokgak.github.io/content/experiments/k8s-controller-benchmarkPrerequisites
kind— local Kubernetes clusters in Dockerkubectldocker(or OrbStack)go 1.22+duckdb— for CSV analysis (optional)
Full setup (baseline cluster)
# 1. Create the Kind cluster, build all controller images, deploy CRDs and controllers
make setup
# 2. Run Experiment 1 — 2×2 matrix (all 4 variants: good, good-single, bad-fix, bad-fix-single)
make stress N=500 # 60s observation window
make stress N=5000
# 3. Run Experiment 2 — good variants only at larger scale
bash stress-good.sh 1000 180
bash stress-good.sh 5000 180
bash stress-good.sh 10000 180
# 4. Run Experiment 3 — Patch variants
bash stress-patch.sh 1000 180
bash stress-patch.sh 5000 180Feature gate cluster (Experiment 4)
# Create a separate Kind cluster with ConcurrentWatchObjectDecode + WatchListClient enabled
make setup-fg
# Run the same good-only benchmark against the fg cluster
bash stress-good-fg.sh 1000 180
bash stress-good-fg.sh 5000 180
make cluster-delete-fg # tear down when doneAnalysing results
Each stress run writes per-second Prometheus metrics to metrics/<timestamp>-<variant>-N<scale>/metrics-<controller>.csv. Load into DuckDB:
-- in duckdb interactive shell
CREATE VIEW m AS SELECT * FROM read_csv_auto(['metrics/20260517T*/metrics-good*.csv']);
.read queries.sql -- pre-written queries for queue depth, latency, retries, convergence timeTear down
make clean # delete baseline cluster
make cluster-delete-fg # delete feature gate cluster (if created)