Aiman Ismail

Kubernetes Controller Anti-patterns: What Actually Costs You Performance

This is an experiment — raw data and observations, not a polished write-up.

kubernetescontroller-runtimekubebuilderoperatorsbenchmarkingkind

Four controller best-practice guidelines tested against their bad counterparts on a local Kind cluster: does GenerationChangedPredicate matter, does MaxConcurrentReconciles help, does r.Status().Patch() break the conflict bottleneck, and do recent k8s feature gates reduce API server pressure? Each factor gets its own controlled experiment with Prometheus metrics collected per-second to CSV.

Code: pokgak/pokgak.github.io — content/experiments/k8s-controller-benchmark

The Question

Controller best practices are well documented but rarely quantified. Four concrete questions:

  1. How bad does it get if you skip GenerationChangedPredicate and write annotations on every reconcile?
  2. Where does MaxConcurrentReconciles: 5 actually pay off over a single worker?
  3. Does switching from r.Status().Update() + RetryOnConflict to r.Status().Patch() break the conflict bottleneck?
  4. Do ConcurrentWatchObjectDecode and WatchListClient (k8s 1.34 feature gates) reduce the API server pressure that causes the bottleneck?

Setup

  • Cluster: Kind v1.34.0, single-node, local Docker (OrbStack)
  • CRD: Widget — spec: count, message; status: phase, processedCount, lastUpdated
  • Simulated work: 10ms sleep per reconcile (represents I/O-bound external call)
  • Metrics: Prometheus endpoint scraped to CSV every 1s via a Go scraper binary; key metrics: workqueue_depth, workqueue_adds_total, workqueue_retries_total, controller_runtime_reconcile_total{result}, controller_runtime_reconcile_time_seconds
  • Framework: controller-runtime v0.18.4, kubebuilder v4
  • Debug logging: enabled on all controllers (zapcore.DebugLevel)
  • Observation window: 60s for Experiment 1; 180s for Experiments 2–4

Each controller variant watches its own namespace so all variants can run concurrently on the same cluster without interfering.


Experiment 1: Anti-patterns vs Best Practices

Why this matters

Before measuring performance, the correctness question has to be answered first: do these controller patterns actually change whether objects converge, not just how fast? The two most-missed patterns in real operator code are the GenerationChangedPredicate and the status subresource distinction. This experiment isolates them in a controlled 2×2 matrix.

Hypothesis

  • Predicate dominates correctness. Bad controllers never converge to Ready regardless of worker count — the annotation-write loop creates a permanent reconcile backlog.
  • Predicate dominates event volume. Bad controller reconcile counts grow without bound; good controllers stabilize after the first pass.
  • More workers amplifies the loop. Five concurrent workers racing to stamp the same annotation produce more resourceVersion conflicts than a single worker, generating more retries and more wasted work. The 1-worker bad controller is throttled by its own serialization.

Method

Four controller variants running concurrently in separate namespaces:

1 worker 5 workers
Predicate ON good-single good
Predicate OFF bad-fixed-single bad-fixed-status

All four variants use r.Status().Update() correctly (status subresource enabled), isolating the predicate and worker count as the only variables.

The "bad" variants stamp a nanosecond timestamp annotation on every reconcile via r.Update(). Without the predicate, each annotation write bumps resourceVersion, fires a watch event, and re-triggers reconcile — a guaranteed infinite loop.

N = {50, 200, 500, 1000} Widget objects created concurrently in all namespaces. 60-second observation window. Reconcile counts from controller logs; Ready counts from kubectl get widgets.

Results

Reconcile counts and Ready status after 60s:

N good Ready bad-fix Ready bad-fix (5w) lat bad-fix (5w) retries bad-fix-single (1w) lat bad-fix-single (1w) retries
50 50/50 ✓ 0/50 ✗ 465ms 7 94ms 3
200 200/200 ✓ 0/200 ✗ 465ms 7 94ms 3
500 500/500 ✓ 0/500 ✗ 445ms 11 91ms 6
1,000 1000/1000 ✓ 0/1000 ✗ 430ms 20 89ms 4
5,000 5000/5000 ✓ ~1940/5000 ✗ 214ms 41 46ms 33

Time to full convergence (good-5w controller within 60s window):

N Converged?
50 ✓ ~5s
200 ✓ ~11s
500 ✓ ~21s
1,000 ✓ ~39s
5,000 ✓ (within 60s window)

Prometheus metrics at N=5000 end-of-window:

Controller queue_depth retries errors avg_lat
good (5w) 0 ✓ 33 33 130ms
good-single (1w) 0 ✓ 92 92 27ms
bad-fix (5w) 4,999 ✗ 41 41 214ms
bad-fix-single (1w) 5,000 ✗ 33 33 46ms

What this tells us

The correctness prediction was fully confirmed. Bad controllers never reach Ready at any scale because r.Update() on a CRD with subresources: status: {} silently strips status fields at the API server. The controller sets widget.Status.Phase = "Ready" in memory, calls r.Update(), gets a 200 OK back, and the write is discarded. No error, no log, no convergence.

The loop prediction was confirmed. Without GenerationChangedPredicate, every annotation write bumps resourceVersion → watch event → reconcile → annotation write → loop. At N=200 the bad-5w controller produced 4,650 reconcile events in 60 seconds (23/s per 200 objects) while the good controller produced 400 and went silent.

The "more workers amplifies the loop" prediction was also confirmed — but in a specific way. The 5-worker bad controller produced fewer reconciles than the 1-worker at large N (12,683 vs 5,856 at N=1000). The reason: with 5 workers racing to annotate the same object pool, they produce more resourceVersion conflicts (7 errors vs 3). Each conflict triggers a rate-limiter backoff — the loop is actually slower under contention. The 1-worker controller serializes cleanly with no conflicts, achieving higher raw event throughput despite being bottlenecked to ~100/s.

Average reconcile latency confirms this: bad-fix-5w at 465ms vs bad-fix-single at 94ms. Five concurrent annotation writes on the same 200 objects create 5× more API server pressure per object, inflating per-write latency.

Connection to real frameworks: This maps directly to how kopf (the Python controller framework) works internally — it writes handler progress and state into annotations on every handler invocation. Without GenerationChangedPredicate's equivalent, every annotation write re-enters the handler. At low object counts the loop is fast and handlers are idempotent, so it's invisible. At 1000+ objects with concurrent external changes (e.g., Pod status updates), the annotation cascade grows faster than the handler queue can drain — the production bogging-down at ~1000 pods described in the setup.

This experiment eliminated the bad variants from further investigation. Experiments 2–4 focus only on the good patterns.


Experiment 2: When Does MaxConcurrentReconciles Pay Off?

Why this matters

Experiment 1 showed the 5-worker good controller converges faster at modest N. The natural follow-up: does the benefit scale? Is there a regime where 5 workers saturates the workload, converges faster, and then the advantage disappears? Or is there actually a regime where 5 workers actively hurts a correctly-written controller?

Hypothesis

5 workers should produce ~5× faster convergence at low N where the bottleneck is worker throughput. At very large N (10k+), both configurations should saturate at the same throughput ceiling and produce the same number of reconciles per unit time.

Method

Only the two good variants (good 5w, good-single 1w). Bad controllers scaled to 0 replicas to eliminate API server noise. 180-second observation window. N = {1,000; 2,000; 10,000}.

Results

N good-1w lat good-1w errors good-5w lat good-5w errors Converged?
1,000 32ms 0 124ms 17 ✓ both
2,000 37ms 0 164ms 13 ✓ both
10,000 50ms 0 248ms 0 ✗ both (~5,600 remaining at 180s)

N=5,000 was not run in the isolated good-only mode; Experiment 1 shows both good variants converge within 60s at N=5k (queue=0 at window close).

At N=10k both controllers drain at ~200 objects/10s — identical throughput. Queue at 180s: good-single 5,614; good 5,610.

What this tells us

The hypothesis was wrong. 5 workers never beats 1 worker at any tested scale on this cluster. At every N:

  • The 5-worker controller has 4–5× higher per-reconcile latency
  • It produces 13–17 conflict errors at small N (zero on the 1-worker version)
  • It processes fewer or equal total reconciles per unit time

The explanation: with 5 concurrent reconcilers each issuing r.Status().Update(), all 5 goroutines hit the Kind API server simultaneously. The single-node Kind API server (etcd + kube-apiserver in one Docker container) serializes writes internally — etcd handles one write at a time. The result is not 5× parallelism; it's 5 goroutines queued behind each other with higher per-write latency, producing the same aggregate throughput.

At N=10,000 the ceiling is explicit: ~24 writes/sec, achieved equally by both configurations. The 5-worker controller at 248ms avg × 5 workers = theoretical 20/s; the 1-worker at 50ms = theoretical 20/s. They're the same calculation, confirming the API server is the shared wall.

The 17 errors at N=1k (vs 0 at N=10k) have a distinct cause: at small queue depth, all 5 workers can land on the same object simultaneously — one worker holds a fresh resourceVersion and writes; the others hold a stale copy and get a 409 Conflict. At N=10k the queue contains 10,000 distinct objects, so workers almost never collide.

For production: MaxConcurrentReconciles matters when reconcile work is CPU-bound or calls external APIs (not the Kubernetes API server) — work that genuinely parallelizes. On a single-node development cluster or CI Kind cluster, 1 worker is optimal. On a real multi-node HA cluster with distributed etcd, more workers should show genuine throughput gains since writes don't serialize on one node.

This experiment raises the question for Experiment 3: if conflicts are the visible cost of 5 workers at small N, does eliminating conflicts (via r.Status().Patch()) let 5 workers finally pay off?


Experiment 3: Does r.Status().Patch() Move the Ceiling?

Why this matters

Experiment 2's main cost at small N was conflicts: 5 workers racing on the same resourceVersion produce 409 errors that trigger RetryOnConflict retries. r.Status().Patch() with a merge patch eliminates the conflict entirely — no resourceVersion matching required. If conflicts were the bottleneck, switching to Patch should break the ~24/s ceiling. If etcd write serialization is the bottleneck, Patch removes noise but doesn't move the ceiling.

Hypothesis

  • Patch eliminates resourceVersion conflicts at all scales (zero retries, zero errors).
  • If retries were the bottleneck: latency drops, throughput rises above the 24/s ceiling.
  • If etcd serialization is the bottleneck: conflicts gone, ceiling unchanged.

Method

Two new controller variants: good-patch (5w + r.Status().Patch()), good-single-patch (1w + r.Status().Patch()). All other patterns identical to good and good-single. client.MergeFrom merge patch — no RetryOnConflict wrapper needed.

// Before (Update + RetryOnConflict):
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
    current := &Widget{}
    r.Get(ctx, req.NamespacedName, current)  // extra GET on each retry
    current.Status.Phase = "Ready"
    return r.Status().Update(ctx, current)
})

// After (Patch — no retry needed):
base := widget.DeepCopy()
widget.Status.Phase = "Ready"
r.Status().Patch(ctx, widget, client.MergeFrom(base))

N = {1,000; 2,000; 5,000; 10,000}. 180-second window.

Results

N Method 5w lat 5w retries 1w lat 1w retries Converged?
1k Update 124ms 17 32ms 0 ✓ both
1k Patch 242ms 0 49ms 0 ✓ both
2k Update 164ms 13 37ms 0 ✓ both
2k Patch 246ms 0 49ms 0 ✓ both
5k Patch 165ms 16 35ms 6 ✗ (~1k remaining at 180s)
10k Update 248ms 0 50ms 0 ✗ (~5.6k remaining at 180s)
10k Patch 248ms 0 50ms 0 ✗ (~5.6k remaining at 180s)

What this tells us

Conflicts: eliminated ✓ Zero retries at both scales including N=1k where Update had 17.

Throughput ceiling: unchanged. Both variants still hit ~24 reconciles/sec at N=10k. The ceiling is etcd write serialization, not retry overhead. Patch removes noise but not the fundamental constraint.

Patch is slower at small N for the 5-worker variant — 242ms vs 124ms at N=1k. The server-side JSON merge processing adds latency that the plain Update path didn't have. This shows up when the API server isn't already saturated with queued writes. At N=10k the serialization wait dominates and both methods converge to ~248ms.

The 1-worker variant barely changed — 49ms vs 32ms (Patch slightly slower). The 1-worker controller never had conflicts to begin with, so Patch adds overhead without removing any existing wasted work.

Note: Patch at N=10k produced 4,365 successes vs Update's 4,385 — Patch is marginally worse even at large scale. The server-side merge overhead is always present.

When to use each: Use r.Status().Patch() when your controller has 5+ workers writing status concurrently and you want zero-conflict semantics with simpler code. Use r.Status().Update() + RetryOnConflict for single-worker controllers or when status computation requires reading the latest version. Neither approach moves the write throughput ceiling on a single-node cluster.


Experiment 4: Can Recent Feature Gates Reduce API Server Pressure?

Why this matters

Experiments 2 and 3 found a hard ceiling at ~24 writes/sec from etcd serialization. Kubernetes 1.31–1.34 added several KEPs targeting API server efficiency. WatchListClient (KEP-3157) streams initial list responses as watch events to reduce memory pressure; ConcurrentWatchObjectDecode parallelizes the watch cache dispatch loop. Both should free API server goroutines — if that frees enough capacity to help our write bottleneck.

Hypothesis

Per-reconcile latency drops slightly for the 5-worker controller (less watch cache goroutine contention), the etcd write ceiling holds unchanged. The 1w vs 5w throughput gap narrows but doesn't close.

Method

New Kind cluster (widget-benchmark-fg, same kindest/node:v1.34.0 image) with explicit feature gate flags:

apiServer:
  extraArgs:
    feature-gates: "ConcurrentWatchObjectDecode=true"
controllerManager:
  extraArgs:
    feature-gates: "WatchListClient=true"

ListFromCacheSnapshot, InOrderInformers, and StreamingCollectionEncoding* are reported as default-on in 1.34 and not explicitly set.

Same two variants as Experiment 2 (good 5w, good-single 1w). N = {1k, 2k, 5k, 10k}.

Note: "default-on in 1.34" claims are based on release notes research. Verification via kubectl get --raw '/metrics' | grep kubernetes_feature_enabled is left as a follow-up.

Results

N Baseline 5w lat Baseline 5w retries FG 5w lat FG 5w retries Baseline 1w lat FG 1w lat
1k 124ms 17 242ms 0 32ms 49ms
2k 164ms 13 246ms 0 37ms 49ms
5k 123ms 7 165ms 30 31ms 26ms
10k 248ms 0 248ms 0 50ms 50ms

What this tells us

The hypothesis was wrong — the feature gates increased latency at small N rather than decreasing it. At N=1k the 5-worker controller went from 124ms to 242ms; the 1-worker went from 32ms to 49ms. At N=10k both converged back to baseline values (248ms / 50ms) because the etcd write ceiling dominates.

Why WatchListClient adds overhead here (KEP-3157):

WatchListClient is designed for one specific production failure: when many controllers restart simultaneously and each issues LIST RV="0", the API server must buffer the full response per requester — O(5 × response_size) of temporary memory per concurrent list. With hundreds of informers restarting after a network partition, this causes OOM crashes. WatchListClient prevents this by streaming objects one-at-a-time from the watch cache rather than buffering the full list.

In our scenario — fresh single-controller start, 1k–10k small CRD objects — all the costs fire with none of the benefits:

  1. Mandatory etcd quorum read on startup. WatchListClient always performs a consistent read from etcd before streaming, even on a fresh cache. The old LIST RV="0" served directly from the watch cache without touching etcd. This adds ~10–15ms per informer startup.
  2. Per-object watch event overhead. Instead of one JSON items array, each object is delivered as a separate ADDED watch event with its own envelope. For small objects, the per-event framing cost dominates any memory saving.
  3. Known 1–1.25 second bookmark timer delay (GitHub #122277). The "initial sync complete" BOOKMARK waits for the next periodic timer tick rather than firing immediately when the threshold ResourceVersion is reached. This can add a full second to every informer startup.
  4. The Kubernetes project itself acknowledged the tradeoff. The server-side WatchList gate was promoted to Beta default-on in 1.32, then reverted to default-off in 1.33. WatchListClient remains opt-in only in 1.34.

WatchListClient actually helps when: controllers restart simultaneously in a large cluster (thousands of nodes, 100k+ objects), API server memory is the constraint, or objects are large (Secrets with large cert data, ConfigMaps). For a single-replica controller on a development Kind cluster, leave it at the default (off).


Final Summary

Four levers tested against the API server write bottleneck. One dominated correctness. None moved throughput.

Factor Correctness Throughput Latency
GenerationChangedPredicate OFF Breaks — objects never converge, loop is infinite
r.Update() for status (vs r.Status().Update()) Breaks — status silently discarded by API server
MaxConcurrentReconciles: 5 (vs 1) ✓ No change No gain on single-node Kind 4–5× worse (write serialization)
r.Status().Patch() (vs Update + RetryOnConflict) ✓ Eliminates conflict errors No gain — ceiling is etcd serialization Slightly worse at small N
ConcurrentWatchObjectDecode + WatchListClient ✓ No change No gain Worse at small N, same at large N

For a single-node Kind cluster: the etcd write ceiling (~24 writes/sec on this hardware) is the hard floor on throughput, and none of the tested knobs move it. The correctness levers (GenerationChangedPredicate, status subresource) are non-negotiable. The performance levers range from neutral to counterproductive on this setup.

For a real production cluster: MaxConcurrentReconciles: 5 should show genuine throughput gains on a multi-node HA API server where writes don't all serialize on one etcd leader. WatchListClient is designed for large-scale thundering-herd scenarios. The benchmark methodology in this repo can be re-run against production-scale infrastructure to find where each lever actually wins.

Limitations and Future Work

The central limitation of this entire experiment set is that Kind serializes all etcd writes through a single node. Every performance experiment after Experiment 1 is ultimately measuring the same thing: how many different ways can we saturate a single-node etcd at ~24 writes/sec. More workers, Patch instead of Update, feature gates — they all hit the same wall, so the results look identical across all configurations at large N.

This is by design for a local development benchmark, but it means the performance conclusions are incomplete. The questions that remain unanswered:

  • Does MaxConcurrentReconciles: 5 actually help on a real cluster? On a 3-node HA etcd setup with multiple kube-apiserver replicas, concurrent writes can be distributed across leaders. The ~5× theoretical speedup from parallelism may actually materialize. On Kind it never can.

  • Does r.Status().Patch() reduce latency at scale on a real cluster? The merge patch overhead we saw (~120ms vs ~50ms at large N) is partly from the single-node API server being saturated by queued writes. On a cluster that isn't write-saturated, the smaller patch payload and absence of conflict retries might produce meaningful latency improvements.

  • Do the feature gates help at production scale? WatchListClient is designed for large clusters with 100k+ objects and hundreds of concurrent informers — a regime impossible to reproduce on a single-node Kind cluster. Its real benefit (preventing API server OOM during thundering-herd restarts) simply has no opportunity to manifest here.

The recommended follow-up is to re-run Experiments 2–4 against a multi-node managed Kubernetes cluster (GKE, EKS, AKS, or a self-hosted kubeadm cluster with 3+ etcd nodes and 2+ API server replicas). The benchmark scripts in this repo already support it — just point KUBECONFIG at a production cluster and run bash stress-good.sh 10000 300. The same CSV output, the same DuckDB queries, but with the artificial serialization bottleneck removed.

Running It Yourself

git clone https://github.com/pokgak/pokgak.github.io
cd pokgak.github.io/content/experiments/k8s-controller-benchmark

Prerequisites

  • kind — local Kubernetes clusters in Docker
  • kubectl
  • docker (or OrbStack)
  • go 1.22+
  • duckdb — for CSV analysis (optional)

Full setup (baseline cluster)

# 1. Create the Kind cluster, build all controller images, deploy CRDs and controllers
make setup

# 2. Run Experiment 1 — 2×2 matrix (all 4 variants: good, good-single, bad-fix, bad-fix-single)
make stress N=500        # 60s observation window
make stress N=5000

# 3. Run Experiment 2 — good variants only at larger scale
bash stress-good.sh 1000 180
bash stress-good.sh 5000 180
bash stress-good.sh 10000 180

# 4. Run Experiment 3 — Patch variants
bash stress-patch.sh 1000 180
bash stress-patch.sh 5000 180

Feature gate cluster (Experiment 4)

# Create a separate Kind cluster with ConcurrentWatchObjectDecode + WatchListClient enabled
make setup-fg

# Run the same good-only benchmark against the fg cluster
bash stress-good-fg.sh 1000 180
bash stress-good-fg.sh 5000 180

make cluster-delete-fg   # tear down when done

Analysing results

Each stress run writes per-second Prometheus metrics to metrics/<timestamp>-<variant>-N<scale>/metrics-<controller>.csv. Load into DuckDB:

-- in duckdb interactive shell
CREATE VIEW m AS SELECT * FROM read_csv_auto(['metrics/20260517T*/metrics-good*.csv']);
.read queries.sql   -- pre-written queries for queue depth, latency, retries, convergence time

Tear down

make clean           # delete baseline cluster
make cluster-delete-fg   # delete feature gate cluster (if created)