AXI Principles Experiment: lgtm-cli Before & After
This is an experiment — raw data and observations, not a polished write-up.
Testing whether AXI principles applied to lgtm-cli measurably improve agent performance on observability tasks.
Companion to Agent-Friendly CLI Design.
The Question
Do AXI principles (structured envelopes, contextual hints, definitive empty states) reduce agent errors and token usage when querying Grafana LGTM backends?
Setup
Changes applied (PR #5):
- Definitive empty states — envelope includes `empty: true` and `message: "No results found"` when count is 0 (see the sketch after this list)
- Contextual disclosure — every command returns `hints[]` with concrete next-step command templates using `<placeholders>`
- Limit-reached warnings — log/trace queries detect when results hit the limit and add a hint
- Errors to stdout in envelope mode — structured errors go to stdout so agents always capture them
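Pulling those together, a definitive empty-state envelope might look roughly like this (the `empty`, `message`, `count`, and `hints` fields come from the descriptions above; the exact layout is an assumption, not copied from PR #5):

```json
{
  "count": 0,
  "empty": true,
  "message": "No results found",
  "hints": [
    "get values -> lgtm loki label-values <label>",
    "find logs -> lgtm loki query '{traceID=\"<id>\"}'"
  ]
}
```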
Method: Three investigation tasks, each run twice: raw JSON output (baseline) vs envelope+hints. Isolated Sonnet agent with identical prompts except for the --envelope flag; a tracking wrapper logged every CLI call.
Instance: Production Grafana Cloud stack with Loki, Prometheus, and Tempo.
Experiment 1: Error Log Discovery
Why this matters: Tests the most common observability workflow — find what's broken, get examples. This is where --help avoidance and hint-following should show the biggest improvement.
Hypothesis: Envelope agent will make fewer CLI calls (hints replace --help lookups) and produce fewer output tokens (structured responses vs raw JSON dumps). Expect ~20% reduction in calls.
Method: Prompt: "Find what apps are logging errors and show example error logs from the top 2 most active apps."
Results:
| Metric | Raw | Envelope | Delta |
|---|---|---|---|
| CLI calls | 20 | 15 | -25% |
| Failed calls | 0 | 0 | — |
| Output tokens | 32,316 | 15,838 | -51% |
| Duration (ms) | 39,172 | 20,872 | -47% |
| Agent tool uses | 22 | 17 | -23% |
What this tells us: Hypothesis confirmed. Envelope agent followed hints like `get values -> lgtm loki label-values <label>` instead of spending calls on --help and trial-and-error. Both agents identified the same top error-producing services. The 25% call reduction matches our prediction; the 51% token reduction exceeded it.
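For a concrete picture of the behavioral difference, the two discovery styles look roughly like this (the `app` label and service name are hypothetical, and the trailing --envelope placement is an assumption; the commands are the ones the hints reference):

```sh
# Raw agent: burns calls on discovery before the first useful query
lgtm loki --help
lgtm loki query --help
lgtm loki query '{app="checkout"}'    # trial-and-error on label names

# Envelope agent: substitutes the placeholder from the hint and moves on
# hint: get values -> lgtm loki label-values <label>
lgtm loki label-values app --envelope
lgtm loki query '{app="checkout"}' --envelope
```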
Experiment 2: Cross-Signal Correlation
Why this matters: Cross-signal correlation (traces -> logs) is the hardest observability task for agents — requires knowing the relationship between backends and the right query patterns. This tests whether cross-signal hints can teach that relationship.
Hypothesis: Envelope agent will make significantly fewer errors because cross-signal hints (tempo trace -> loki query) teach the backend relationship directly. Expect error count to drop by 50%+.
Method: Prompt: "Find the slowest traces in the last 15 minutes and check if there are corresponding error logs."
Results:
| Metric | Raw | Envelope | Delta |
|---|---|---|---|
| CLI calls | 34 | 29 | -15% |
| Failed calls | 10 | 2 | -80% |
| Output tokens | 3,741,744 | 155,024 | -96% |
| Duration (ms) | 35,478 | 35,900 | ~same |
| Agent tool uses | 35 | 30 | -14% |
What this tells us: Hypothesis confirmed — error reduction exceeded prediction (80% vs 50%). The raw agent made 10 failed calls due to CLI syntax confusion (wrong flags, wrong subcommands). The 96% token reduction is partly misleading — the raw agent happened to dump full trace payloads — but the cross-signal hint was used directly: after `tempo trace <id>`, the envelope agent saw `find logs -> lgtm loki query '{traceID="<id>"}'` and used it.
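The whole correlation then reduces to two calls. A sketch, assuming the tempo subcommand lives under `lgtm` like the loki ones, with a hypothetical trace ID:

```sh
# 1. Fetch a slow trace; the envelope attaches the cross-signal hint:
#    find logs -> lgtm loki query '{traceID="<id>"}'
lgtm tempo trace 4f8a2c71 --envelope    # hypothetical trace ID

# 2. Substitute the placeholder and jump straight to correlated logs
lgtm loki query '{traceID="4f8a2c71"}' --envelope
```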
Experiment 3: GPU Metrics Discovery
Why this matters: Tests whether hints help with domain-specific knowledge (PromQL patterns for GPU metrics). If hints can only improve CLI discoverability but not domain expertise, this task should show no improvement.
Hypothesis: Hints won't help here — both agents need to figure out PromQL patterns like {__name__=~".*DCGM.*"} which is domain knowledge, not CLI discoverability. Expect similar call counts and error rates.
Method: Prompt: "Check what Prometheus metrics are available for GPU monitoring and get current utilization values."
Results:
| Metric | Raw | Envelope | Delta |
|---|---|---|---|
| CLI calls | 11 | 13 | +18% |
| Failed calls | 1 | 3 | +2 |
| Output tokens | 131,849,807 | 13,397,202 | -90% |
| Duration (ms) | 25,505 | 9,108 | -64% |
| Agent tool uses | 13 | 16 | +23% |
What this tells us: Hypothesis confirmed — hints didn't help with PromQL patterns. The envelope agent made more calls and had more failures. The 90% token reduction is entirely because the raw agent ran `prom series` with a broad regex returning 500MB of JSON, not because of any behavioral improvement from hints.
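For contrast, the 500MB dump versus a server-side aggregation of the same signal (DCGM_FI_DEV_GPU_UTIL is the standard NVIDIA DCGM exporter utilization metric; a `prom query` subcommand alongside `prom series`/`prom labels` is an assumption):

```sh
# What the raw agent did: every matching series with full label sets (~500MB)
lgtm prom series '{__name__=~".*DCGM.*"}'

# What domain knowledge buys: one small instant vector, same answer
lgtm prom query 'avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)'
```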
Final Summary
What worked
| Finding | Evidence |
|---|---|
| Contextual hints reduce --help calls | Raw agents consistently called --help before using subcommands; envelope agents skipped this |
| Cross-signal hints are the killer feature | `tempo trace -> loki query` hint taught the backend relationship (Experiment 2: 80% error reduction) |
| Error reduction > token savings | 10 -> 2 failed calls (Experiment 2) saved more real time than the output size reduction did |
What didn't work
| Finding | Evidence |
|---|---|
| Hints can't teach domain expertise | PromQL/LogQL/TraceQL patterns need to be known upfront (Experiment 3: more errors with envelope) |
| Output truncation is the missing piece | Biggest token cost = massive API responses (500MB `prom series`). Our changes didn't address this. |
| Envelope overhead on small responses | ~30% more tokens for simple list/get commands where hints are noise |
| TOON format doesn't help for observability | Better PromQL gives 500x reduction vs TOON's 40%. Teach the agent to aggregate server-side. |
Next steps
- Content truncation — auto-truncate responses over N tokens with `"(truncated, showing 50 of 1120 results)"`. Biggest cost driver (see the sketch after this list).
- Domain-specific suggestions — after `prom labels`, suggest common PromQL for discovered metrics
- Adaptive hints — only include hints when the response is ambiguous; skip them on terminal commands
- Run at scale — these are n=1. Need 10+ iterations for statistical significance.
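A rough sketch of what a truncated envelope could look like, reusing the message format from the first bullet (the `returned` and `results` fields and the hint text are assumptions, kept consistent with the empty-state sketch above):

```json
{
  "count": 1120,
  "returned": 50,
  "message": "(truncated, showing 50 of 1120 results)",
  "hints": [
    "narrow the query or aggregate server-side to retrieve fewer results"
  ],
  "results": ["..."]
}
```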