# Testing Agent Skills
This is a note — quick thoughts, possibly AI-assisted. Not a fully fleshed article.
Testing agent skills (Claude Code plugins that guide model behavior). Based on Honeycomb's agent-skills, applied to my LGTM skill.
Core question: does the skill text actually change how the model behaves? Two testing layers at different speeds and costs.
## Layer 1: Skill-Pressure Tests
Fast, cheap. Verify skill text steers model reasoning. No tools, no API calls — text in, text out.
How it works:
- Run Claude with the prompt, `--max-turns 1`, and no tools
- Run again with the skill content in the system prompt
- Check that required patterns appear (and anti-patterns don't)
```yaml
- id: aggregation-before-raw-fetch
  prompt: "Error spike in checkout service. How investigate using Loki?"
  without_skill:
    expected_patterns:
      - "(?i)\\b(grep|search|query|fetch)\\b.*\\b(log|error)\\b"
  with_skill:
    required_patterns:
      - "(?i)count_over_time|aggregat|count.*first|overview.*first"
```

The `without_skill.expected_patterns` establishes a RED baseline: it confirms the model defaults to the "wrong" behavior without guidance. If the model already does the right thing, the test isn't measuring anything.
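The check itself can be sketched in a few lines. This is a hypothetical helper, assuming the two single-turn responses have already been captured from `claude -p --max-turns 1` runs (the `check_pressure` name and the transcript strings are illustrative, not the real runner):

```python
import re

def check_pressure(scenario: dict, without_text: str, with_text: str) -> dict:
    """Verify a skill-pressure scenario against two captured responses."""
    # RED baseline: the "wrong" default behavior must appear without the skill
    baseline_red = all(
        re.search(p, without_text)
        for p in scenario["without_skill"]["expected_patterns"]
    )
    # Steering: every required pattern must appear once the skill is loaded
    steered = all(
        re.search(p, with_text)
        for p in scenario["with_skill"]["required_patterns"]
    )
    return {"baseline_red": baseline_red, "steered": steered,
            "passed": baseline_red and steered}

scenario = {
    "without_skill": {"expected_patterns": [
        r"(?i)\b(grep|search|query|fetch)\b.*\b(log|error)\b"]},
    "with_skill": {"required_patterns": [
        r"(?i)count_over_time|aggregat|count.*first|overview.*first"]},
}
result = check_pressure(
    scenario,
    without_text="I would query the logs for error messages first.",
    with_text="Start with count_over_time to get an aggregate overview.",
)
```

Note the test never calls any tools: both failure modes (baseline not RED, skill not steering) fall out of plain regex checks on text.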
Good scenario design:
- One behavioral axis per scenario
- Patterns match intent, not exact syntax: `(?i)count_over_time|aggregat` catches both the specific function and the general concept
- Anti-patterns are optional: use them only when verifying the skill suppresses a specific bad behavior
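To make the intent-vs-syntax point concrete, one alternation (taken from the scenario above) accepts several phrasings of the same behavior while still rejecting the raw-fetch default. The sample sentences are made up:

```python
import re

# Intent-level pattern: matches the specific LogQL function OR the general concept
pattern = re.compile(r"(?i)count_over_time|aggregat")

hits = [bool(pattern.search(s)) for s in (
    "Use count_over_time to bucket errors per minute",  # specific function
    "Aggregate first, then drill into raw log volume",  # general concept
    "Fetch the raw log lines immediately",              # neither
)]
```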
Running: `python tests/skill-pressure/run.py`. About 2-3 min for 8 scenarios, no external services.
## Layer 2: Scenario Tests
End-to-end, full multi-turn Claude conversations. Evaluates tool calls made.
How it works:
- Run Claude with the prompt, `--output-format stream-json`, and allowed tools including `Skill` and `Task`
- Run again without `Skill` and `Task` in the allowed tools
- Parse the NDJSON output for all tool calls and their arguments
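Extracting the tool calls is a line-by-line JSON parse. A sketch, assuming assistant events carry a `message` whose `content` list may contain `{"type": "tool_use", ...}` blocks (verify this shape against your CLI version's actual stream-json output):

```python
import json

def extract_tool_calls(ndjson_text: str) -> list[tuple[str, dict]]:
    """Collect (tool_name, arguments) pairs from stream-json output."""
    calls = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use":
                calls.append((block["name"], block.get("input", {})))
    return calls

# Two sample events standing in for real CLI output
sample = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "name": "Bash", "input": {"command": "lgtm loki"}}]}}),
])
calls = extract_tool_calls(sample)
```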
- Score both runs against rubric, compare
Scoring rubric:
| Component | Weight | What |
|---|---|---|
| Required tools | 30% | Expected tools called? |
| Required patterns | 25% | Expected strings in tool args? |
| Anti-patterns | 20% | Bad patterns absent? |
| Tool ordering | 15% | Right sequence? |
| Recommended tools | 10% | Optional bonus |
Pass criteria:
- Comparison: `delta = with - without >= -0.1` (accounts for LLM non-determinism)
- Skill-only: `score >= 0.6`
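The rubric and pass criteria together can be sketched as a small scoring function. The weights mirror the table above; the component scores fed in at the end are made-up examples:

```python
# Weights from the scoring rubric; each component score is in [0, 1]
WEIGHTS = {
    "required_tools": 0.30,
    "required_patterns": 0.25,
    "anti_patterns": 0.20,
    "tool_ordering": 0.15,
    "recommended_tools": 0.10,
}

def rubric_score(components: dict[str, float]) -> float:
    """Weighted sum; missing components score 0."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)

def passes(with_score: float, without_score: float) -> dict:
    return {
        # Tolerate a small negative delta: two runs are stochastic
        "comparison": (with_score - without_score) >= -0.1,
        "skill_only": with_score >= 0.6,
    }

with_score = rubric_score({"required_tools": 1, "required_patterns": 1,
                           "anti_patterns": 1, "tool_ordering": 0.5})
without_score = rubric_score({"required_tools": 0.5})
verdict = passes(with_score, without_score)
```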
Running: `make test-scenarios`. About 20-40 min. Raw output lands at `tests/scenarios/output/`.
## Results from LGTM Skill
### Skill-pressure: 8/8 passed
All scenarios confirmed steering. One notable finding: the percentiles-for-latency baseline was not RED; Sonnet already suggests percentiles without the skill, which makes that scenario less valuable as a regression test.
### Scenario tests: comparison tests are inherently flaky
Skill-only tests pass 100% consistently. Comparison tests randomly fail 1-2 scenarios per run:
| Scenario | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| investigate-error-spike | PASSED | PASSED | PASSED |
| service-health-check | PASSED | FAILED | PASSED |
| trace-slow-requests | PASSED | FAILED | PASSED |
| metrics-trend-with-chart | PASSED | PASSED | FAILED |
| cross-signal-investigation | PASSED | PASSED | PASSED |
| skill-only (all 5) | PASSED | PASSED | PASSED |
Why: two independent Claude conversations produce wildly different scores for reasons unrelated to the skill, so a delta between two stochastic runs isn't a reliable signal at n=1.
What would fix it: run K times and compare averages, use skill-only as primary signal, use cheaper models for more repetitions.
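The first fix is just averaging over repetitions. A sketch, with made-up per-run rubric scores standing in for real results:

```python
from statistics import mean

def averaged_delta(with_scores: list[float], without_scores: list[float]) -> float:
    """Compare averages over K runs instead of a single stochastic pair."""
    return mean(with_scores) - mean(without_scores)

# Individual pairs swing wildly; K=3 averages give a steadier comparison signal
delta = averaged_delta([0.82, 0.61, 0.78], [0.55, 0.70, 0.48])
```

Note that run 2 alone (0.61 vs 0.70) would have failed the `>= -0.1` comparison check even though the averaged delta is clearly positive.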
### Key lesson: test patterns, not tool names
Initially all 5 skill-only tests failed. Claude used `Agent` instead of `Task` for subagent calls, and sometimes ran `lgtm` via `Bash` directly. Both are valid, just different tools.
Fix: remove `Task` from `required_tools` and rely on `required_patterns` matching `lgtm loki`, `lgtm tempo`, etc. All 5 passed afterward.
Takeaway: `required_patterns` (what the agent says) beats `required_tools` (which tool wrapper it picks). Claude has multiple equivalent paths to the same action.
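In scenario-config terms, the fix might look like this (a hypothetical fragment; field names follow the rubric terminology above):

```yaml
# Before: brittle, fails when Claude picks Agent over Task or runs lgtm via Bash
# required_tools: [Task]

# After: match what the agent says, not which tool wrapper it picks
required_patterns:
  - "lgtm loki"
  - "lgtm tempo"
```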
## Design Decisions
- Deterministic evaluation, no LLM-as-judge — all scoring is regex. Reproducible, auditable. Trade-off: patterns need careful tuning.
- Skill-pressure = inner loop — minutes, catches most skill text issues
- Scenario tests = outer loop — catches integration issues, slow and expensive
- Baseline matters — RED baseline verifies model does the wrong thing without skill. Without it, passing test might just mean model already knows the answer.