How infra-bench evaluates AI agents on Kubernetes tasks
A look at how infra-bench turns broken Kubernetes systems into reproducible AI agent benchmark tasks.
The hard part of a Kubernetes AI agent benchmark is not asking an agent to write YAML.
That is the easy version of the problem. Give a model a clean prompt, ask for a Deployment, and it will usually produce something that looks plausible. Sometimes it will even be correct.
Real infrastructure work is less generous.
A cluster is already running. Some things are healthy. One thing is broken. The symptom is incomplete. The obvious fix might be unsafe. The fastest path to a green check might delete the thing the operator needed to preserve.
That is the shape infra-bench tries to measure.
The question is not whether an agent can describe Kubernetes. The question is whether it can inspect a broken system, choose the right change, and leave the platform in the intended state.
That sounds simple until you try to turn it into a benchmark.
Start With The Operator Story
Every good task starts with one sentence:
A platform engineer needs to restore something because a real system is not behaving the way it should.
That sentence matters more than the Kubernetes primitive.
If the task starts as “fix a Service selector,” the benchmark is already drifting toward trivia. The agent has been handed the concept. It can skip the diagnosis. It only needs to produce the expected edit.
If the task starts as “checkout records are failing,” the agent has to do the work.
It has to inspect the namespace, find the workload, read the events, compare Secrets or Services or pods or endpoints, and build a model of what changed. It might discover a Secret projection problem. It might discover a routing problem. It might discover that the noisy resource is unrelated.
That is closer to the work platform engineers actually do.
A benchmark task should reveal the next inspection step, not just the final patch.
infra-bench uses Kubernetes as the first dataset because Kubernetes is full of these small, realistic edges. A valid object can still be wrong. A healthy pod can still receive no traffic. A rollout can succeed while the application behavior is broken. A policy can look careful and still block DNS.
The benchmark has to preserve that texture.
Build A Broken Starting State
The starting point is not an empty repo.
For kubernetes-core, each task includes the environment needed to reproduce the problem. Some tasks are static manifest repairs. Others start a local Kubernetes cluster where the agent has to use kubectl against live resources.
The important part is that the broken state exists before the agent begins.
The agent does not get a perfect description of the root cause. It gets a workspace, a running cluster when the task needs one, and an operator-facing goal.
For an easy task, the symptom may point fairly directly at the issue. For a medium or hard task, the prompt should be sparse. It should usually say what a user or operator would know, not what the benchmark author knows.
That is why a task can say:
Users report that checkout records are failing.
It should not say:
The billing API is projecting the previous Secret key after a credential rotation.
The second prompt may be useful documentation. It is a bad benchmark prompt. It names the subsystem, the root cause, and the shape of the fix.
infra-bench tasks also include constraints because infrastructure repair is not just “make the test pass.” The agent may need to preserve workload identity, keep selectors stable, avoid changing images, avoid broadening access, or leave unrelated services alone.
Those constraints are not decoration. They are part of the skill being measured.
Make The Agent Inspect The System
A useful Kubernetes benchmark should force observation.
If an agent can pass by applying a memorized patch, the task is too thin. The interesting behavior starts when it has to inspect live state and decide what matters.
That can mean looking at:
- Events: Pending pods, failed mounts, denied requests, scheduling failures, and rollout warnings.
- Relationships: Services to pods, Deployments to ReplicaSets, Secret keys to env vars, Roles to ServiceAccounts.
- Runtime behavior: readiness, logs, endpoints, controller status, generated resources, and whether traffic can flow.
- Boundaries: what the agent is allowed to change, what it must preserve, and which resources are unrelated.
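As one concrete case of the relationship checks in the list above, here is a minimal Python sketch of testing which pods a Service selector actually matches. It assumes the objects have already been loaded as dictionaries (for example from `kubectl get ... -o json`) and that everything lives in one namespace. The function names and sample objects are hypothetical and are not taken from infra-bench.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True when every selector key/value pair is present in labels."""
    # A Service without a selector gets no automatically managed endpoints,
    # so this sketch treats an empty selector as matching no pods.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def pods_behind_service(service: dict, pods: list) -> list:
    """List the names of pods whose labels satisfy the Service's selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels") or {})
    ]


# Hypothetical objects for illustration only.
service = {"spec": {"selector": {"app": "checkout", "tier": "api"}}}
pods = [
    {"metadata": {"name": "checkout-7d9f", "labels": {"app": "checkout", "tier": "api"}}},
    {"metadata": {"name": "checkout-old", "labels": {"app": "checkout", "tier": "web"}}},
]

print(pods_behind_service(service, pods))
```

A check like this only answers one question. If the selector turns out to match, the agent still has to look elsewhere, which is exactly the situation where a memorized patch stops helping.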
This is where Kubernetes is useful as a benchmark domain. It punishes shallow confidence.
An agent can explain why a Service needs matching labels and still fail to notice that this Service is fine. It can know that a pod needs a toleration and still move the wrong workload onto scarce capacity. It can fix the visible error and break the less visible contract the verifier cares about.
That is the point.
The task should not reward the agent for sounding right. It should reward the agent for making the system right.
Let The Verifier Judge The Outcome
The verifier is the part I care about most.
infra-bench does not grade the final answer by reading the agent’s explanation. It runs checks against the final state.
For static tasks, that may mean parsing manifests and checking Kubernetes object semantics. For live-cluster tasks, it can mean waiting for rollouts, checking endpoints, reading logs, verifying Secret projections, checking that unrelated services stayed healthy, and making sure the agent did not create shortcut resources.
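To make "waiting for rollouts" concrete, the following is a rough Python sketch of deciding whether a Deployment rollout has finished, based on the standard `spec` and `status` fields of a Deployment object. It is an approximation of the kind of check a verifier might run, not the infra-bench implementation; the function name and the sample object are assumptions.

```python
def rollout_complete(deployment: dict) -> bool:
    """Approximate a finished rollout from a Deployment's spec and status."""
    metadata = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)  # Kubernetes defaults replicas to 1
    updated = status.get("updatedReplicas", 0)

    # The controller must have observed the latest spec generation.
    if status.get("observedGeneration", 0) < metadata.get("generation", 0):
        return False
    # Every desired replica must run the updated pod template.
    if updated < desired:
        return False
    # Pods from an older ReplicaSet must be gone.
    if status.get("replicas", 0) > updated:
        return False
    # The updated replicas must be available.
    return status.get("availableReplicas", 0) >= updated


# Hypothetical Deployment, shaped like `kubectl get deployment -o json` output.
finished = {
    "metadata": {"generation": 4},
    "spec": {"replicas": 3},
    "status": {
        "observedGeneration": 4,
        "replicas": 3,
        "updatedReplicas": 3,
        "availableReplicas": 3,
    },
}

print(rollout_complete(finished))
```

In practice a verifier would poll a check like this with a timeout instead of calling it once.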
The verifier should answer one question:
Did the operator outcome happen, without cheating the shape of the task?
That last clause matters.
A weak verifier might only check that an HTTP request succeeds. A stronger verifier also checks that the original Deployment was not deleted and recreated, the Service identity stayed the same, the Secret value was not copied into a ConfigMap, unrelated workloads did not move, replica counts stayed intact, and no standalone pod was created as a bypass.
This is not about being clever with hidden tests. It is about protecting the meaning of the benchmark.
In real infrastructure work, a fix that deletes the customer’s workload and replaces it with a simpler one is usually not a fix. A fix that grants cluster-wide permissions to solve a namespace RBAC issue is usually not a fix. A fix that makes the canary pass by breaking the reporting service is not a fix.
The verifier has to encode those boundaries.
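One way to picture those boundaries is as a comparison between a snapshot taken before the agent runs and one taken after. The Python sketch below assumes a simplified, hypothetical snapshot shape (it is not the infra-bench format) and relies on two Kubernetes facts: an object's UID changes when it is deleted and recreated, and controller-managed pods carry an owner reference.

```python
def boundary_violations(before: dict, after: dict) -> list:
    """Compare two hypothetical namespace snapshots and list broken boundaries."""
    problems = []

    for name, original in before["deployments"].items():
        current = after["deployments"].get(name)
        if current is None:
            problems.append(f"deployment {name} was deleted")
            continue
        # A changed UID means the object was deleted and recreated.
        if current["uid"] != original["uid"]:
            problems.append(f"deployment {name} was recreated")
        if current["replicas"] != original["replicas"]:
            problems.append(f"deployment {name} replica count changed")

    # Pods without an owner were created by hand, not by a controller.
    for pod in after["pods"]:
        if not pod.get("owner") and pod["name"] not in before["standalone_pods"]:
            problems.append(f"standalone pod {pod['name']} was created")

    # New cluster-wide bindings suggest permissions were broadened.
    added = set(after["cluster_role_bindings"]) - set(before["cluster_role_bindings"])
    for binding in sorted(added):
        problems.append(f"cluster role binding {binding} was added")

    return problems


# Made-up snapshots for illustration only.
before = {
    "deployments": {"billing-api": {"uid": "a1", "replicas": 3}},
    "standalone_pods": [],
    "cluster_role_bindings": ["existing-binding"],
}
after = {
    "deployments": {"billing-api": {"uid": "b2", "replicas": 3}},
    "pods": [
        {"name": "billing-api-x", "owner": "ReplicaSet/billing-api-5c"},
        {"name": "debug-shell", "owner": None},
    ],
    "cluster_role_bindings": ["existing-binding", "agent-admin"],
}

for problem in boundary_violations(before, after):
    print(problem)
```

An empty list here would not prove the fix is good. It only rules out a few known shortcuts, which is why checks like these sit alongside the outcome check and do not replace it.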
Keep The Task Reproducible
infra-bench tasks are built to run through Harbor, which gives the benchmark a consistent task shape:
- instruction.md for the agent-facing task.
- task.toml for metadata and runtime configuration.
- environment/ for the starting state.
- solution/solve.sh for the oracle solution.
- tests/test.sh for verification.
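A small Python sketch can check that a task directory follows this layout. Only the five entry names come from the layout described here; the function name and the temporary directory used in the demonstration are hypothetical, and this is not a Harbor or infra-bench tool.

```python
import tempfile
from pathlib import Path

# Entries named in the task layout; a trailing slash marks a directory.
REQUIRED_ENTRIES = [
    "instruction.md",
    "task.toml",
    "environment/",
    "solution/solve.sh",
    "tests/test.sh",
]


def missing_entries(task_dir: Path) -> list:
    """Return the required entries that are absent from task_dir."""
    missing = []
    for entry in REQUIRED_ENTRIES:
        path = task_dir / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


# Demonstration against a throwaway, deliberately incomplete task directory.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp)
    (task / "environment").mkdir()
    (task / "instruction.md").write_text("Users report that checkout records are failing.\n")
    print(missing_entries(task))
```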
That structure matters because benchmark results are only useful if people can inspect and rerun the work.
The current Kubernetes dataset includes 58 tasks across easy, medium, and hard difficulty levels. The point of the difficulty split is not file count. It is operator complexity.
An easy task might have one broken relationship and direct symptoms. A medium task may require correlating a workload with a Service, Secret, Role, NetworkPolicy, or node placement rule. A hard task may involve layered failures, scarce capacity, migration constraints, or several controllers interacting at once.
That gives the benchmark more resolution than one leaderboard score.
A model that passes the easy tasks but struggles on medium scheduling or policy tasks is telling us something. A model that handles static manifest fixes but falls down on live controller behavior is telling us something else.
That is where the benchmark becomes useful.
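The per-slice view described above can be computed with a simple grouping step. This is a minimal Python sketch with made-up results; the record fields (`difficulty`, `mode`, `passed`) are assumptions for illustration and do not reflect the infra-bench result format or any real scores.

```python
from collections import defaultdict


def pass_rate_by_group(results: list, key: str) -> dict:
    """Group task results by a field and return the pass rate of each group."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for result in results:
        group = result[key]
        totals[group] += 1
        if result["passed"]:
            passed[group] += 1
    return {group: passed[group] / totals[group] for group in totals}


# Made-up results for illustration; these are not infra-bench data.
results = [
    {"task": "t1", "difficulty": "easy", "mode": "static", "passed": True},
    {"task": "t2", "difficulty": "easy", "mode": "live", "passed": True},
    {"task": "t3", "difficulty": "medium", "mode": "live", "passed": False},
    {"task": "t4", "difficulty": "medium", "mode": "static", "passed": True},
    {"task": "t5", "difficulty": "hard", "mode": "live", "passed": False},
]

print(pass_rate_by_group(results, "difficulty"))
print(pass_rate_by_group(results, "mode"))
```

Slicing the same results by more than one field is what separates "struggles on harder tasks" from "struggles on live clusters", two conclusions a single aggregate number would blur together.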
What A Score Means
I do not want infra-bench to imply that an agent with a high pass rate should be handed production access.
That would be the wrong conclusion.
A benchmark run is evidence from a controlled environment. It can show that an agent handled a class of tasks under a specific harness, with specific tools, against known starting states and verifiers.
That evidence is still valuable.
It can show which operational tasks are becoming automatable. It can show where agents need better context or safer tools. It can show which failures are easy for language models to explain but still hard for them to repair.
Most importantly, it can make the conversation less vague.
Instead of asking whether an agent is “good at Kubernetes,” we can ask sharper questions:
- Can it recover a broken rollout without replacing the workload?
- Can it repair Secret wiring without leaking the secret into plain text?
- Can it place a canary on specialized capacity without moving unrelated services?
- Can it preserve network boundaries while restoring the required path?
- Can it find the relevant signal in a namespace with plausible distractors?
Those questions are closer to the work.
Why This Shape Matters
The long-term goal is not to build a benchmark for its own sake.
The goal is to understand which parts of infrastructure work can become checks, reports, runbooks, guardrails, remediation scripts, or safer automation. A benchmark gives that work a sharper feedback loop.
If agents consistently fail a task because the required signal is buried in events, maybe the tooling should surface that signal better. If they over-broaden permissions to pass an RBAC task, maybe the guardrail should make the safe path easier. If several models break the same hidden contract, maybe that contract deserves an explicit preflight check in real platforms.
That is the useful layer.
infra-bench is a way to turn broken systems into evidence. Not a demo. Not a vibe check.
A reproducible loop: broken state, agent action, verifier result, task artifact, and a record people can inspect.
That is the kind of Kubernetes AI agent benchmark I want to build from.
Resources
The live benchmark page shows the current Kubernetes task results, verifier logs, and agent trajectories:
The first post explains why I started the benchmark and what the first signals show:
The source benchmark tasks are published on GitHub:
kubeply/infra-bench