Kubernetes AI Agent Benchmark Results: What The First Runs Show

The first Kubernetes benchmark results show where AI agents are already useful on infrastructure repair tasks, and where hard operational failures still break them.

Thomas Chaigneau · June 2, 2026 · 9 min read


The first Kubernetes AI agent benchmark results from infra-bench are useful for a reason that is easy to miss.

Not because agents are ready to run production Kubernetes alone.

Because the useful signal is narrower:

AI agents are getting good at routine Kubernetes repair tasks. They still struggle when the task requires careful operational judgment, layered diagnosis, or preserving a contract that is not the loudest symptom.

That is the part worth studying.

The current kubernetes-core dataset has 58 tasks. Each task starts from a broken Kubernetes state and asks an agent to restore an operator-facing outcome. Some tasks are static manifest repairs. Others run against a live cluster with kubectl, runtime state, events, logs, rollouts, and verifiers.

The public benchmark page now shows 9 runs across those 58 tasks.

The best current run passes 49 of 58 tasks, or 84.5%. Several strong runs land at 46 of 58 tasks, or 79.3%. The lowest current runs still pass 40 of 58 tasks, or 69.0%.

Those numbers are not final. They are a snapshot of the current dataset and current runs, generated on May 5, 2026.

But even this first snapshot says something useful.
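The percentages above follow directly from the pass counts. As a minimal sketch (the helper name `pass_rate` is made up for illustration and is not part of infra-bench), the arithmetic can be reproduced like this:

```python
TOTAL_TASKS = 58  # size of the kubernetes-core dataset, as stated in this post


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100 * passed / total, 1)


# Pass counts quoted in this post: best run, several strong runs, lowest runs.
for passed in (49, 46, 40):
    print(f"{passed}/{TOTAL_TASKS} = {pass_rate(passed):.1f}%")
```

This only restates the headline numbers. It says nothing about which tasks each run passed.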

The Leaderboard Is Not The Point

It is tempting to read a benchmark as a ranking table.

That is fine for a first pass. A score is easy to understand. If one run passes 49 tasks and another passes 40, there is a difference worth noticing.

But a single number is not enough for infrastructure work.

A Kubernetes task can fail in several different ways. An agent can make the right edit to the wrong resource. It can fix the visible symptom and break a less visible boundary. It can make traffic flow by widening access too much. It can recover one pod and lose the workload identity the operator needed to preserve.

So the useful question is not only:

Which model got the highest pass rate?

The useful questions are sharper:

That is where the benchmark becomes useful.

It gives us a way to look past one leaderboard score and ask what kind of infrastructure work is becoming automatable.

Easy Tasks Are Close To Solved

The first clear signal is the easy split.

The dataset has 22 easy tasks. Across the current runs, easy-task pass rates are mostly between 95.5% and 100%.

An easy task might ask the agent to repair a bad Service name, fix a missing ConfigMap, add a narrow RBAC verb, restore a readiness probe path, or place a workload on the intended labeled node. The problem still lives in Kubernetes. The agent still has to inspect files or live state. The verifier still checks the final outcome.

The diagnosis is usually local: one broken relationship, a symptom close to the cause, and a small fix.

That is exactly the kind of work where agents should start to become useful. Not autonomous platform engineers. More like careful assistants for bounded repair work.

If an agent can repeatedly identify a selector mismatch, fix a missing env var reference, or add the least-privilege permission a Job actually needs, that matters. Those tasks consume human time today. They are not the most strategic work a platform engineer does. They are still real work.

The interesting product question is what to build around that capability.

The right shape is not always “let the agent do everything.” It can be a preflight check, a suggested patch, a runbook step with evidence attached, or automation that only runs after the verifier proves the contract still holds.

The easy tasks are where that starts to feel practical.
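One of the shapes mentioned above, automation that only runs once a verifier accepts the change, could look roughly like the sketch below. Everything here is hypothetical: `propose`, `verify`, and `apply` are placeholder callables, not infra-bench or Kubeply APIs, and a real verifier would check the candidate in a sandbox or staging cluster.

```python
from typing import Callable, TypeVar

Patch = TypeVar("Patch")


def guarded_automation(
    propose: Callable[[], Patch],
    verify: Callable[[Patch], bool],
    apply: Callable[[Patch], None],
) -> str:
    """Apply a proposed patch only if the verifier accepts it first."""
    candidate = propose()      # e.g. a patch suggested by an agent
    if verify(candidate):      # e.g. "does the contract still hold?"
        apply(candidate)
        return "applied"
    return "needs_review"      # hand the patch to a human instead
```

The design choice is that a rejected patch is never applied. It becomes a suggestion for a human to review.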

Medium Tasks Are Where The Benchmark Gets Useful

The medium split is more revealing.

The dataset has 23 medium tasks. Current medium pass rates range from 65.2% to 87.0%.

That spread matters.

Medium tasks usually require correlation. The agent may need to connect a failed rollout to a ConfigMap value, a pending pod to a request or taint, a broken route to a Service and endpoints, or an HPA failure to the metrics path it depends on.

The agent has to inspect the system and decide which signal is relevant.

Some of the hardest current medium tasks are not exotic.

stabilize-cpu-throttled-worker has 0 passes across the current 9 runs. repair-restricted-multi-container-pod also has 0. repair-worker-hpa-scaling-inputs has 1. prepare-node-drain-with-pdb has 3.
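For reference, here are those four counts expressed as per-task pass rates over the 9 runs. This is a small sketch using only the numbers quoted above; the function name `hardest_first` is made up for illustration.

```python
RUNS = 9  # number of runs on the public benchmark page, per this post

# Pass counts for four medium tasks, as quoted above.
passes_by_task = {
    "stabilize-cpu-throttled-worker": 0,
    "repair-restricted-multi-container-pod": 0,
    "repair-worker-hpa-scaling-inputs": 1,
    "prepare-node-drain-with-pdb": 3,
}


def hardest_first(passes: dict, runs: int = RUNS) -> list:
    """Return (task, passes, percent) tuples, fewest passes first."""
    ranked = sorted(passes.items(), key=lambda item: (item[1], item[0]))
    return [(task, n, round(100 * n / runs, 1)) for task, n in ranked]


for task, n, percent in hardest_first(passes_by_task):
    print(f"{task}: {n}/{RUNS} runs ({percent}%)")
```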

These are not science fiction infrastructure problems.

They are the kind of operational tasks teams run into when workloads have resource pressure, security constraints, autoscaling dependencies, or maintenance requirements.

That is why the result is useful.

Kubernetes knowledge is not enough here. The hard part is applying it under live constraints: CPU pressure, pod security context, preserved container contracts, and the metrics path an HPA needs before it can scale.

This is the gap infra-bench is trying to measure.

The benchmark is not asking whether the agent can talk about Kubernetes. It is asking whether the agent can leave the system working.

Hard Tasks Still Separate The Field

The hard split is the clearest reason not to overclaim.

The dataset has 13 hard tasks. Current hard-task pass rates range from 23.1% to 61.5%.

That is a big spread.

The strongest current runs pass 8 of 13 hard tasks. The weakest run passes 3 of 13.
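The split sizes and rate ranges quoted in the easy, medium, and hard sections can be cross-checked against each other. The sketch below recovers the whole-number pass counts implied by the rounded percentages. It assumes each percentage is a pass count divided by the split size, and the names are illustrative only.

```python
# Split sizes and reported pass-rate ranges (percent), as quoted in this post.
SPLITS = {
    "easy": (22, 95.5, 100.0),
    "medium": (23, 65.2, 87.0),
    "hard": (13, 23.1, 61.5),
}


def implied_passes(size: int, rate_percent: float) -> int:
    """Recover the whole-number pass count behind a rounded percentage."""
    return round(size * rate_percent / 100)


def implied_ranges(splits: dict = SPLITS) -> dict:
    """Map each split to its (lowest, highest) implied pass count."""
    return {
        name: (implied_passes(size, low), implied_passes(size, high))
        for name, (size, low, high) in splits.items()
    }


total = sum(size for size, _, _ in SPLITS.values())
print(f"total tasks: {total}")
for name, (low, high) in implied_ranges().items():
    print(f"{name}: {low} to {high} of {SPLITS[name][0]}")
```

The hard split comes out as 3 to 8 of 13, which matches the counts stated above, and the three split sizes add up to the 58 tasks in the dataset.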

Hard tasks may involve multiple resources, controller behavior, state preservation, policy boundaries, migration constraints, or a broken path where the first obvious fix is not safe enough.

Some current hard tasks fail (nearly) everywhere:

Those failures are more useful than they look.

They show that the problem is not always “the model needs to know more Kubernetes.” The missing piece can be better observation tools, explicit safety constraints, an earlier verifier, a human decision between risky changes, or platform tooling that makes the hidden contract visible before an agent touches anything.

That is the layer Kubeply should build from.

Not confidence.

Evidence.

Category Results Tell A Better Story

The category breakdown is also useful, with the usual caveat that the dataset is still small.

Across all current runs:

I would not turn that into a universal claim about Kubernetes yet.

There are only 58 tasks. Some categories have fewer examples than others. Task design can affect the result. A category with 3 tasks is not the same kind of evidence as a category with 11.

Still, the shape is directionally useful.

Configuration and secrets look relatively strong in the current dataset. Service connectivity is also mostly within reach, though not solved. Access and isolation is harder, especially when the right fix has to restore a path without broadening a boundary.

That matches the intuition behind real infrastructure work.

It is often easier to repair a missing key than to preserve a network boundary. It is often easier to reconnect a Service than to migrate a workload while keeping state, ownership, and runtime behavior intact.

The benchmark gives that intuition a testable shape.

What A Passing Result Actually Means

A passed task means the verifier accepted the final state.

That matters because infra-bench does not grade a nice explanation. It checks whether the operator outcome happened.

For a Service task, that may mean traffic reaches the intended workload. For a rollout task, it may mean the Deployment becomes available and the application behavior is restored. For an access task, it may mean the allowed path works and the denied path stays denied. For a storage task, it may mean the original stateful identity is preserved.
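To make the access-task case concrete, here is a minimal sketch of what such a check could look like. It is not infra-bench's actual verifier: `ProbeResult` and `verify_access_task` are hypothetical names, and the probe results are assumed to come from some earlier connectivity test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one connectivity probe (hypothetical structure)."""
    path: str               # e.g. "frontend -> api"
    expected_allowed: bool  # what the operator contract requires
    reachable: bool         # what was actually observed in the final state


def verify_access_task(probes: list) -> tuple:
    """Pass only if allowed paths work and denied paths stay denied."""
    failures = []
    for probe in probes:
        if probe.expected_allowed and not probe.reachable:
            failures.append(f"{probe.path}: allowed path is still blocked")
        elif not probe.expected_allowed and probe.reachable:
            failures.append(f"{probe.path}: denied path is now open")
    return (not failures, failures)
```

A fix that restores traffic by opening everything would make the allowed probe pass and the denied probe fail, so the task as a whole fails. That is the "widening access too much" failure mode described earlier.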

That is why I care about verifiers.

A model can sound right and still leave the system broken. It can also sound messy and still make the correct change. Infrastructure should be judged by the final state, not by the confidence of the answer.

The verifier is what keeps the benchmark honest.

It also keeps the results useful for product work.

If the task passes, we can inspect the trajectory and ask what the agent did right. If it fails, we can inspect the verifier log and ask what contract was missed. Over time, repeated failures should point to tooling-worthy pain: missing diagnostics, unsafe shortcuts, weak runbooks, poor preflight checks, or places where human review should remain mandatory.

That is the real output of the benchmark.
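The idea of repeated failures pointing at tooling gaps can be sketched as a small aggregation. The record shape below, `(run_id, task, missed_contract)`, is an assumption made for illustration and not the format of infra-bench's verifier logs.

```python
def repeated_failures(records, min_runs: int = 2) -> dict:
    """Count distinct runs per (task, missed contract); keep recurring ones.

    records: iterable of (run_id, task, missed_contract), where
    missed_contract is None when the task passed.
    """
    runs_per_pattern = {}
    for run_id, task, missed in records:
        if missed is None:
            continue  # the verifier accepted the final state
        runs_per_pattern.setdefault((task, missed), set()).add(run_id)
    return {
        pattern: len(run_ids)
        for pattern, run_ids in runs_per_pattern.items()
        if len(run_ids) >= min_runs
    }
```

A pattern that shows up in many independent runs is a candidate for a diagnostic, a preflight check, or a mandatory human review step. A one-off failure is weaker evidence.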

What This Does Not Prove

These results do not prove that an agent should have production cluster access.

They do not prove that Kubernetes work is solved.

They do not prove that a high-scoring run will behave safely in a real company’s platform, with its own constraints, history, permissions, cloud provider, deployment process, and incident pressure.

The current tasks are controlled benchmark tasks. That is the point. A benchmark has to be reproducible before it can be useful.

But controlled evidence is still evidence.

It is better than a demo where the agent is handed a clean prompt and a forgiving environment. It is better than asking whether a model is “good at DevOps.” It is better than judging infrastructure automation by vibes.

The first results show that agents are already strong enough to deserve serious evaluation on real infrastructure tasks.

They also show that the hard parts are still hard.

That is the honest reading.

The Next Useful Question

The next useful question is not “which model wins?”

It is:

Which infrastructure tasks are ready for guarded automation, which ones need better tools around the agent, and which ones still need a human operator in the loop?

That is what I want infra-bench to help answer, and what I am building toward at Kubeply.

If an agent consistently repairs a bounded class of failures, we should make that work easier to delegate safely.

If agents consistently fail the same task, we should not wave it away. We should look at the failure and ask what the platform was missing.

Maybe the missing piece is observability. Maybe it is a clearer runbook. Maybe it is a preflight check. Maybe it is a guardrail that blocks broad RBAC changes or detects when a fix breaks an unrelated route. Maybe it is a better benchmark task.

That loop is the point:

broken state, agent action, verifier result, failure pattern, better tooling.

Not a magic replacement for platform engineers.

A way to understand which pieces of infrastructure work can become safer, smaller, and less repetitive.
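One of the guardrails mentioned above, blocking broad RBAC changes, is simple enough to sketch. The example assumes a Role or ClusterRole manifest has already been parsed into a Python dict (for example by a YAML loader) and only looks for the `*` wildcard that Kubernetes RBAC rules accept in `verbs`, `resources`, and `apiGroups`. The function names are hypothetical, and a wildcard check is only one narrow definition of "broad".

```python
WILDCARD = "*"  # RBAC rules use "*" to mean "all" in these fields
CHECKED_FIELDS = ("verbs", "resources", "apiGroups")


def broad_rbac_findings(role: dict) -> list:
    """Return one finding per rule field that grants wildcard access."""
    findings = []
    for index, rule in enumerate(role.get("rules") or []):
        for field in CHECKED_FIELDS:
            if WILDCARD in (rule.get(field) or []):
                findings.append(f"rule {index}: wildcard in {field}")
    return findings


def is_change_allowed(role: dict) -> bool:
    """Allow the change only when no rule uses a wildcard."""
    return not broad_rbac_findings(role)


# A narrow Role: read-only access to Jobs in the batch API group.
narrow_role = {
    "kind": "Role",
    "rules": [
        {"apiGroups": ["batch"], "resources": ["jobs"], "verbs": ["get", "list"]}
    ],
}
print(is_change_allowed(narrow_role))
```

A guardrail like this would sit in front of the agent's change rather than replace the verifier. It rejects one class of unsafe shortcut before anything is applied.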

Resources

The live benchmark page shows the current Kubernetes task results, pass rates, difficulty breakdowns, verifier logs, and agent trajectories:

View live infra-bench results

The source benchmark tasks are published on GitHub:

Open the source repository

If you want the background before reading the results, start here:

Read the benchmark introduction

The task design post explains how broken starting states, verifiers, and Harbor-compatible tasks make the benchmark reproducible:

Read the task design post