Introducing infra-bench
The open benchmark for measuring AI agents on realistic infrastructure work.
infra-bench started with an annoying gap.
Agent demos are getting good. A model can explain Kubernetes, write a Deployment, summarize an incident, and produce a confident plan in a few seconds.
That is useful. It is also not the same thing as fixing infrastructure.
The question I care about is simpler, and harder:
Can an agent take a broken system, figure out what matters, make the smallest reasonable change, and leave the platform working?
That is the work infra-bench tries to measure.
A Service can exist and still route to nothing. A Deployment can be valid YAML and still fail at runtime. A NetworkPolicy can fix one path while breaking another. A Secret can be mounted, but in the wrong shape. Infrastructure is full of these small traps. The answer is rarely just “write better YAML.”
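To make the first of those traps concrete, here is a minimal, hypothetical pair of manifests (the names and image are invented for illustration). Both apply cleanly and look healthy; the Service still routes to nothing because its selector matches no pod labels.

```yaml
# A Deployment whose pods carry the label app: web-v2.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-v2
  template:
    metadata:
      labels:
        app: web-v2
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
---
# A Service that is perfectly valid YAML and still broken:
# its selector says app: web, so it matches no pods, the
# Endpoints object stays empty, and traffic goes nowhere.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web   # mismatch: the pods are labeled app: web-v2
  ports:
    - port: 80
      targetPort: 80
```

Nothing here fails validation. The problem only shows up when you look at the Endpoints, or at the traffic itself.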
You get the point.
Measuring the Work
The long-term idea behind Kubeply is to automate the boring infrastructure work: the repetitive checks, fixes, and glue tasks that take time without needing a platform engineer’s best attention.
Time is finite. Even strong engineers can only carry so many complicated problems in a day. Automating the low-value work should give platform teams more room for the work where their craft actually matters: architecture, reliability, product velocity, and the hard tradeoffs behind good internal platforms.
Generic coding benchmarks are not enough for this. They usually reward source edits, unit tests, or algorithmic reasoning.
Infrastructure has a different texture. There is state. There are side effects. There are constraints you should not bulldoze through just to make a check pass.
So infra-bench tasks start from a broken environment and an operator-facing goal. The agent gets the problem, not the answer. The verifier decides whether the system actually works.
That verifier is the important part: it judges what the system actually does, not the vibe of the answer.
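As a sketch of what that can look like for a connectivity task (illustrative only, not the benchmark's actual verifier; the Job name, Service address, and path are assumptions), the check can be a Job that only completes when a request actually lands on a live endpoint:

```yaml
# Hypothetical verifier for a service-connectivity task:
# the task passes only if this Job succeeds, i.e. only if
# an HTTP request through the Service reaches a live pod.
apiVersion: batch/v1
kind: Job
metadata:
  name: verify-web-routing
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: check
          image: curlimages/curl:8.8.0
          # -f makes curl exit non-zero on HTTP errors, so
          # a confident explanation counts for nothing: the
          # request has to succeed for the task to pass.
          args: ["-fsS", "--max-time", "5", "http://web.default.svc/healthz"]
```

The agent's reasoning can be as polished as it likes; the task passes only when this exit code is zero.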
kubernetes-core
The first dataset is kubernetes-core: 58 tasks across easy, medium, and hard levels.
Some are static manifest repairs. Others run against live Kubernetes clusters where runtime behavior matters. The tasks are split across the 8 categories we use to describe core Kubernetes platform work:
- service connectivity: Finding why traffic does not reach the right workload, endpoint, ingress, or route.
- workload health: Repairing rollouts, probes, crash loops, and runtime conditions that keep apps unhealthy.
- configuration and secrets: Fixing config wiring, secret projection, reload behavior, and missing runtime inputs.
- access and isolation: Keeping workloads reachable only where they should be, without breaking required paths (a sketch of this trap follows the list).
- scheduling and capacity: Making workloads land on the right nodes with the right requests, labels, and constraints.
- storage: Restoring volume claims, mounts, and stateful workload identity without losing the intended state.
- migration and maintenance: Handling drains, restores, upgrades, and operational changes without leaving the platform halfway.
- platform APIs and controllers: Fixing custom resources, controller behavior, and API contracts around platform extensions.
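As one concrete illustration of the access-and-isolation trap (the policy and labels are hypothetical), a NetworkPolicy can be exactly right for the path it was written for and still break another. Once any ingress policy selects a pod, everything not explicitly allowed becomes default-deny:

```yaml
# This policy correctly allows traffic from the frontend,
# but because selecting a pod with an Ingress policy makes
# all other ingress default-deny, it also silently blocks
# every other caller, e.g. a metrics scraper that worked
# fine before the policy was applied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
```

The policy is correct in isolation; whether the platform still works is a different question.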
I do not want another YAML trivia benchmark. A good task should feel like something a platform engineer might actually see: a rollout is stuck, a worker cannot read config, an HPA cannot scale, a service stopped routing, a workload needs to move to a GPU node, or a controller upgrade changed an API contract.
The agent has to build the right mental model from partial evidence. Deleting the resource may be wrong. Broadening access may be wrong. Replacing the workload with a simpler one may be wrong. The goal is not to make the cluster look green. The goal is to solve the operator problem.
First Signals
The first kubernetes-core results are early, but they are already useful.
Current runs range from 67.2% to 84.5% pass rate on 58 tasks. The best run passes 49 of 58 tasks. That is strong enough to show real capability, and still far enough from solved to be interesting.
The difficulty split is where the benchmark starts to say something. Easy tasks are close to solved for the strongest runs. Medium tasks are mostly within reach. Hard tasks still separate models: the best current run passes 8 of 13 hard tasks, while the others land between 4 and 7.
That spread matters more than one leaderboard number. It starts to show which work is becoming routine for agents, which work is brittle, and where the reasoning still falls apart.
I am intentionally keeping the category analysis light here. The kubernetes-core results deserve their own post, with task examples, model-by-model comparisons, and failure modes.
Why This Matters
I do not think a benchmark result tells you an agent is ready to run production infrastructure alone.
That would be the wrong conclusion. What it can do is make the conversation more honest.
If an agent can reliably fix service discovery, repair config wiring, or recover a broken rollout in a controlled benchmark, that is useful evidence.
It tells us where automation can start to take work off the table.
If several strong models fail the same scheduling or policy task, that is useful too. It tells us the problem probably needs better context, better tools, or a human in the loop.
That is the layer I want Kubeply to build from: not vague confidence, not a polished demo, but evidence from broken systems, runtime checks, and reproducible tasks.
Resources
The live benchmark page shows the current runs, pass rates, difficulty breakdowns, task results, verifier logs, and agent trajectories.
The source benchmark tasks are published on GitHub:
kubeply/infra-bench