Coding agents are software systems that read a codebase, plan a change, edit files, run commands, and ship work with minimal human keystrokes. That is the short version. The longer version, the one that actually matters in 2026, is that the model is the easy part. The interesting work has moved to the runtime, the cost model, the security boundary, and the deploy loop the agent lives inside.
Every list of “best coding agents” right now reads like a benchmark leaderboard. Useful, but it skips the part that decides whether an agent-driven workflow actually survives the second month. The best agent in the world still produces a diff that has to be built, tested, deployed, observed, and, when something breaks that the test suite missed, rolled back. Most of the production problems with coding agents are not agent problems. They are runtime problems, deploy problems, and cost problems wearing an agent mask.
This post is a working engineer’s take on what is real, what is hype, and what to set up before you let an agent touch the production repo. It assumes you already know that coding agents exist, that they can write code, and that someone in the company is asking whether to use one. The questions worth answering are harder than those.
Table of contents
- The short version
- What a coding agent actually does in production
- The three layers people conflate
- The real cost of a coding agent
- The runtime an agent actually needs
- A deploy loop that does not bite you
- Security, secrets, and the new attack surface
- The economics of running your own agent
- How to evaluate a coding agent for your team
- What I would not outsource to an agent yet
- FAQ
The short version
A coding agent is a system that combines a large language model with a planning loop, a tool interface (shell, editor, file system, web), and a feedback channel (tests, builds, deploy logs, runtime errors). The model proposes the next action, the tool runs it, the feedback returns, and the model decides what to do next. The “decide what to do next” is the part that makes it an agent and not a fancy autocomplete.
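The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `Action`, `AgentRun`, `run_agent`, and `scripted_model` are hypothetical names, and the "model" is a scripted stand-in where a real system would make an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    """One step proposed by the model: a tool name plus its argument."""
    tool: str            # e.g. "shell"; the special value "done" ends the loop
    argument: str = ""


@dataclass
class AgentRun:
    """Transcript of a run: each action paired with the feedback it produced."""
    steps: List[Tuple[Action, str]] = field(default_factory=list)
    finished: bool = False


def run_agent(
    propose: Callable[[List[Tuple[Action, str]]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> AgentRun:
    """Propose, act, observe, repeat, with a hard limit on the number of steps."""
    run = AgentRun()
    for _ in range(max_steps):
        action = propose(run.steps)            # model decides using all feedback so far
        if action.tool == "done":
            run.finished = True
            break
        tool = tools.get(action.tool)
        if tool is None:
            feedback = f"unknown tool: {action.tool}"
        else:
            feedback = tool(action.argument)   # run the tool and capture its output
        run.steps.append((action, feedback))   # feedback goes back to the model
    return run


def scripted_model(history: List[Tuple[Action, str]]) -> Action:
    """Stand-in for an LLM: run the tests, stop once the output says they passed."""
    if history and "passed" in history[-1][1]:
        return Action("done")
    return Action("shell", "pytest")


result = run_agent(scripted_model, {"shell": lambda cmd: "3 passed"})
```

The step limit is the design choice worth copying: without it, a model that never decides it is done keeps spending tokens and sandbox minutes.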
What has changed since 2024 is the runtime. Modern agents no longer sit inside a chat tab waiting for a developer to copy the diff into a pull request. They open the branch, run the tests, watch the deploy, read the log, and try again. The interesting engineering work is no longer the model. It is the platform that hosts the agent, scopes its permissions, holds its secrets, and ships its work.
What a coding agent actually does in production
A useful mental model is a junior engineer with a great memory and zero social skills. The junior reads the brief, opens the repo, makes a change, runs the test suite, reads the error, fixes the error, opens a pull request, and waits. The differences from a human junior are real but smaller than the marketing suggests:
- The agent does not get tired, but it also does not know when it is confidently wrong. It will run a test, see it pass, and not notice that the test was nonsense.
- The agent can carry far more context through a long session than a person can, but it also fills that context with notes it invented and now believes. Long sessions need explicit scratchpads and review checkpoints.
- The agent does not negotiate scope, but it will happily rewrite thirty files to fix a three-line bug. The brief is the boundary.
In a real production week, the agent is good at the mechanical layer: scaffolding, refactors, dependency bumps, docstring passes, test generation, repetitive edits across many files, and translating a written spec into a draft change. It is mediocre at the part where you tell it “make the app feel right” and watch it thrash. It is bad at deciding what should exist in the first place.
The teams that get the most out of coding agents treat the brief like a real engineering specification, the diff like a real review artifact, and the deploy like a real production event. The teams that get the least treat the agent like a vending machine.
The three layers people conflate
When someone says “we have a coding agent,” they usually mean three separate things that got mashed into one product. Conflating them is the most common reason an agent deployment disappoints.
The model layer. This is the LLM itself, the brain. Different models have different strengths: some are great at long-context refactors, some are fast and cheap for small edits, some are good at following structured briefs. The model is the part most reviews obsess over, and it is the part that changes the fastest.
The orchestration layer. This is the loop that takes a brief, breaks it into steps, runs tools, observes results, retries, asks the human for help, and writes a scratchpad when the context gets long. The orchestrator is the part that decides whether an agent produces a clean, reviewable diff or a 2,000-line “while I was in here” change.
The runtime layer. This is the part that is usually invisible until something breaks. The runtime gives the agent a sandbox, a network, a set of secrets, a deploy target, and a way to read the build logs. The runtime is the part that decides whether the agent can ship at all, and whether shipping it is safe.
A team picking a “coding agent” is really picking a model, an orchestrator, and a runtime. The model gets the headlines. The runtime decides the outcome. This is the same shape as cloud hosting: the language gets the blog posts, the bill is in the storage and the egress.
The real cost of a coding agent
The first time a team runs a coding agent in production, the bill surprises them. Not because the model is expensive in the obvious “API calls cost money” way, but because the agent’s footprint is a stack of small costs that compound across a week of real work.
The cost stack looks like this in practice.
Model tokens per task. A repository-level task can burn through a million tokens on a tricky refactor. A simple docstring pass might cost a few thousand. The bill at the end of the month is the sum of these tasks, and it scales with how aggressive the team is about giving the agent long-running work.
Sandbox compute. The agent runs commands. Those commands run in a container or a VM, and that container costs money per minute. If the agent is bad at terminating processes, the sandbox cost becomes the line item that no one budgeted.
CI minutes. Every agent-generated change runs through CI. If the agent opens thirty pull requests a day, the CI bill is at least thirty times the per-run cost, and more when each pull request takes several pushes to go green. Most teams see the CI bill jump before they see the model bill.
Storage and artifacts. The agent keeps scratchpads, conversation histories, intermediate test artifacts, and sometimes its own vector store for repository context. None of these are large on their own, but they are large in aggregate.
Human review time. This one is invisible on the platform invoice and very visible on the team’s calendar. If the agent is producing diffs that take a senior engineer forty minutes to review, the labor cost dwarfs the platform cost.
A team that wants to use coding agents seriously should put a budget in front of the experiment, not after it. Track the model cost, the sandbox cost, the CI cost, and the review time per change. If the per-change cost is in the same order of magnitude as a senior engineer doing the same change, the agent is not yet earning its place. If it is one or two orders of magnitude cheaper, the experiment is paying for itself and the next move is to widen the scope.
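The per-change comparison is simple arithmetic, and writing it down keeps the experiment honest. A minimal sketch follows. `ChangeCost` and `verdict` are hypothetical names, the dollar figures are invented for illustration, and the 10x threshold is one reading of “one or two orders of magnitude cheaper.”

```python
from dataclasses import dataclass


@dataclass
class ChangeCost:
    model_usd: float       # model tokens billed for this change
    sandbox_usd: float     # sandbox compute for this change
    ci_usd: float          # CI minutes for this change
    review_minutes: float  # human review time for this change

    def total_usd(self, reviewer_hourly_usd: float) -> float:
        """Platform costs plus review time priced at the reviewer's hourly rate."""
        review_usd = self.review_minutes * reviewer_hourly_usd / 60
        return self.model_usd + self.sandbox_usd + self.ci_usd + review_usd


def verdict(agent_total_usd: float, engineer_total_usd: float) -> str:
    """Assumed rule: the agent must be at least 10x cheaper to count as paying for itself."""
    if agent_total_usd <= 0:
        raise ValueError("agent_total_usd must be positive")
    ratio = engineer_total_usd / agent_total_usd
    return "paying for itself" if ratio >= 10 else "not yet earning its place"


# Invented example: 20 minutes of review at $120/hour dominates the platform costs.
cost = ChangeCost(model_usd=4.0, sandbox_usd=0.5, ci_usd=1.5, review_minutes=20)
agent_total = cost.total_usd(reviewer_hourly_usd=120)   # 6.0 platform + 40.0 review
engineer_total = 2 * 120                                # same change, 2 hours by hand
print(agent_total, verdict(agent_total, engineer_total))
```

In this invented example the review time is most of the total, which is the usual shape: shrinking the diff a reviewer has to read moves the ratio more than switching models does.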
For a sense of what the rest of the cloud bill looks like when the agent starts shipping real code, the RunxBuild hosting calculator is useful. The agent’s deploy target, its database, its build minutes, and its storage all show up as line items, and it is worth modeling them together rather than guessing.
The runtime an agent actually needs
This is the part of the agent conversation that has been a footnote for too long. A coding agent that runs in a chat tab is impressive for an hour. A coding agent that lives inside a runtime that gives it real tools is a different animal.
The minimum runtime an agent needs to be useful in production:
- A clean environment per task. A fresh sandbox with the right language, dependencies, and tooling, every time the agent starts a new task. State leaks between tasks are how security incidents start.
- Real tools, not toy ones. The agent should be able to run the actual test suite, the actual linter, the actual build, and the actual type checker. The feedback channel is the difference between an agent that pretends to ship code and one that actually does.
- Scoped secrets. The agent needs access to the deploy token, the test database, the staging API key. It does not need access to the production database or the customer’s data. The secrets live in the runtime, the agent gets placeholders, and the runtime injects the real values.
- A network boundary. The agent should be able to reach the registries, the package managers, the deploy target, and the internal APIs the team has decided are safe. It should not be able to reach arbitrary internet endpoints, exfiltrate code, or pull from a random S3 bucket.
- A log channel the agent can read. The build failed. The deploy rolled back. The health check timed out. The agent needs to see these messages in the same loop it uses for everything else, so it can react instead of guessing.
- A human-in-the-loop boundary. The agent opens pull requests; it does not merge them. The agent proposes which secrets a change needs; a human supplies the values. The agent picks the deploy environment from a list, not by typing it in.
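The scoped-secrets item is the one that most often gets hand-waved, so here is a minimal sketch of the placeholder idea. Everything in it is an assumption for illustration: the `{{secret:NAME}}` syntax, the function names, and the in-memory `vault` dictionary standing in for whatever secret store the platform provides.

```python
import re
from typing import Dict, Set

# Hypothetical placeholder syntax: {{secret:NAME}}
PLACEHOLDER = re.compile(r"\{\{secret:([A-Z0-9_]+)\}\}")


def inject_secrets(command: str, vault: Dict[str, str], allowed: Set[str]) -> str:
    """Swap placeholders for real values at execution time, outside the agent's view.

    Only names listed in `allowed` are resolved. Anything else raises, so an
    out-of-scope request fails loudly instead of silently succeeding.
    """
    def resolve(match):
        name = match.group(1)
        if name not in allowed:
            raise PermissionError(f"secret {name} is outside this task's scope")
        return vault[name]

    return PLACEHOLDER.sub(resolve, command)


def redact(output: str, vault: Dict[str, str]) -> str:
    """Mask secret values in tool output before it is returned to the agent."""
    for name, value in vault.items():
        output = output.replace(value, "{{secret:" + name + "}}")
    return output


vault = {"STAGING_API_KEY": "s3cr3t-value", "PROD_DB_URL": "postgres://prod"}
task_scope = {"STAGING_API_KEY"}

command = inject_secrets("deploy --key {{secret:STAGING_API_KEY}}", vault, task_scope)
safe_log = redact("using key s3cr3t-value", vault)
```

The agent only ever writes and reads the placeholder form. A request for `PROD_DB_URL` under this task scope raises `PermissionError` rather than resolving.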
This is exactly the shape of a modern deploy platform, which is why the agent and the platform are starting to be designed together instead of bolted together. A team that already runs a clean deploy pipeline with health checks, rollback, and scoped secrets is most of the way to an agent runtime. A team that has been hand-rolling deploys with bash scripts and shared credentials has a lot of foundation to lay before the agent can be trusted with real work.
The RunxBuild deploy flow and build pipeline are the kind of substrate this is built on: Git push, build, deploy, health check, rollback, with secrets and environment variables kept out of the repo and out of the agent’s reach.
A deploy loop that does not bite you
The deploy loop is where most agent experiments quietly fall apart. The agent writes a clean diff, the CI is green, the merge happens, and then the deploy either fails, succeeds but breaks a health check, or succeeds and breaks something the test suite does not cover. The agent watches the deploy logs, sees the rollback, and writes a fix. The fix goes through CI, deploys, breaks the same thing in a different way, and rolls back again.
The fix is not a better agent. The fix is a deploy loop that gives the agent a clear signal fast.
A clean loop looks like this.
The agent opens a pull request. CI runs the test suite, the linter, the type checker, and a security scan. If anything fails, the agent is told, in the same loop, and is given the failure log. The agent fixes the diff and the cycle repeats.
When the diff is clean, a human reviews it. The human merges if the diff is correct. The merge triggers a deploy to a real environment, where a health check decides whether the new version is allowed to take traffic.
If the health check fails, the deploy rolls back automatically and the failure log reaches the agent as feedback. The agent proposes a fix, the human approves, and the loop runs again. The same playbook that helps a human debug a 500 Internal Server Error on a failing service helps the agent learn from the same logs.
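The health-check-then-rollback step can be sketched as a single function. This is an illustration under stated assumptions, not a real platform API: `deploy_with_rollback` and `DeployResult` are hypothetical names, and `health_check` is any callable a team supplies, such as an HTTP probe against the new version.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DeployResult:
    live_version: str     # version serving traffic after the attempt
    rolled_back: bool
    feedback: List[str]   # log lines handed back to the agent


def deploy_with_rollback(
    current: str,
    candidate: str,
    health_check: Callable[[str], bool],
    attempts: int = 3,
) -> DeployResult:
    """Promote the candidate only if a health check passes; otherwise keep the current version."""
    log = [f"deploying {candidate}"]
    for attempt in range(1, attempts + 1):
        if health_check(candidate):
            log.append(f"health check passed on attempt {attempt}")
            return DeployResult(candidate, False, log)
        log.append(f"health check failed on attempt {attempt}")
    log.append(f"rolling back to {current}")
    return DeployResult(current, True, log)


# A candidate that never becomes healthy: traffic stays on v1 and the log says why.
result = deploy_with_rollback("v1", "v2", health_check=lambda version: False, attempts=2)
for line in result.feedback:
    print(line)
```

The `feedback` list is the part that matters for the agent: the same lines a human would read in the deploy log come back as structured input for the next attempt.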
The point of all this is that the agent should not have a separate deploy path. It should use the same platform, the same secrets, the same health checks, and the same rollback as any other deploy. A platform that makes that loop fast and observable turns the agent into a reliable teammate. A platform with a flaky deploy turns the same agent into an outage generator.
The platforms that get this right are the ones that treat the agent as a first-class deployer. The build pipeline, the environment variables, the storage, and the managed database are all reachable through the same interfaces the agent uses, with the same scoped credentials. The agent does not need a side door to ship code. It uses the front door.
Security, secrets, and the new attack surface
A coding agent that has read access to the production repo, write access to the deploy target, and shell access to a sandbox is a new kind of insider. Most teams are still working out how to think about that.
The most common incident pattern is a leaked secret. The agent sees the repository, the repository has a .env file with a real API key someone committed last year, and the agent now has the key in its working memory for the rest of the session. The fix is not smarter prompting. The fix is the same secret hygiene the team should already have: secrets in environment variables, secrets in the platform, secrets never in the repo, and a secret scanner that runs in CI before the agent has a chance to read anything.
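To make the scanner idea concrete, here is a deliberately small sketch. The three patterns are illustrative only and will miss plenty; a maintained scanner is the right tool for real use. `PATTERNS` and `scan_files` are hypothetical names, and the sample key is the placeholder value AWS uses in its own documentation.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; real scanners ship far larger, tuned rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"(?:api[_-]?key|secret|token)[A-Za-z_]*\s*[=:]\s*['\"]?[A-Za-z0-9_\-]{16,}",
        re.IGNORECASE,
    ),
}


def scan_files(files: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (path, line number, rule name) for every line that matches a rule."""
    findings = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for rule, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append((path, lineno, rule))
    return findings


repo = {
    ".env": "DEBUG=true\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n",
    "app.py": "print('hello')\n",
}
findings = scan_files(repo)
print(findings)
```

Run as a CI step that fails on any finding, a check like this flags the committed key before the agent's session starts, which is the ordering the paragraph above asks for.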
The next pattern is a command injection. The agent reads an issue ticket, the issue ticket contains a snippet of “code” that is actually a prompt injection telling the agent to curl a URL and pipe the result into a shell. The agent, being an obedient tool, runs the command. The fix is the same as it has been for any system that processes untrusted text: treat the agent’s input as untrusted, run the agent in a sandboxed environment, scope its network, and watch the egress.
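“Scope its network” usually comes down to an allowlist decision. The sketch below shows only that decision logic; real enforcement belongs in a proxy or firewall at the sandbox boundary, not in code the agent can edit. `ALLOWED_HOSTS` and `egress_allowed` are hypothetical names, and the host list is an example, not a recommendation.

```python
from urllib.parse import urlsplit

# Example allowlist: package registries the team has decided are safe.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"})


def egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS requests whose host exactly matches an allowlisted name."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Exact match on the parsed hostname, so lookalikes such as
    # "pypi.org.evil.example" or "pypi.org@evil.example" are rejected.
    return host in allowed_hosts


print(egress_allowed("https://pypi.org/simple/requests/"))
print(egress_allowed("https://pypi.org@evil.example/install.sh"))
```

Matching on the parsed hostname rather than on a substring of the URL is the point of the example: a prompt-injected `curl` to a lookalike domain fails the check even when the allowlisted name appears somewhere in the string.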
The third pattern is the one nobody talks about. The agent gets a real production credential by accident, uses it in a way that is technically valid but unintended, and ships a change that touches the production database. The fix is a credential scope so tight that the agent literally cannot do this, even if it tries. The platform should be the boundary. The agent should not hold the keys to the kingdom. It should hold the keys to the staging environment and a tightly scoped production read token.
The teams that handle this well are the ones that treat agent permissions the way they treat production access for any new hire. Scoped, time-limited, auditable, and revocable. The agent gets a credential, the credential has a name, the credential has a scope, and the credential can be rotated.
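Those four properties map onto a small data structure. A minimal sketch, assuming a hypothetical `AgentCredential` type and invented scope strings such as `staging:deploy`; a real platform would back this with its own identity and audit systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, List


@dataclass
class AgentCredential:
    name: str                      # the credential has a name
    scopes: FrozenSet[str]         # the credential has a scope
    expires_at: datetime           # time-limited
    revoked: bool = False          # revocable
    audit_log: List[str] = field(default_factory=list)  # auditable

    def authorize(self, scope: str, now: datetime) -> bool:
        """Allow only unrevoked, unexpired, in-scope requests, and record every decision."""
        allowed = (not self.revoked) and now < self.expires_at and scope in self.scopes
        decision = "allow" if allowed else "deny"
        self.audit_log.append(f"{now.isoformat()} {self.name} {scope} {decision}")
        return allowed


issued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
cred = AgentCredential(
    name="agent-staging-deploy",
    scopes=frozenset({"staging:deploy", "prod:read"}),
    expires_at=issued + timedelta(hours=1),
)
cred.authorize("staging:deploy", issued)   # in scope and unexpired
cred.authorize("prod:write", issued)       # never granted, so denied and logged
```

Denied requests are logged as well as allowed ones. The denials are often the more useful audit trail, because they show what the agent tried to do outside its brief.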
The economics of running your own agent
A pattern that has quietly become real in 2026 is teams running their own agent, on their own infrastructure, instead of subscribing to a hosted product. The reasons are not always cost. Sometimes the team has privacy requirements, sometimes they want full control of the model, sometimes they want the agent to integrate with an internal system that no vendor supports.
The economics are roughly the same shape as running a build server. The model is the variable cost. The sandbox is the fixed cost. The orchestration is the engineering cost.
If a team is running the agent on a regular compute instance, the sandbox cost is mostly the cost of the instance times the wall-clock time the agent is working. For a team that runs the agent all day, every day, on well-scoped tasks, this can be cheaper than a per-seat hosted product. For a team that runs the agent an hour a day, the hosted product is almost always cheaper.
The hidden cost is the integration. A hosted coding agent comes with integrations for the popular source hosts, the popular CI providers, the popular deploy targets, and the popular issue trackers. A self-hosted agent starts with none of those. The team that goes the self-hosted route is signing up to write or maintain a layer of glue code, and the layer of glue code is where the real engineering cost lives.
For teams that want a middle path, the services platform approach is worth looking at. The platform is the deploy target. The agent is the producer of code. The CI is the verifier. The platform handles the secrets, the env vars, the storage, the database, the build pipeline, and the rollback. The team writes the orchestration glue and picks the model. The bill is the sum of those parts, and each part is visible.
How to evaluate a coding agent for your team
The standard “best of” comparison is a benchmark leaderboard. Useful, but it does not tell you whether the agent will work for your codebase, your team, and your deploy story. A more useful evaluation has four parts.
The repository test. Give the agent a real task in a real part of your codebase. Not a toy. Not a greenfield. The same task you would give a new hire in their first week. Time the result. Read the diff. Run the test suite. If the agent needs a clean rewrite of your conventions to succeed, the agent is not yet a fit for your codebase.
The deploy test. Take the agent’s diff and deploy it through your real pipeline. Watch the build. Watch the health check. Watch the rollback. If the agent’s diff breaks your deploy in ways the test suite does not catch, the gap is in your pipeline, not the agent. The fix is a better pipeline, not a different agent.
The review test. Have a senior engineer review the diff without knowing it came from the agent. Time the review. The point is not whether the engineer notices. The point is how long the review takes. If the review takes longer than the engineer would have spent writing the change, the agent is not yet saving time.
The cost test. Run the agent for a week on real work. Track the model cost, the sandbox cost, the CI cost, and the review time. Compare the total cost per change to the cost of a senior engineer doing the same change. The agent is earning its place if the ratio is favorable and improving over time.
The right answer for most teams in 2026 is not the leaderboard winner. It is the agent that handles the repository test, fits the deploy loop, and produces a diff that does not terrify the reviewer.
What I would not outsource to an agent yet
A short, opinionated list, in case the post so far sounds too optimistic.
Authentication and authorization changes. The blast radius of a bad auth change is “everyone’s data.” Keep the agent away from the parts of the codebase that decide who can see what, and review those changes with the same paranoia you would give a junior with a copy of the production database.
Database migrations. The agent is great at writing the migration. The agent is bad at predicting what the migration does to a table with fifty million rows. Test the migration on a copy of production-scale data before the deploy, and have a rollback plan.
Anything that touches billing. The agent will not catch a mispriced SKU. The agent will not notice that the new endpoint is being called in a tight loop. The agent will happily ship a change that costs the company real money. The fix is a budget and an alert, not a smarter agent.
The “while I was in here” refactor. The agent is most dangerous when it is most helpful. A bug fix that touches thirty files is a code review problem, not a coding problem. The brief should always include “smallest possible change to make the test pass.” A change that does not respect that constraint is a change that does not get merged.
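One way to hold that line mechanically is a diff-size gate in CI. The sketch below parses the tab-separated output of `git diff --numstat` (added, deleted, path) and reports when a change exceeds a budget. The function names and the default limits of 5 files and 200 lines are assumptions to tune per repository.

```python
from typing import List, Tuple


def parse_numstat(numstat: str) -> List[Tuple[int, int, str]]:
    """Parse `git diff --numstat` output into (added, deleted, path) rows."""
    rows = []
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files are reported with "-" for both counts; treat them as 0 lines.
        rows.append((
            0 if added == "-" else int(added),
            0 if deleted == "-" else int(deleted),
            path,
        ))
    return rows


def check_diff_budget(numstat: str, max_files: int = 5, max_lines: int = 200) -> List[str]:
    """Return a list of budget violations; an empty list means the diff is within budget."""
    rows = parse_numstat(numstat)
    lines_changed = sum(added + deleted for added, deleted, _ in rows)
    problems = []
    if len(rows) > max_files:
        problems.append(f"{len(rows)} files changed, budget is {max_files}")
    if lines_changed > max_lines:
        problems.append(f"{lines_changed} lines changed, budget is {max_lines}")
    return problems


small_fix = "3\t1\tsrc/app.py\n"
sprawl = "".join(f"10\t2\tsrc/module_{i}.py\n" for i in range(30))
print(check_diff_budget(small_fix))
print(check_diff_budget(sprawl))
```

A failed budget does not have to block the merge outright. Returning the violation text to the agent as feedback gives it the chance to split the change before a human has to read it.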
The “make it feel right” task. The agent does not know what feels right. It knows what is statistically likely. The result is a UI that is technically correct and emotionally wrong. Keep that work with the humans for another year at least.
FAQ
What is a coding agent?
A coding agent is a software system that combines a large language model with a planning loop, a set of tools (shell, editor, file system, web), and a feedback channel (tests, builds, deploy logs). The model proposes the next action, the tool runs it, the feedback returns, and the model decides what to do next. Unlike a code completion feature, a coding agent operates on a whole repository, can edit multiple files, and can run real commands.
Are coding agents worth it?
For small, well-scoped changes in a clean codebase, the productivity gain is real and immediate. For large, ambiguous changes in a legacy codebase, the gain is small or negative until the team has built a runtime that includes review, testing, rollback, and scoped secrets. The honest answer is “yes, for the right task, with the right runtime.” The marketing answer is “yes, always,” and that one is worth being careful with.
How much do coding agents cost?
The cost has five parts: the model tokens per task, the sandbox compute for the agent’s environment, the CI minutes for the agent’s pull requests, the storage for scratchpads and artifacts, and the human review time. For a team running the agent all day on well-scoped work, the model and sandbox cost is often lower than a senior engineer’s fully loaded cost. For a team running the agent occasionally, a hosted per-seat product is usually cheaper than self-hosting. The right answer depends on the workload, not the vendor.
Can coding agents replace developers?
No. They can replace the part of the job that is typing code. They cannot replace the part of the job that is deciding what code to write, what the system should do, and how the next change will affect the rest of the product. The teams that get the most out of agents are the ones that point them at the typing and keep the deciding for themselves.
How do I deploy code written by a coding agent?
Through the same platform you would use for code written by a human. The agent opens the pull request, the CI verifies the change, the platform builds and deploys it, the health check decides whether the new version is allowed to take traffic, and the rollback is ready if the health check fails. The cleanest setups give the agent no deploy path of its own: the agent produces diffs, and the platform produces deploys.
What are the security risks of coding agents?
The big three are leaked secrets in the repository, prompt injections in untrusted input the agent reads, and over-scoped credentials that give the agent more access than the task requires. The fix is the same hygiene the team should already have for any system with repo access: secrets in the platform, not the repo; a secret scanner in CI; the agent running in a sandbox with scoped network access; and credentials that are scoped, time-limited, and auditable.
What is the difference between a coding agent and an AI code assistant?
An AI code assistant suggests code in the editor as the developer types. A coding agent takes a goal, plans a sequence of actions, runs tools, observes results, and iterates. The assistant is a faster pair-programming partner. The agent is closer to a junior teammate who can run commands, read logs, and open pull requests on their own.