LLM agents in the shipping loop

For the last six months I've been wiring LLM agents directly into how I ship software. Not as autocomplete. Not as chatbots. As autonomous teammates that read tickets, write PRs, and ping me when something is off.

This post is what I've learned, what's actually working, and what I'd do differently.

The setup

The agents have access to:

A read/write working copy of the repo
The ticket tracker
A staging environment they can deploy to
A scoped read-only DB connection
Team chat access (read-only in most channels, write access in one dedicated agent channel)

They cannot:

Merge to main without a human approver
Touch production
Read or send DMs
Send emails

That boundary is non-negotiable. Once you give an agent write access to anything customer-facing, you've signed up to debug it at 3am.
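
To make the boundary concrete, here is a minimal sketch of the two lists above as a deny-by-default policy table. The resource names, the Access levels, and the AgentPolicy class are all illustrative assumptions, not the API of any real tool, and the write level on the ticket tracker is my guess at what commenting and routing would need.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    NONE = 0
    READ = 1
    WRITE = 2


@dataclass(frozen=True)
class AgentPolicy:
    # Keys are illustrative labels for resources, not real identifiers.
    grants: dict[str, Access]

    def allows(self, resource: str, wanted: Access) -> bool:
        # Deny by default: anything not listed gets Access.NONE.
        granted = self.grants.get(resource, Access.NONE)
        return granted.value >= wanted.value


POLICY = AgentPolicy(grants={
    "repo_working_copy": Access.WRITE,
    "ticket_tracker": Access.WRITE,      # assumed: needed to comment and route
    "staging": Access.WRITE,             # deploys count as writes here
    "scoped_db": Access.READ,
    "chat_general": Access.READ,
    "chat_agent_channel": Access.WRITE,
    # production, direct messages, and email are deliberately absent,
    # so they fall through to Access.NONE.
})


def can_merge_to_main(human_approved: bool) -> bool:
    # Merging is never the agent's call alone.
    return human_approved
```

The point of the default is that a new resource is off-limits until someone adds it on purpose: POLICY.allows("production", Access.READ) is False without anyone having to remember to forbid it.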

What works

1. Triaging issues

Inbound bug reports get a first pass from an agent. It checks the stack trace against recent commits, reproduces the bug locally if possible, and either suggests a fix as a comment or routes the ticket. It routes easily 70% of tickets correctly, and that alone is worth it.
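
The "stack trace against recent commits" step can be approximated without a model at all. This is a rough sketch, not what the agent literally runs: it assumes Python-style tracebacks, paths already relative to the repo root, and a hypothetical recent_commits mapping of commit SHA to touched files that you would build from your own VCS.

```python
import re
from collections import Counter

# Matches the 'File "path", line N' frames of a Python traceback.
FRAME_RE = re.compile(r'File "([^"]+)", line \d+')


def files_in_trace(trace: str) -> list[str]:
    """Return the file path of every frame, in order, duplicates kept."""
    return FRAME_RE.findall(trace)


def suspect_commits(
    trace: str, recent_commits: dict[str, list[str]]
) -> list[tuple[str, int]]:
    """Rank commits by how many stack frames land in files they touched."""
    frames = Counter(files_in_trace(trace))
    scores = {}
    for sha, touched in recent_commits.items():
        score = sum(frames[path] for path in set(touched))
        if score:
            scores[sha] = score
    # Highest score first; ties keep the input order.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up example data.
EXAMPLE_TRACE = '''Traceback (most recent call last):
  File "app/api.py", line 12, in handler
  File "app/billing.py", line 40, in charge
  File "app/billing.py", line 88, in _total
ValueError: bad amount'''

EXAMPLE_COMMITS = {
    "a1b2c3": ["app/billing.py"],
    "d4e5f6": ["app/api.py", "README.md"],
    "0f9e8d": ["docs/index.md"],
}

print(suspect_commits(EXAMPLE_TRACE, EXAMPLE_COMMITS))
```

A ranked shortlist like this is cheap context to hand the agent before it tries to reproduce anything.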

2. Generating tests

The agent reads a PR, infers what changed, and proposes integration tests. Most are wrong on first pass, but the structure of what to test is almost always right. I edit, I commit, I keep moving.

3. The boring stuff

Renames. Type-hint backfills. Migrating from one logger to another across 200 files. The kind of work that used to take a sprint takes a coffee.

What doesn't

Anything that requires actual product judgment.

The agent will happily write a feature that "passes the tests" and is completely wrong. Not subtly wrong. Like, building-on-the-wrong-side-of-the-road wrong. So I never let it own a feature end to end. It does steps. I do the connective tissue.

The mental model

I think of an agent like a junior contractor who reads infinitely fast and types infinitely fast but has no taste. Pair it with someone who has taste, and you get a force multiplier. Leave it alone, and you get a beautifully formatted disaster.

What I'd do differently

If I were starting today:

Start with the smallest possible scope. A single repo. A single agent. One job. Expand only when the seams stop tearing.
Log everything. Every prompt, every response, every diff. You'll need it the first time the agent does something inexplicable, which will be by Tuesday.
Build the off-switch first. A single keyboard shortcut that pauses every agent in your stack. Day one.
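
The last two items fit in one small wrapper. What follows is a sketch under assumptions, not my actual stack: the off-switch is a flag file every agent checks before each step (a keyboard shortcut would just call pause_all), the log is append-only JSONL, and the file names and the call_model parameter are placeholders for whatever you really use.

```python
import json
import time
from pathlib import Path

# Hypothetical paths; pick locations every agent process can see.
PAUSE_FLAG = Path("agents.paused")
LOG_FILE = Path("agent_log.jsonl")


class AgentsPaused(RuntimeError):
    """Raised when the global off-switch is set."""


def pause_all() -> None:
    # Bind your keyboard shortcut to this.
    PAUSE_FLAG.touch()


def resume_all() -> None:
    PAUSE_FLAG.unlink(missing_ok=True)


def log_event(agent: str, prompt: str, response: str, diff: str = "") -> None:
    # One JSON object per line, appended, never rewritten.
    record = {
        "ts": time.time(),
        "agent": agent,
        "prompt": prompt,
        "response": response,
        "diff": diff,
    }
    with LOG_FILE.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def run_step(agent: str, prompt: str, call_model) -> str:
    """Run one agent step. call_model stands in for your real model client."""
    if PAUSE_FLAG.exists():
        raise AgentsPaused(f"{agent} skipped: off-switch is set")
    response = call_model(prompt)
    log_event(agent, prompt, response)
    return response
```

Checking the flag before every step means a pause takes effect at the next step boundary, not mid-call. That's a trade-off: it's simple and hard to get wrong, but it won't interrupt a step that is already running.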

Next week: how I'm using a small local model to summarize internal chat channels into a daily brief.