How to run an engineering team of one

A practical setup. The patterns that hold. The patterns that don't.

The setup

A solo operator can ship seven features and fifty subtasks in a single day. The catch: none of the commits will have a human author.

This is not a story about being faster. It is a different shape of work. The operator stops pulling feature branches. They stop running git checkout. What they do, full-time, is write tickets, review pull requests, and make product calls.

The team is four agents.

These are not "AI assistants" in the chatbot sense. They are named identities with their own GitHub accounts, their own SSH keys, their own private repositories of accumulated context, their own commit histories under their own names. The human directs. The agents do the work.

What follows is what makes that arrangement actually work, and what it does not solve.

The shift

The hard thing about this setup is not the wiring. The wiring is two days.

The hard thing is that the operator stops being a developer.

Every habit built over a decade — opening the IDE, branching, debugging in the editor, "let me just check this real quick" — is now wrong. The right move is almost always to write a precise ticket and let the agent take it. Value sits in the precision of the ticket and the rigor of the review, not in keystrokes.

Old habits route around the new system. The operator finds themselves opening a file "to check something," and twenty minutes later they have shipped a fix that should have been an agent's. The cost is not the twenty minutes. The cost is that the agent's playbook does not grow to handle that case the next time.

If you do this, structurally remove yourself from the keyboard. Close the IDE during agent runs. Be the PM, not the dev who occasionally wears a PM hat.

The mechanics

Four pieces. None are exotic. The combination is what matters.

1. Specialized agents with persistent identity

Don't run one general "AI assistant." Run named, role-specialized agents.

Each agent has its own private repo on GitHub holding its skill (the system prompt, ~500 lines), its journal, its accumulated heuristics, its templates and playbooks. When an agent runs, it pulls its repo first, reads its skill, and proceeds. When a session produces a useful pattern, the agent commits it.
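
A minimal sketch of that loop, assuming the state repo is checked out at ~/dev/agents/pm/state with files named skill.md, journal.md, and heuristics.md, and with run-agent standing in for whatever actually launches the session (all of these names are illustrative):

# Session start: sync state, then launch with the skill as the system prompt.
cd ~/dev/agents/pm/state
git pull --ff-only
run-agent --system skill.md --context heuristics.md   # hypothetical launcher

# Session end: persist what the session produced. The per-directory identity
# rule (next section) makes the commit author the PM agent, not the human.
git add journal.md heuristics.md
git commit -m "pm: journal entry, one new staged heuristic"
git push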

Six weeks in, the heuristics file grows enough to materially shape behavior. The PM agent spots vague tickets the way a senior PM would — because the agent has been corrected on three of them and the rule is now in its skill.

2. Cryptographic identity per agent

Each agent has its own GitHub account, its own SSH key, its own personal access token. Identity is determined by which directory the work happens in, not by global state, via a gitconfig rule that activates per-directory:

[includeIf "gitdir:~/dev/agents/<agent>/"]
    path = ~/.gitconfig.<agent>

Inside each per-agent gitconfig: the agent's name, email, SSH host alias rewrite. Combined with ~/.ssh/config host aliases pinning each agent to its own key. Net effect: every commit and push from an agent's directory is automatically attributed to that agent's GitHub identity, with git log showing Author: <Agent name>.
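
Concretely, the two halves might look like this (the noreply address and key path are illustrative):

# ~/.gitconfig.<agent>
[user]
    name = <Agent name>
    email = <agent>@users.noreply.github.com
[url "git@github-<agent>:"]
    insteadOf = git@github.com:

# ~/.ssh/config
Host github-<agent>
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_ed25519_<agent>
    IdentitiesOnly yes

The insteadOf rewrite routes every GitHub remote in that directory through the agent's host alias, so the right key gets used without touching any remote URLs.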

Why this matters: the audit trail is mechanical, not policy. Three agents pushing concurrently cannot clobber each other's identity. git blame tells you which agent made which call. When (not if) one of them ships a bug, you know which one.

3. Git as the coordination substrate

Multi-agent systems usually solve coordination via direct memory sharing or message passing. Git works better.

For each new ticket that takes more than a one-line answer, the PM agent opens a pull request in a shared coordination repo. The PR's working document grows over the ticket's life: triage section from the PM, architecture section from the architect, implementation diary from the developer, verification rounds appended by the PM. The PR's comment thread carries the back-and-forth between agents. When the work merges in the main code repo, the coordination PR closes.

Why git instead of a real bus: versioning, conflict resolution, and audit are free. PR notifications give the human a watchable feed of agent activity without subscribing to noisy ticket comments. Concurrent writes surface as merge conflicts to resolve, not as silent overwrites. And when something goes wrong six weeks later, the entire reasoning trail for that decision is in git history — queryable, attributed, indelible.
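
A skeleton of that working document, as one illustrative shape rather than a prescribed format:

TICKET-NNN: <title>

Triage (PM)
    Problem statement, routing decision, acceptance criteria.
Architecture (architect)
    Chosen approach; alternatives considered and why they lost.
Implementation diary (developer)
    Decisions made mid-flight, with reasons.
Verification (PM)
    AC walked bullet by bullet, evidence attached, verdict.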

4. The human is the PM

The human runs no daily code work. Their day is writing tickets, reviewing pull requests, and making product calls.

No code. No pulled branches. No CI fixes. The agents do those. The human's job is precision and judgment.

Patterns that hold

Tickets are the contract

The single biggest determinant of good output is the ticket. Vague tickets produce vague code. The PM agent — which triages everything before any other agent sees it — saves the human from their own bad tickets more than any other intervention. The agent rewrites them, drafts crisper acceptance criteria, asks for missing examples, occasionally bounces a ticket back with "this isn't ready."

A recurring case: an operator opens a ticket like "fix categorization detection." Useless. The PM agent correctly refuses to act and asks for example URLs that should categorize as X but currently don't. After the third correction, this becomes a hard rule in the agent's skill: any extraction-quality bug demands ground-truth example URLs in the ticket before routing. The human stops opening that kind of ticket. The agent stops having to ask.

The PM agent is the quality gate, not the developer

For every ticket that touches user-observable behavior, the PM agent verifies the output against the acceptance criteria it drafted, before the PR can move forward. It runs the app on a real device, walks the AC bullet by bullet, captures evidence, approves or rejects.

Most rejections are catches that automated tests would never find.

A redesign of one card. Fonts matched, layout matched, all unit tests green. The PM agent rejected three times in a row. On the third rejection it pasted two images side by side: the mockup and a screenshot of the implementation. The placeholder tile silhouette in the mockup had small "tab" shapes — distinctive enough to read as part of the brand. The implementation had flattened them to plain rounded rectangles. Tests don't catch shape. The architect didn't catch shape (it worked from a text spec). The developer didn't catch shape (it matched the spec, not the mockup). The PM verification, run as a discipline of placing the mockup and the implementation in the same view, did.

The lesson is structural, not stylistic: a role-separated quality gate catches what a single developer wouldn't, because the verification is run by a different role than the implementation, against the original artifact rather than its derivative.

A single human cannot do this for themselves. They cannot be both implementer and reviewer effectively in the same head. Two specialized agents make it cheap.

Agents disagree productively

Twice in six weeks an agent has rejected another agent's work on architectural grounds — saying not "this implementation has a bug" but "this design is wrong, kick it back to the architect." The escalation happens entirely between agents in PR comments and surfaces to the human as a single Jira reassignment. By the time the human looks, the architect has revised the breakdown and the developer is already implementing the new shape.

The human does nothing. Two specialized agents resolve a disagreement that, in a one-human-with-AI setup, the human would have had to mediate.

Per-agent state compounds

Each agent has a journal it appends to and a "heuristics" file where it stages observations that have not yet earned a place in its skill. After three confirmed cases, an observation earns promotion into the skill via a PR the human reviews. After 90 days unpromoted, it drops.
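
One illustrative shape for a staged entry; the fields are assumptions, not a prescribed schema:

- rule: extraction-quality bugs need ground-truth example URLs before routing
  confirmed: 3            # third confirmation makes it eligible for promotion
  first-seen: <date>      # unpromoted entries drop after 90 days
  status: staged          # promotion happens via a human-reviewed PR to the skill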

The PM agent's heuristics file currently has a rule about reporters being vague when describing classification bugs. That rule was earned over four tickets. It now fires before the human could even notice the pattern. Compounding context is the difference between an agent that's helpful for a session and a teammate that gets better.

Patterns that don't

Don't share state across agents

Early designs often consider a single shared memory store across agents. Bad idea. Each agent has different operating concerns, and reading the others' raw notes adds noise without adding signal. Keep state private. Share only promoted lessons through a coordination surface — the shared repo's lessons/ directory — and only after a quarterly retro decides what's general enough to publish.

Don't trust an agent that doesn't read the comments

The single most expensive failure mode: an agent picks up a ticket, reads only the description, derives a fix, and acts. Half an hour and 50,000 tokens later, it turns out the verified diagnosis was already in comment 4 of 7 and the chosen approach in comment 6.

The fix is structural, not motivational. Every agent's skill should have a hard rule, the first action on any ticket pickup: read every comment in chronological order, read the parent ticket and linked tickets, restate the prior decisions in the first response so the human can see comments were read. Past decisions are binding unless explicitly overruled. This rule fires before any other reasoning.
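
In the skill file, the rule can be as blunt as this (an illustrative excerpt):

HARD RULE, runs before any other reasoning on ticket pickup:
1. Read every comment, oldest to newest.
2. Read the parent ticket and every linked ticket.
3. Restate the prior decisions in your first response.
Prior decisions are binding unless explicitly overruled.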

After the rule is added, the failure mode disappears. The lesson generalizes: when an agent makes a recurring class of mistake, the fix is to bake the prevention into its skill, not to remind it more carefully each time.

Don't let agents commit as the human

A non-obvious failure: an agent commits under the human's email because the global user.name and user.email defaults were copied too aggressively. The audit trail blurs — git blame shows the human as author for code they have never seen. Worse, the SSH keys involved cannot push to the right places, so things break in confusing ways.

Per-agent identity is structural, not cosmetic. Each agent commits as itself. Anything else is a footgun.

Don't ask agents to summarize their own work

"Generate a summary at handoff" produces useless output. The summary is either a redundant restatement of the diff or a sales pitch for the agent's own output.

Replace it with a structured handoff briefing: what was learned, what was ruled out, open questions for the next agent, what to do first. Distilled state, not narrative — the receiving agent skims four bullets and moves.
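
One illustrative shape, four fields and nothing else:

HANDOFF: <ticket>
    Learned:    what is now known that was not in the ticket
    Ruled out:  approaches tried and abandoned, with the reason
    Open:       questions the next agent must answer
    First move: the single next action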

What it costs

Be honest about this part.

Setup time. Roughly two days the first time. Bot GitHub accounts, SSH keys, gitconfig with directory-based identity rules, per-agent state repositories, MCP server configurations, the coordination repo, the runbook. Not a weekend; a deliberate two days. Once.
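
The resulting layout, sketched under the assumption that each agent's state repo lives under its working directory (paths follow the includeIf rule shown earlier):

~/dev/agents/
    pm/
        state/          # private repo: skill, journal, heuristics
    architect/
    developer/
~/.gitconfig            # one includeIf block per agent
~/.gitconfig.<agent>    # identity plus host-alias rewrite
~/.ssh/config           # one host alias and one key per agent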

Time to maturity. Two days puts the infrastructure in place. The pipeline that behaves like a competent team takes months. The first weeks, every agent ships work that gets bounced back. Skills accumulate scar tissue from each correction: a missed mockup detail becomes a hard rule about visual-fidelity verification; a vague ticket becomes a hard rule about ground-truth examples; a skipped comment thread becomes a blocking prerequisite. After two months of this trial-and-error refinement, the pipeline operates at the level of a strong human team. It did not start there. The agents you stand up on day one are not the agents you have on day sixty — and the larger, slower-compounding investment is the one you make in correcting them, not the one you made in deploying them.

LLM costs. The agents run on the strongest model available. The PM and architect use extended thinking by default. A typical day of light operation runs $30–60 in API costs. A heavy day with a feature launch is $120+. Materially less than one engineer's hourly rate, but not zero.

Mental load. This surprises operators who expect to feel less tired. Eliminating IC work doesn't eliminate effort, it shifts it to PM and review work. The first three weeks are often more tiring than writing code, because writing code is auto-pilot for an experienced developer and triage is not yet. By week six it settles.

Risk surface. The agents have judgment, and sometimes their judgment is wrong. The PM agent has misclassified tickets. The architect has proposed approaches that violated patterns the codebase had converged on. The developer has implemented to spec when the spec was wrong. In each case the structure caught it: a subsequent agent rejected, the human caught it at review, or the verification surfaced the bug. But the structure only catches these failures if the human reviews every PR, every triage call where they're cited, and every architectural proposal. If review stops, course-correction stops.

If you try it

Start with a real product, not a demo. The compounding only matters if there's actual work flowing through.

Start with two agents, not five. A PM and a developer is enough to show whether the operating model fits the way you think. Architect, debugger, content can come later.

Build per-agent identity from day one. It is a multiplicative cost to add later.

Use git as the coordination plane. Anything else is harder to audit.

Eliminate yourself from the keyboard for the first week. The hardest part of this is not the wiring. It is the role-shift.