What Breaks When You Skip the Harness

2026-06-23 #AI

The article discusses how skipping the harness (the files, rules, and tools that wrap a model in a real project) leads to repeated model errors. It argues that the model isn't the main issue; the missing harness components, such as documentation and feedback loops, are. It suggests small, actionable fixes like adding an MCP server or creating a feedback.md file. It emphasizes codifying team knowledge and workflows to improve model output quality. The key takeaway is that the harness is critical for consistent and accurate model behavior.


The agent kept running the tests, watching them go red, scrolling up to find the failure, and then running the tests again because it had already lost the output. Third time in one session. The model wasn’t the problem. The session’s memory of its own output was. I had Claude tee the test command to a log file. After that it read the log instead of re-running.

That was one fix on one project. The pattern repeats. If the model keeps producing bad code, what is actually broken?

The model isn’t the problem

Output quality is the sum of two things: the model and the surroundings.

In the teams I’ve worked with, the obsession is always the model. They upgrade. Switch tools. New IDE plugin every month. They read every benchmark. The defects keep landing in the same places.

The surroundings get almost no attention. Those surroundings are the harness: the files, rules, tools, and feedback loops that wrap the model inside a real project. A CLAUDE.md with conventions. Skills (codified workflows the agent can run by name). MCP servers. Scoping rules. A feedback.md that records what the team has corrected. Pre-commit gates.

Without those, the model is guessing.

Five things teams blame on the model, each pointing at a missing piece of the harness, each with a first move you can ship this week. Some of the bigger fixes (codifying a team’s lore, writing a useful project-specific MCP, etc.) are a quarter of work to do properly. The first move is week-sized. The first move gets you started.

APIs that don’t exist

The model writes a call to client.users.list({ since: lastSyncAt }). The argument doesn’t exist. The real API takes a Unix timestamp on a different endpoint. Code compiles. Test fails. The engineer reads the docs. The bug gets filed as “the model hallucinated.”

Except the model didn’t hallucinate from nothing. It wrote what it learned from training data and didn’t verify. Nothing in the session told it to.

The fix is grounding.
Give the model a way to read the real docs in the moment.

Two grounding moves do the work.

First: an MCP server for library docs. Context7 is one. The model fetches current docs and writes against what’s there. When it fails, it’s almost always because the rule telling the agent to fetch first wasn’t in CLAUDE.md.

Second: a project-specific MCP for internal APIs. Every team I’ve worked with has one or two services nobody got around to documenting well. An MCP that serves the OpenAPI spec, or a search over the service’s source, puts real signatures in front of the model.

A single rule in CLAUDE.md ties it together. “Fetch docs before writing library code.”

Smallest first move: add Context7. Add the rule. Try it on the next library task.

Code that doesn’t match yours

The model writes a new service. The file lands in the wrong directory. Errors throw instead of returning. The naming is camelCase in a snake_case codebase. The diff looks foreign.

The reviewer sees it and rewrites it. The next PR has the same problems.

The model has no reference for “how we do it here.” The rules live in three engineers’ heads and one out-of-date README.

Part one is a CLAUDE.md that names the project’s conventions in plain words. Where files go. How errors propagate. Naming style. Testing framework. Logger. Time library. One or two sentences each. The bare minimum.

Part two matters more: worked examples.

When we moved a service to hexagonal architecture, the agent kept writing new code on the wrong side of the ports-and-adapters boundary. The old code was still everywhere and gave it permission to keep doing what it had always done.

The model copies texture from examples better than from prose. A skill or a feature doc with one full worked example, such as a real PR, a real file, or a real test, gives it something to imitate.
Three examples beat a thousand words of rules.

The catch: if the codebase itself is inconsistent (legacy module is snake_case, new code is camelCase), the model picks up the contradiction and writes both. The worked example has to come from the side of the codebase you want the agent to copy.

Smallest first move: pick the workflow your team runs most often. Write one skill for it. Drop one full example into the file.

The same bug, in every session

You correct the agent on Monday. It used the wrong logger. You tell it to use the project’s logger. It does. Friday, new task, wrong logger again.

The agent didn’t learn. The session ended. The correction went with it. The fix is a feedback loop the agent can read. Not a chat log, but a file.

Create a feedback.md. When the team corrects the agent on something that’s a rule, not a typo, write it down in three lines. Context. Mistake. Rule going forward. Tell the agent to read it before every relevant task.

The first entry in mine was about flaky E2E tests. The team’s rule was fix them or delete them. Never live with them. The agent kept patching around them instead. After the third time I corrected it I stopped repeating myself and wrote the rule down.

Who writes to it: anyone on the team. The cost of adding an entry has to be smaller than the cost of fixing the same thing twice. Otherwise nobody writes.

The promotion rule: when the same correction shows up in feedback.md three times, lift it into CLAUDE.md or into a skill. Three is arbitrary; pick a threshold and stick with it. I chose three because of the rule of three. The point is that nothing useful stays buried in the feedback file forever. Every promoted entry is one more rule the model reads at the start of the next session.

Smallest first move: create the file. Write one entry today.
Add one line to CLAUDE.md that says read it first.

The knowledge stays in heads

A senior engineer is on a call, explaining why the payment service has its own retry logic that is different from the rest of the system. The new engineer nods. The agent in everyone’s terminal doesn’t, because the agent wasn’t on the call.

Two weeks later the agent writes a new payment integration. It calls the shared retry helper. It has no idea there’s a reason not to. The PR ships. The on-call gets paged.

This is tribal knowledge. The lore of the system. It lives in Slack scrollback, in senior engineers’ heads, in a Notion or Confluence page.

The fix is codification. Three kinds of files do the work.

A glossary captures domain vocabulary. The word “session” can mean three different things in one system: auth, user, analytics. The glossary picks one definition per term and points to it. One paragraph each.

A feature doc holds the why of a part of the system. Why payments has its own retry logic. Why we picked Postgres over MySQL. Why the queue uses Redis Streams and not Kafka. Short notes a new reader can scan in a minute.

Skills hold workflows. The steps to ship a database migration. How we add a feature flag. The onboarding checklist for a new service joining the monitoring stack.

The hard part of this fix is not the format. The hard part is getting the senior engineer to write the feature doc instead of explaining it on calls. That’s a social problem, not a documentation problem. Pair on the first one. Make the doc the deliverable.

A billing module on one of my projects had quiet invariants about how transaction state combined. The agent made an arithmetic change that looked clean in the diff. It would have produced wrong totals in production. The reason the change mattered wasn’t in the code. It was in the team’s heads.

Smallest first move: pick the one piece of lore most often re-explained to new engineers. Write it down as a feature doc this week.
Reference it in CLAUDE.md.

Reviewers as the harness

Every PR re-litigates the same things. Naming, test coverage, error handling, conventions, and so on. Same three reviewers, same six comments. PRs take two days to merge. The reviewers are tired.

The reviewers have become the harness. Humans are doing what a script should do.

The fix is gates the machine can run.

A pre-commit hook catches the same six review comments before the PR exists. Match the gate to the team’s actual repeated comments. It could be naming-convention nits that need a linter rule, missing-test-file nits that need a coverage gate, or risky-file-changed nits that need a CODEOWNERS-style script. The point is to read the team’s review history and turn the top three repeat offenders into machine checks.

A review skill or review agent reads the diff and flags the things a linter can’t see.

A code-health check (CodeScene is what I use) gives a numeric score for maintainability and blocks regressions. The score settles, in seconds, the argument over whether a diff makes the code worse.

The worst case I’ve watched was naming inconsistencies during a migration. Two valid styles in the codebase, the model picking both, sometimes in the same PR. We added sensors that flagged the old style in new files and ran them in CI. The drift stopped showing up in review.

The work reviewers should be doing (design, intent, what the PR is actually trying to do) gets crowded out by everything machines could catch and don’t.

Smallest first move: pick the review comment your team has left most often this month. Write one pre-commit hook for it. Push it in a PR this week.

Write three rules this week

The model isn’t the problem. The surroundings are the work.

Three things to do this week.

1. Open CLAUDE.md. Write three rules: the three things you’ve corrected an agent on this week. That’s the start of a harness.
2. Create feedback.md. Write one entry. Tell the agent to read it. The feedback loop now exists.
3. Pick one workflow you’ve run through an agent more than three times this month. Make it a skill. The third time you reach for it, it’s already written down.

The harness rots too. CLAUDE.md drifts as the project changes. Skills go stale. feedback.md swells to four hundred entries and turns into noise. Two skills contradict each other and the agent picks one at random. Most teams who tried a harness and quit did so at the rot stage, not the setup. Treat it like the code it wraps: audit on a schedule, delete what no longer applies, promote what has earned its place.

This isn’t the only thing that matters. A great model still beats a mediocre one on hard problems. I don’t have a clean benchmark for the bigger claim. What I have is a preference: given the choice between a great model in a bad harness and a mediocre model in a good one, I take the mediocre model every time. The model’s mistakes I can debug. The drift I cannot.

Watch for two weeks. The defects don’t vanish. They show up earlier, at pre-commit, in the feedback file, or in the review skill, instead of in CI or production. That’s the win.
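
The "promote what has earned its place" step of the audit can be partly mechanized. What follows is a minimal sketch, not the author's tooling. It assumes each feedback.md entry is three "Key: value" lines (Context, Mistake, Rule) separated by blank lines; the article only says entries are three lines, so that layout, the function names, and the sample entries are all hypothetical. It also treats two corrections as "the same" only when their rule text matches after light normalization, which is a deliberate simplification.

```python
import re
from collections import Counter
from dataclasses import dataclass

# The article's "rule of three"; the exact number is arbitrary, so pick one and keep it.
PROMOTION_THRESHOLD = 3


@dataclass(frozen=True)
class Entry:
    context: str
    mistake: str
    rule: str


def parse_feedback(text: str) -> list[Entry]:
    """Parse blank-line-separated entries made of Context/Mistake/Rule lines.

    The 'Key: value' layout is an assumption of this sketch.
    Blocks missing any of the three fields are skipped.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip().lower()] = value.strip()
        if fields.keys() >= {"context", "mistake", "rule"}:
            entries.append(Entry(fields["context"], fields["mistake"], fields["rule"]))
    return entries


def normalize(rule: str) -> str:
    """Fold case, whitespace, and trailing periods so near-identical rules match."""
    return " ".join(rule.lower().split()).rstrip(".")


def rules_to_promote(entries: list[Entry], threshold: int = PROMOTION_THRESHOLD) -> list[str]:
    """Return normalized rules that appear at least `threshold` times, sorted."""
    counts = Counter(normalize(entry.rule) for entry in entries)
    return sorted(rule for rule, count in counts.items() if count >= threshold)


# Illustrative entries only; they echo the article's logger and flaky-test anecdotes.
SAMPLE = """\
Context: new payment integration
Mistake: used print for logging
Rule: Use the project logger.

Context: sync job
Mistake: imported the stdlib logger directly
Rule: use the project logger

Context: E2E suite
Mistake: patched around a flaky test
Rule: Fix flaky tests or delete them.

Context: admin API
Mistake: wrong logger again
Rule: Use the project  logger.
"""

if __name__ == "__main__":
    for rule in rules_to_promote(parse_feedback(SAMPLE)):
        print(f"promote to CLAUDE.md: {rule}")
```

Run on a schedule, a check like this turns the promotion rule into a list someone can act on. In practice the same correction rarely gets written with identical wording, so a real audit would likely need a human pass or fuzzier matching on top of the exact-match count used here.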

