Spec-driven development: writing specs that AI agents actually ship

Agents deliver sloppy code because they get sloppy briefs. A practical recipe for turning a Jira ticket into binary acceptance criteria, three-tier boundaries and self-verification.

Jakub Kontra
Developer

The moment your agent says "done" and the build breaks

A scene you know. You get a Jira ticket for a backend task — nothing complicated: add retry logic to the payment flow for when a webhook from the PSP doesn't get delivered. You toss it to the agent, let it run, and go grab coffee. When you come back, the diff looks reasonable, the tests are green, and the agent proudly reports "implementation complete, all tests passing". You push it, and CI breaks at the integration level. You open the code and find that the agent invented an endpoint /webhooks/retry that doesn't exist in your API at all, added a feature flag nobody asked for, and simply deleted the two tests that were giving it trouble, with the comment "removed obsolete test".

And this isn't a dumb-model problem. The model isn't dumb. The briefing was dumb.

The failure mode isn't hallucination — it's vagueness

Prompt engineering answers the question of how to tell the agent. Spec engineering defines what has to exist at the end — and what it gets measured against.

When you tune a prompt, you're still writing narrative instructions for a machine that fills in the rest on its own. A spec is a contract against which the result can be measured in binary terms. A Jira ticket in the style of "add retry for webhooks" is guesswork wrapped in a sentence. The agent did exactly what the vagueness of the ticket allowed: where you didn't say anything, it made things up; where you wrote "retry logic", it picked the easiest of three interpretations. The primary failure mode of current agents — Augment Code talks about this in a series of spec-first posts from late 2025 — isn't hallucination, but the degree of freedom you permitted in the briefing.

A spec is a contract, not a document

At the AI Engineer World's Fair in June 2025, Sean Grove from OpenAI said one sentence: code is 10–20% of the value a programmer delivers. The rest, 80–90%, is structured communication — the ability to write a spec that unambiguously describes what should come into being. His advice is terse: "start with the specification." The OpenAI Model Spec has clauses with unique IDs and example prompts that function as unit tests for the spec itself.

Kent Beck, on the Pragmatic Engineer podcast in 2025, adds that TDD is a "superpower" with AI agents, but warns at the same time that agents delete tests to push the suite into the green. A spec leaning on tests alone isn't enough. You need a contract on multiple layers — machine-verifiable, with binary acceptance criteria and explicit guardrails. Not a Word document you write at the start that nobody reads for two sprints.

What a ticket the agent can't screw up looks like

Take an idempotent POST endpoint for retrying a payment webhook. In narrative form the ticket would read: "Add an endpoint that takes a webhook event ID and triggers a retry, make sure it's not duplicated." The agent will pick from that whatever it wants.

Recast it into binary criteria. Instead of "works correctly" you write: POST /webhooks/{event_id}/retry returns 202 when the event exists and is in the failed state. Returns 409 when it's in the succeeded or processing state. Returns 404 when the event ID doesn't exist. Returns 429 when the same event ID arrives more than three times in 60 seconds. The idempotency key is event_id — a second call within 5 minutes returns the cached response, not a new retry. Each of those lines is a check you either pass or don't. Addy Osmani's framework from "How to write a good spec for AI agents" nails it: "Returns 401 when unauthenticated" yes, "works correctly" never.
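To see how directly those criteria translate into code, here is a minimal sketch of the same contract as a pure function over an in-memory store. Everything here is illustrative — the names handle_retry, events, calls, cache are hypothetical, and a real implementation would sit behind the actual POST /webhooks/{event_id}/retry route — but each binary criterion maps to exactly one early return.

```python
import time

RATE_LIMIT = 3          # max calls per event_id within the window
RATE_WINDOW = 60        # seconds
IDEMPOTENCY_TTL = 300   # 5 minutes

def handle_retry(event_id, events, calls, cache, now=None):
    """Return an HTTP status code for a retry request (sketch only)."""
    now = time.time() if now is None else now

    # 429: same event ID more than three times in 60 seconds
    recent = [t for t in calls.get(event_id, []) if now - t < RATE_WINDOW]
    recent.append(now)
    calls[event_id] = recent
    if len(recent) > RATE_LIMIT:
        return 429

    # Idempotency key is event_id: a second call within 5 minutes
    # returns the cached response, not a new retry
    if event_id in cache and now - cache[event_id][0] < IDEMPOTENCY_TTL:
        return cache[event_id][1]

    # 404: the event ID doesn't exist
    if event_id not in events:
        return 404

    # 409: event is in succeeded or processing state, not failed
    if events[event_id] != "failed":
        return 409

    # 202: retry accepted; record the response under the idempotency key
    cache[event_id] = (now, 202)
    return 202
```

Because each line of the spec is one branch, each line is also one assert in the integration test — which is precisely what makes the criteria binary.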

Below the acceptance criteria attach three-tier boundaries, the way Addy Osmani pushes them. The split has three layers because the agent needs to know three things: defaults, where to escalate, and where there's a hard ban. Without that third layer it'll add whatever it sees fit.

Always do: use the existing WebhookEventRepository. Log through structlog with event_id in context. The integration test belongs in tests/integration/webhooks/. And handle the idempotency key through redis-cache, which is already configured.

Ask first: any change to the webhook_events table, adding a new dependency to pyproject.toml.

Never do: don't delete existing tests. Don't create new endpoints outside /webhooks/*. Don't add feature flags. Don't introduce a new retry mechanism — tenacity is in the project.
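Put together, the three tiers fit on half a page of a SPEC.md. A sketch of how this feature's boundaries section might look, using the file layout and tool names from the paragraphs above (the heading structure is one plausible convention, not a fixed format):

```markdown
## Boundaries

### Always do
- Use the existing `WebhookEventRepository`.
- Log through structlog with `event_id` in context.
- Put the integration test in `tests/integration/webhooks/`.
- Handle the idempotency key through the configured redis-cache.

### Ask first
- Any change to the `webhook_events` table.
- Adding a new dependency to `pyproject.toml`.

### Never do
- Don't delete existing tests.
- Don't create new endpoints outside `/webhooks/*`.
- Don't add feature flags.
- Don't introduce a new retry mechanism; tenacity is in the project.
```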

On top of that, self-verification. After implementation the agent runs ruff check, mypy, and pytest tests/integration/webhooks/ -v and has to copy the terminal output into the completion report. Not "all tests passing", but copied terminal output. Osmani adds LLM-as-a-Judge for non-numeric criteria like readability and consistency with code style. That makes sense for things where you can't write an assert.
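The copy-the-output rule is mechanical enough to script. A hedged sketch of a small runner that executes the verification commands and collects their raw terminal output for the completion report; the real command list would be the ruff/mypy/pytest trio above, and the demo at the bottom substitutes a harmless python --version so the sketch runs without those tools installed:

```python
import subprocess
import sys

def run_verification(commands):
    """Run each command and return raw terminal output for the report.

    The agent pastes this verbatim into its completion report:
    copied output, not a self-assessed "all tests passing".
    """
    report = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        report.append(
            f"$ {' '.join(cmd)}\n"
            f"exit code: {result.returncode}\n"
            f"{result.stdout}{result.stderr}"
        )
    return "\n".join(report)

# In the webhook spec the list would be:
#   [["ruff", "check"], ["mypy", "."],
#    ["pytest", "tests/integration/webhooks/", "-v"]]
# Stand-in command so this sketch runs anywhere:
print(run_verification([[sys.executable, "--version"]]))
```

The exit codes are the point: a nonzero code in the pasted output contradicts any "all green" claim the agent might make on top of it.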

The last layer, which Claude Code supports natively: SPEC.md workflow. Claude Code interrogates you through AskUserQuestion, generates SPEC.md, and execution runs in a fresh session where the agent doesn't see your original narrative babbling — only the contract. The spec author and the executor are separated.

Tools that work today

If you have to pick one tool today, take GitHub Spec Kit. It came out in September 2025, has over 40,000 stars, provides the commands /specify, /plan, /tasks, and supports 30+ agents including Claude Code, Copilot and Cursor. It's the closest thing to a standard you've got right now, and it doesn't force an extra platform on you.

AWS Kiro (preview in July 2025, GA in November 2025) is an agentic IDE built on SDD, where the spec is the source of truth. If you live in AWS, it's worth a try. If not, you're adding a platform for a methodology you can handle in Claude Code with SPEC.md just as well. Cursor rules (always, agent-requested, manual) are a lightweight version of three-tier boundaries — fine for a smaller project, on a bigger one you want SPEC.md per-feature, not just a global .cursorrules.

Martin Fowler in "Exploring Gen AI: Spec-Driven Development" distinguishes three variants: spec-first (the spec comes first, the code from it), spec-anchored (the spec exists in parallel, serves as an anchor) and spec-as-source (the spec is the only authoritative artifact, the code gets regenerated). Most teams end up at spec-anchored and shape it to their own needs.

The Fowler warning nobody will tell you

Fowler's experiment from August 2025 showed that even with a detailed spec, agents "added unrequested features, changed assumptions mid-stream, reported success on a failed build". Beck adds that agents delete tests that fail.

Self-verification with copied terminal output protects you against a bogus "all tests passing". LLM-as-a-Judge covers what you can't verify with an assert. Simon Willison proposes YAML conformance suites as a machine-parsable contract between spec and agent — validator, not trust. And Willison's rule goes: don't push code you couldn't explain to someone else. When the agent delivers something you don't understand yourself, the spec was loose.
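The conformance-suite idea can be approximated in a few lines. A sketch in the spirit of Willison's proposal: in practice the suite would live in a YAML file next to the spec and be loaded with a YAML parser, but here it's an inline dict so the sketch is self-contained, and the system under test is a stand-in function rather than a real endpoint. The case schema (id, state, expect) is illustrative, not a standard.

```python
# A machine-parsable conformance suite: validator, not trust.
suite = {
    "spec": "webhook-retry",
    "cases": [
        {"id": "retry-accepted", "state": "failed",    "expect": 202},
        {"id": "retry-conflict", "state": "succeeded", "expect": 409},
        {"id": "unknown-event",  "state": None,        "expect": 404},
    ],
}

def system_under_test(state):
    """Stand-in for the real endpoint: maps event state to a status code."""
    if state is None:
        return 404
    return 202 if state == "failed" else 409

def run_suite(suite, call):
    """Check every case; return (passed, failure descriptions)."""
    failures = []
    for case in suite["cases"]:
        got = call(case["state"])
        if got != case["expect"]:
            failures.append(
                f'{case["id"]}: expected {case["expect"]}, got {got}'
            )
    return len(failures) == 0, failures

ok, failures = run_suite(suite, system_under_test)
print("PASS" if ok else "\n".join(failures))
```

The validator's output goes into the completion report the same way the pytest output does: the agent can't report green unless the suite mechanically passed.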

A hybrid approach beats both the pure ideology of vibe-coding and spec-as-source puritanism. It means spec as contract, human judgment and safeguards.

What to do Monday morning

The agent isn't a junior you do a code review on after every file. It's a vendor. It delivers exactly what's in the contract and not a comma more. If what it's giving back annoys you, stop writing better prompts and start writing contracts that can't be fulfilled any other way than correctly.

One concrete change for Monday: take the nearest ticket you plan to toss to the agent and before you send it, rewrite the acceptance criteria into binary form. Replace every "works" with a specific status code, error message or state in the DB. Add three lines of "Never do". Have the agent copy the output of pytest into the response. You'll notice it in the first diff.

SDD is leverage. It flips the burden of proof: you're no longer the one policing, the agent has to deliver. Fowler's right that it isn't free — safeguards and reviews cost you something.

If you want to roll this workflow out across a team from the ground up — CLAUDE.md, hooks, spec templates and code review procedures — I run on-site trainings on exactly this. me@jakubkontra.com