In 2025, the Google DORA State of AI-Assisted Software Development report found that 90% of developers use AI tools at work. That number is striking — until you look at what they're doing with them.
Most developers use AI reactively. They paste an error message and get a fix. They ask for a boilerplate function. They use autocomplete when it surfaces something plausible. Then they complain that AI isn't as transformative as advertised.
The problem isn't the tools. It's the missing workflow.
Developers who've systematically integrated LLMs into every phase of development — spec writing, code generation, review, testing, debugging — report a fundamentally different experience. In 2025, engineering teams at 100% AI adoption merged 113% more PRs per engineer per week and cut median cycle time by 24%, according to Jellyfish's 2025 AI Metrics in Review. That result didn't come from a better model. It came from a more structured workflow.
This guide walks through the six steps that produce it.
Key Takeaways
- In 2025, teams at 100% AI adoption merged 113% more PRs per engineer per week and cut median PR cycle time by 24% (Jellyfish, 2025 AI Metrics in Review)
- The biggest leverage points are spec writing, AI-assisted review, and test generation — not just code generation
- Persistent context layers produce more consistent LLM output than better prompting or better models alone
What You'll Need Before Starting
Before this workflow makes sense, a few things should be in place:
- An LLM with a large context window — Claude Sonnet/Opus, GPT-4o, or Gemini. Any of these work.
- An AI coding assistant in your editor — Claude Code, Cursor, or GitHub Copilot with file-level context access.
- A git-based PR workflow — branches, pull requests, code review. The AI steps plug into this structure; they don't replace it.
- Estimated setup time: 2–4 hours upfront, then incremental per feature.
- Difficulty: Intermediate. This guide assumes you've used an AI coding tool before and want to go beyond autocomplete.
What you don't need: a specific tool stack, a large team, or enterprise budget. This workflow scales from solo developer to a 20-person engineering team without modification.
Step 1: Build a Persistent Context Layer
By the end of this step, you'll have a single file that tells your LLM everything it needs to generate code that fits your project — without you re-explaining it every session.
The biggest waste in most AI-assisted workflows isn't bad prompting — it's context reconstruction. Every new session, the model starts cold. It doesn't know your architecture, your naming conventions, or the decisions made three months ago. You re-explain, the model guesses, and the output drifts from your codebase patterns.
The fix is a persistent context file. Here's what it contains:
# Project Context
## Architecture
[Tech stack, key modules, database schema summary — 2–3 sentences]
## Conventions
[Naming rules, file structure, what NOT to do — specific, not aspirational]
## Current Sprint
[What you're building, what's explicitly out of scope]
## Constraints
[Performance requirements, security rules, deprecated patterns to avoid]
How to wire it in:
- Claude Code: save as CLAUDE.md in your project root. It loads automatically on every session.
- Cursor: save as .cursorrules.
- Any other tool: paste it as the first message in every new conversation.
The discipline that matters is keeping it current. A stale context file is worse than no context file — it confidently anchors the model to patterns that no longer exist in your codebase. Review your context layer at the start of each sprint the same way you'd review your backlog.
In 2026, the JetBrains HAX Study (n=800 developers, 151.9M logged events across two years) found that developers who reported the largest sustained productivity gains were also the most consistent about maintaining session-to-session context with their AI tools. The correlation wasn't with model choice or prompt complexity — it was with structured context.
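The "keep it current" discipline can be partly automated. The sketch below is one possible check, not a feature of any tool: it assumes the four section headings from the template above and an illustrative 14-day sprint length, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Headings taken from the template above; adjust them to match your own file.
REQUIRED_SECTIONS = (
    "## Architecture",
    "## Conventions",
    "## Current Sprint",
    "## Constraints",
)


def missing_sections(context_text: str) -> list[str]:
    """Return the required headings that do not appear on a line of their own."""
    present = {line.strip() for line in context_text.splitlines()}
    return [heading for heading in REQUIRED_SECTIONS if heading not in present]


def is_stale(last_modified: datetime, now: datetime, max_age_days: int = 14) -> bool:
    """True when the file is older than max_age_days (an assumed sprint length)."""
    return now - last_modified > timedelta(days=max_age_days)
```

In practice you would read the text with `Path("CLAUDE.md").read_text()` and take the timestamp from `Path("CLAUDE.md").stat().st_mtime`, then run both checks at sprint planning or in CI.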
Step 2: Write the Spec Before You Write the Code
In 2025, 73% of DevSecOps professionals told GitLab they'd experienced direct problems with vibe-coded output — code generated by describing an application to an LLM without a prior specification (GitLab, 2025 Global DevSecOps Survey, n=3,266, conducted by The Harris Poll). The fix isn't a more detailed prompt. It's a spec.
The spec step isn't primarily about constraining the LLM — it's a forcing function for you. Most AI-assisted bugs aren't model failures; they're prompt failures caused by engineers who didn't know precisely what they wanted before they asked. Writing a one-page spec forces that clarity before any code gets generated.
What the spec step looks like:
Before touching code, open a conversation and use this prompt pattern:
Help me write a one-page spec for this feature:
[describe the feature in plain terms]
Include:
1. What it does and what it explicitly doesn't do
2. The data involved and its shape
3. The API contract — inputs, outputs, error states
4. Three edge cases worth handling explicitly
Read the output as an engineer. Push back on anything wrong. Add constraints the model missed. This takes 20 minutes and it prevents hours of debugging generated code that solved the wrong problem.
Once reviewed, the spec becomes the anchor for every subsequent step in this workflow.
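If you want the reviewed spec in a form later steps can check mechanically, one option is a small data structure that mirrors the four items in the prompt. This is a sketch under assumptions: `FeatureSpec` and its field names are hypothetical, and the "three edge cases" threshold comes straight from the prompt pattern above.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """One-page spec, mirroring the four items in the prompt pattern."""
    does: str = ""
    does_not: str = ""
    data_shape: str = ""
    api_contract: str = ""
    edge_cases: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List what is still missing before the spec can anchor later steps."""
        problems = []
        for name in ("does", "does_not", "data_shape", "api_contract"):
            if not getattr(self, name).strip():
                problems.append(f"{name} is empty")
        if len(self.edge_cases) < 3:
            problems.append(f"need 3 edge cases, have {len(self.edge_cases)}")
        return problems


draft = FeatureSpec(
    does="Export invoices as CSV",
    data_shape="list of invoice rows",
    edge_cases=["empty invoice list"],
)
print(draft.gaps())
```

An empty `gaps()` result doesn't mean the spec is right, only that nothing is obviously missing. The engineering review described above still has to happen.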
Step 3: Generate Code in Bounded, Reviewable Units
A Microsoft Research controlled study found that developers using GitHub Copilot completed representative coding tasks 55.8% faster on average — but task success rate improved only from 70% to 78% (GitHub, Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness, 2023). Speed goes up dramatically. Correctness goes up modestly. That gap is bridged by structured generation, not better prompting.
"Bounded and reviewable" means: never ask an LLM to generate an entire module in a single prompt. Generate one function at a time, one component at a time, one migration at a time. For each unit:
- Provide the function signature or interface from your spec
- State the input/output contract explicitly
- Reference the pattern from your context layer you want it to match
- Generate, then review before moving to the next unit
Is this slower than one-shot generation? Yes, upfront. It's faster than debugging one-shot generation — and dramatically faster than debugging one-shot generation that's been used as the foundation for five more units.
The compounding review problem is real. Every unit you skip reviewing multiplies the review burden of everything built on top of it. Generate small. Review immediately. Move forward.
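The per-unit checklist above can be turned into a small prompt builder so that no unit gets generated without a signature, a contract, and a convention to match. This is a minimal sketch: `build_unit_prompt` and the example values are hypothetical, and the prompt wording is one possibility among many.

```python
def build_unit_prompt(signature: str, contract: str, convention: str) -> str:
    """Assemble a prompt for exactly one function, component, or migration."""
    parts = (("signature", signature), ("contract", contract), ("convention", convention))
    for label, value in parts:
        if not value.strip():
            # Refuse to build a prompt that leaves the model guessing.
            raise ValueError(f"{label} must not be empty")
    return "\n".join([
        "Implement only the following unit. Do not add other functions.",
        f"Signature: {signature.strip()}",
        f"Input/output contract: {contract.strip()}",
        f"Match this project convention: {convention.strip()}",
    ])


prompt = build_unit_prompt(
    signature="def export_invoices(invoices: list[dict]) -> str",
    contract="Returns CSV text with a header row; an empty list yields only the header.",
    convention="Pure functions only, no file or network I/O.",
)
print(prompt)
```

Failing loudly on an empty field is deliberate: a missing contract is cheaper to notice here than in review.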
Step 4: Run AI Code Review Before Human Review
In 2025, AI code review agent adoption jumped from 14.8% to 51.4% across enterprise engineering teams — and teams at 100% AI adoption cut median PR cycle time from 16.7 hours to 12.7 hours, a 24% reduction (Jellyfish, 2025 AI Metrics in Review, enterprise telemetry). That growth happened because AI review catches a specific class of bugs that human reviewers reliably miss: not because humans are careless, but because these bugs are tedious to look for under time pressure.
Unhandled edge cases. Inconsistent naming across the diff. Error handling that covers only the happy path. Security anti-patterns from misusing a standard library. None of these are conceptually hard to catch. They just don't get caught because they're boring, and AI review is tireless.
The structured review prompt that works:
Review this diff against:
1. The spec: [paste relevant spec section]
2. Edge cases not covered
3. Security concerns
4. Consistency with these conventions: [paste context layer]
Return a numbered list. For each issue: severity (blocking / non-blocking) and a one-line fix.
The structured output format matters. Prose review is hard to act on. A numbered list with severity labels is a checklist. Fix blocking items before you open the PR for human review. List non-blocking items in the PR description as "known trade-offs" or "planned follow-ups." This tells human reviewers exactly where to spend their attention.
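Once the review comes back as a numbered list, splitting it into "fix now" and "mention in the PR description" is easy to script. The sketch below assumes each item is a single line that starts with a number and contains the word "blocking" or "non-blocking". Models don't always follow the requested format exactly, so anything without a label goes into a separate bucket rather than being dropped.

```python
import re

# Assumed line shape: "<number>. <text containing 'blocking' or 'non-blocking'>".
ITEM = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")
SEVERITY = re.compile(r"\b(non-blocking|blocking)\b", re.IGNORECASE)


def split_by_severity(review_text: str) -> dict[str, list[str]]:
    """Group numbered review items into blocking, non-blocking, and unlabeled."""
    groups: dict[str, list[str]] = {"blocking": [], "non-blocking": [], "unlabeled": []}
    for line in review_text.splitlines():
        item = ITEM.match(line)
        if not item:
            continue  # skip any prose the model wrapped around the list
        text = item.group(2).strip()
        label = SEVERITY.search(text)
        key = label.group(1).lower() if label else "unlabeled"
        groups[key].append(text)
    return groups
```

Items under `blocking` get fixed before the PR opens, items under `non-blocking` go into the PR description, and `unlabeled` items are worth a second look by hand.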
According to Jellyfish's 2025 AI Metrics in Review, the jump in code review agent adoption from 14.8% to 51.4% happened between January and October 2025 alone. Teams that added AI review saw more consistent cycle time improvements than teams that only used AI for code generation, which suggests the review step is where workflow discipline translates most directly into delivery speed.
Step 5: Generate Tests from the Spec, Not from the Code
In January 2026, Salesforce published results from using Cursor AI for test generation across 76 legacy repositories: coverage effort dropped from 26 engineer-days per repository to just 4 — an 85% reduction — while PR velocity increased more than 30% (Cursor, Salesforce Engineering Case Study, January 2026, n=20,000+ Salesforce developers on Cursor).
The critical detail in that result: tests were generated from specifications describing expected behavior, not from existing implementations. Why does the distinction matter? Tests generated from code test what the code does — including its bugs. Tests generated from specs test what the code is supposed to do, independently of how it was implemented.
The test generation workflow:
- Take a behavior from your spec: "Given a user with role VIEWER, accessing the admin endpoint should return 403."
- Prompt: "Generate test cases for this behavior. For each case: input, expected output, and the edge condition it covers."
- Review the test list. Add missing edge cases.
- Prompt: "Implement these test cases in [your test framework]."
- Run them. Fix failures in the implementation. If a test seems wrong, fix the spec first — not the test.
That last rule is non-negotiable. If you modify a test to make it pass rather than fixing the underlying behavior, you've broken the spec's integrity. The test is the contract.
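To make the VIEWER example concrete, here is what spec-derived tests might look like. Everything except the "VIEWER gets 403" behavior is an assumption for illustration: `admin_endpoint_status` is a hypothetical stand-in for your real handler, and the ADMIN and unauthenticated cases are edge cases you would add during the review step, not behaviors stated in the spec line above.

```python
from typing import Optional

ADMIN_ROLES = {"ADMIN"}  # hypothetical: the roles allowed on the admin endpoint


def admin_endpoint_status(role: Optional[str]) -> int:
    """Stand-in for the real handler; returns only the HTTP status code."""
    if role is None:
        return 401  # assumption: unauthenticated callers get 401
    return 200 if role in ADMIN_ROLES else 403


def test_viewer_is_forbidden():
    # The behavior quoted from the spec.
    assert admin_endpoint_status("VIEWER") == 403


def test_admin_is_allowed():
    # Assumed edge case: the positive path must still work.
    assert admin_endpoint_status("ADMIN") == 200


def test_unrecognized_role_is_forbidden():
    # Assumed edge case: unknown roles default to deny.
    assert admin_endpoint_status("viewer ") == 403


def test_missing_role_is_unauthenticated():
    # Assumed edge case: no role at all is a 401, not a 403.
    assert admin_endpoint_status(None) == 401
```

The functions use plain `assert` and the `test_` prefix, so pytest will collect them as written. Note that none of the tests inspect how the handler decides, only what it returns, which is what keeps them tied to the spec rather than the implementation.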
Step 6: Debug AI Output Differently Than Your Own Code
In 2025, 66% of Stack Overflow survey respondents named AI solutions that are "almost right, but not quite" as their biggest frustration with AI tools, and 45% said debugging AI-generated code is more time-consuming (Stack Overflow 2025 Developer Survey, n=49,000+ developers). That's not a model quality problem. It's a debugging workflow problem.
When AI-generated code fails, the instinct is to re-prompt for a fix. That's usually the wrong first move. Re-generating a whole unit to patch a specific bug introduces new problems alongside the fix. It's like rebuilding a wall because of a faulty outlet.
Triage the failure first:
- Is the spec violated? The LLM misunderstood a constraint. Fix that spec section and regenerate just the affected unit.
- Is it an implementation bug? Provide the LLM with the error message, the failing test, and the specific function — not the whole file — and ask for a targeted fix.
Before re-prompting anything, ask the LLM to explain what it thinks the code does. Then compare that explanation to your expectation. Mismatches reveal the root cause faster than reading the error trace alone — and they tell you whether to fix the spec, the implementation, or your own mental model.
Treating LLM output as a junior engineer's first PR changes how you debug it. You'd explain the expected behavior to a junior engineer before asking them to fix a bug. The same approach works here and consistently finds the root cause faster than re-prompting from the error message alone.
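Sending "the specific function, not the whole file" is easy to script with Python's standard `ast` module. The sketch below is one way to do it under stated assumptions: the helper names and prompt wording are hypothetical, and it only handles Python source. The prompt asks for an explanation first, following the explain-before-fix advice above.

```python
import ast
from typing import Optional


def extract_function(source: str, name: str) -> Optional[str]:
    """Return the source text of a function called `name`, or None if absent."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # get_source_segment slices the original text using the node's positions.
            return ast.get_source_segment(source, node)
    return None


def build_fix_prompt(error: str, failing_test: str, function_source: str) -> str:
    """Ask for an explanation first, then a fix limited to one function."""
    return (
        "First explain what this function does, then propose a targeted fix.\n"
        "Change only this function.\n\n"
        f"Error:\n{error}\n\n"
        f"Failing test:\n{failing_test}\n\n"
        f"Function:\n{function_source}"
    )
```

If `extract_function` returns `None`, the name in your error trace doesn't match anything in the file, which is itself a useful signal before you re-prompt.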
Common Mistakes That Undermine the Whole Workflow
In 2025, Stack Overflow's Developer Survey found that developer trust in AI-generated code accuracy fell from 40% to 29% year over year — and positive favorability dropped from 72% to 60% (Stack Overflow 2025 Developer Survey, n=49,000+). Most of that erosion traces to predictable workflow failures, not model quality decline.
1. Skipping the spec step and going straight to code generation. This is the single largest source of "almost-right" AI output. Without a spec, the LLM optimizes for coherent-sounding code, not correct-for-your-system code. Even a one-paragraph spec meaningfully narrows the model's output distribution toward what you actually need.
2. Treating one-shot output as production-ready. Every generated unit is a first draft. Review it before using it as the foundation for the next unit. The compounding cost of skipped reviews is non-linear — it doesn't add, it multiplies.
3. Accumulating AI tools without consolidating. In 2025, GitLab found that teams using more than 5 AI tools faced roughly 7 hours per team member per week in compliance and integration overhead (GitLab, 2025 Global DevSecOps Survey, n=3,266). Pick the minimum set of tools that covers each phase of this workflow. Resist the pull to add more.
4. Letting the context layer go stale. A stale context file doesn't fail loudly. It silently anchors the model to patterns that no longer match your codebase, producing output that looks right and fits wrong. Treat context layer updates as a first-class engineering task.
Frequently Asked Questions
Which AI coding tool is best for this workflow?
Any tool with file-level context access works: Claude Code, Cursor, and GitHub Copilot each have different strengths but support the same six-step structure. In 2026, JetBrains found Copilot, Cursor, and Claude Code each at 18–29% professional at-work adoption. Start with whatever your team already uses and apply the workflow structure around it.
Does this workflow work for solo developers, or only for teams?
It scales to both. The spec step and context layer are arguably more valuable for a solo developer — with no team to catch context drift, systematic structure does that job instead. The AI review step replaces some (not all) of the value of a second pair of human eyes.
How long does the context layer setup take for an existing project?
Roughly 45–90 minutes for an existing project, less for a greenfield one. The initial investment compounds: every subsequent session benefits from not having to re-explain your architecture. Most developers who do it once don't go back to cold-session prompting.
What if AI-generated tests pass but the feature is still broken?
This almost always means tests were generated from the implementation rather than the spec. Tests that reflect what the code does — including its bugs — don't catch regressions. Revisit Step 5: generate test cases from the spec's behavioral descriptions first, then implement them separately.
How do I use this workflow with proprietary or sensitive code?
Check your tool's data handling policy for the plan you're on. Enterprise tiers of Claude Code, Copilot, and Cursor generally commit to not training on your code, but consumer plans can differ. For highly sensitive codebases, a self-hosted model (Llama 3, Code Llama, or similar via Ollama) works with the same workflow; the structure doesn't depend on a specific model or provider.
Closing
The six steps here aren't complex: persistent context, spec before code, bounded generation, AI review before human review, spec-grounded test generation, structured debugging.
What separates the teams in the Jellyfish data (the ones who more than doubled merged PRs per engineer) from everyone else isn't which tools they used. It's that AI was built into the structure of their workflow at every phase, not treated as a shortcut they reached for when stuck.
Try one step this sprint. The context layer is the highest-leverage starting point. Everything downstream gets better once the model has something stable to anchor to.