AI Engineering

The Modern AI-Assisted Dev Workflow: How to Use LLMs for Coding, Review, and Testing

May 26, 2026·11 min read

Developer ToolsLLMs

In 2025, the Google DORA State of AI-Assisted Software Development report found that 90% of developers use AI tools at work. That number is striking — until you look at what they're doing with them.

Most developers use AI reactively. They paste an error message and get a fix. They ask for a boilerplate function. They use autocomplete when it surfaces something plausible. Then they complain that AI isn't as transformative as advertised.

The problem isn't the tools. There's no workflow.

Developers who've systematically integrated LLMs into every phase of development — spec writing, code generation, review, testing, debugging — report a fundamentally different experience. In 2025, engineering teams at 100% AI adoption merged 113% more PRs per engineer per week and cut median cycle time by 24%, according to Jellyfish's 2025 AI Metrics in Review. That result didn't come from a better model. It came from a more structured workflow.

This guide walks through the six steps that produce it.

Key Takeaways

In 2025, teams at 100% AI adoption merged 113% more PRs per engineer per week and cut median PR cycle time by 24% (Jellyfish, 2025 AI Metrics in Review)

The biggest leverage points are spec writing, AI-assisted review, and test generation — not just code generation

Persistent context layers produce more consistent LLM output than better prompting or better models alone

What You'll Need Before Starting

Before this workflow makes sense, a few things should be in place:

An LLM with a large context window — Claude Sonnet/Opus, GPT-4o, or Gemini. Any of these work.
An AI coding assistant in your editor — Claude Code, Cursor, or GitHub Copilot with file-level context access.
A git-based PR workflow — branches, pull requests, code review. The AI steps plug into this structure; they don't replace it.
Estimated setup time: 2–4 hours upfront, then incremental per feature.
Difficulty: Intermediate. This guide assumes you've used an AI coding tool before and want to go beyond autocomplete.

What you don't need: a specific tool stack, a large team, or enterprise budget. This workflow scales from solo developer to a 20-person engineering team without modification.

A developer's workspace viewed from above showing multiple monitors with code, representing a structured AI-assisted software development environment

Step 1: Build a Persistent Context Layer

By the end of this step, you'll have a single file that tells your LLM everything it needs to generate code that fits your project — without you re-explaining it every session.

The biggest waste in most AI-assisted workflows isn't bad prompting — it's context reconstruction. Every new session, the model starts cold. It doesn't know your architecture, your naming conventions, or the decisions made three months ago. You re-explain, the model guesses, and the output drifts from your codebase patterns.

The fix is a persistent context file. Here's what it contains:

code

# Project Context

## Architecture
[Tech stack, key modules, database schema summary — 2–3 sentences]

## Conventions
[Naming rules, file structure, what NOT to do — specific, not aspirational]

## Current Sprint
[What you're building, what's explicitly out of scope]

## Constraints
[Performance requirements, security rules, deprecated patterns to avoid]

How to wire it in:

Claude Code — save as CLAUDE.md in your project root. It loads automatically on every session.
Cursor — save as .cursorrules.
Any other tool — paste it as the first message in every new conversation.

The discipline that matters is keeping it current. A stale context file is worse than no context file — it confidently anchors the model to patterns that no longer exist in your codebase. Review your context layer at the start of each sprint the same way you'd review your backlog.

In 2026, the JetBrains HAX Study (n=800 developers, 151.9M logged events across two years) found that developers who reported the largest sustained productivity gains were also the most consistent about maintaining session-to-session context with their AI tools. The correlation wasn't with model choice or prompt complexity — it was with structured context.

Code displayed on a developer's terminal screen with syntax highlighting, representing the persistent context layer driving consistent AI output across sessions

Step 2: Write the Spec Before You Write the Code

In 2025, 73% of DevSecOps professionals told GitLab they'd experienced direct problems with vibe-coded output — code generated by describing an application to an LLM without a prior specification (GitLab, 2025 Global DevSecOps Survey, n=3,266, conducted by The Harris Poll). The fix isn't a more detailed prompt. It's a spec.

The spec step isn't primarily about constraining the LLM — it's a forcing function for you. Most AI-assisted bugs aren't model failures; they're prompt failures caused by engineers who didn't know precisely what they wanted before they asked. Writing a one-page spec forces that clarity before any code gets generated.

What the spec step looks like:

Before touching code, open a conversation and use this prompt pattern:

code

Help me write a one-page spec for this feature:
[describe the feature in plain terms]

Include:
1. What it does and what it explicitly doesn't do
2. The data involved and its shape
3. The API contract — inputs, outputs, error states
4. Three edge cases worth handling explicitly

Read the output as an engineer. Push back on anything wrong. Add constraints the model missed. This takes 20 minutes and it prevents hours of debugging generated code that solved the wrong problem.

Once reviewed, the spec becomes the anchor for every subsequent step in this workflow.

Source: JetBrains Developer AI Tools Survey, January 2026

Step 3: Generate Code in Bounded, Reviewable Units

A Microsoft Research controlled study found that developers using GitHub Copilot completed representative coding tasks 55.8% faster on average — but task success rate improved only from 70% to 78% (GitHub, Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness, 2023). Speed goes up dramatically. Correctness goes up modestly. That gap is bridged by structured generation, not better prompting.

"Bounded and reviewable" means: never ask an LLM to generate an entire module in a single prompt. Generate one function at a time, one component at a time, one migration at a time. For each unit:

Provide the function signature or interface from your spec
State the input/output contract explicitly
Reference the pattern from your context layer you want it to match
Generate, then review before moving to the next unit

Is this slower than one-shot generation? Yes, upfront. It's faster than debugging one-shot generation — and dramatically faster than debugging one-shot generation that's been used as the foundation for five more units.

The compounding review problem is real. Every unit you skip reviewing multiplies the review burden of everything built on top of it. Generate small. Review immediately. Move forward.

Step 4: Run AI Code Review Before Human Review

In 2025, AI code review agent adoption jumped from 14.8% to 51.4% across enterprise engineering teams — and teams at 100% AI adoption cut median PR cycle time from 16.7 hours to 12.7 hours, a 24% reduction (Jellyfish, 2025 AI Metrics in Review, enterprise telemetry). That growth happened because AI review catches a specific class of bugs that human reviewers reliably miss: not because humans are careless, but because these bugs are tedious to look for under time pressure.

Unhandled edge cases. Inconsistent naming across the diff. Missing error handling in the happy path. Security anti-patterns from misusing a standard library. None of these are conceptually hard to catch. They just don't get caught because they're boring — and AI review is tireless.

The structured review prompt that works:

code

Review this diff against:
1. The spec: [paste relevant spec section]
2. Edge cases not covered
3. Security concerns
4. Consistency with these conventions: [paste context layer]

Return a numbered list. For each issue: severity (blocking / non-blocking) and a one-line fix.

The structured output format matters. Prose review is hard to act on. A numbered list with severity labels is a checklist. Fix blocking items before you open the PR for human review. List non-blocking items in the PR description as "known trade-offs" or "planned follow-ups." This tells human reviewers exactly where to spend their attention.

Close-up of a developer's monitor displaying colorful syntax-highlighted code in a dark workspace, illustrating the AI-assisted code review phase of a modern development workflow

According to Jellyfish's 2025 AI Metrics in Review, code review agent adoption grew from 14.8% to 51.4% between January and October 2025 alone. Teams that added AI review saw more consistent cycle time improvements than teams that only used AI for code generation — suggesting the review step is where workflow discipline translates most directly into delivery speed.

Source: Jellyfish, 2025 AI Metrics in Review

Step 5: Generate Tests from the Spec, Not from the Code

In January 2026, Salesforce published results from using Cursor AI for test generation across 76 legacy repositories: coverage effort dropped from 26 engineer-days per repository to just 4 — an 85% reduction — while PR velocity increased more than 30% (Cursor, Salesforce Engineering Case Study, January 2026, n=20,000+ Salesforce developers on Cursor).

The critical detail in that result: tests were generated from specifications describing expected behavior, not from existing implementations. Why does the distinction matter? Tests generated from code test what the code does — including its bugs. Tests generated from specs test what the code is supposed to do, independently of how it was implemented.

The test generation workflow:

Take a behavior from your spec: "Given a user with role VIEWER, accessing the admin endpoint should return 403."
Prompt: "Generate test cases for this behavior. For each case: input, expected output, and the edge condition it covers."
Review the test list. Add missing edge cases.
Prompt: "Implement these test cases in [your test framework]."
Run them. Fix failures in the implementation. If a test seems wrong, fix the spec first — not the test.

That last rule is non-negotiable. If you modify a test to make it pass rather than fixing the underlying behavior, you've broken the spec's integrity. The test is the contract.

Multiple monitors displaying programming code with neon ambient lighting in a professional developer workstation, representing software testing and debugging with AI assistance

Step 6: Debug AI Output Differently Than Your Own Code

In 2025, 66% of Stack Overflow survey respondents reported spending more time fixing "almost-right" AI-generated code than they'd saved in generation time (Stack Overflow 2025 Developer Survey, n=49,000+ developers). That's not a model quality problem — it's a debugging workflow problem.

When AI-generated code fails, the instinct is to re-prompt for a fix. That's usually the wrong first move. Re-generating a whole unit to patch a specific bug introduces new problems alongside the fix. It's like rebuilding a wall because of a faulty outlet.

Triage the failure first:

Is the spec violated? The LLM misunderstood a constraint. Fix that spec section and regenerate just the affected unit.
Is it an implementation bug? Provide the LLM with the error message, the failing test, and the specific function — not the whole file — and ask for a targeted fix.

Before re-prompting anything, ask the LLM to explain what it thinks the code does. Then compare that explanation to your expectation. Mismatches reveal the root cause faster than reading the error trace alone — and they tell you whether to fix the spec, the implementation, or your own mental model.

Treating LLM output as a junior engineer's first PR changes how you debug it. You'd explain the expected behavior to a junior engineer before asking them to fix a bug. The same approach works here and consistently finds the root cause faster than re-prompting from the error message alone.

Common Mistakes That Undermine the Whole Workflow

In 2025, Stack Overflow's Developer Survey found that developer trust in AI-generated code accuracy fell from 40% to 29% year over year — and positive favorability dropped from 72% to 60% (Stack Overflow 2025 Developer Survey, n=49,000+). Most of that erosion traces to predictable workflow failures, not model quality decline.

Source: Stack Overflow Developer Survey 2024 and 2025

1. Skipping the spec step and going straight to code generation. This is the single largest source of "almost-right" AI output. Without a spec, the LLM optimises for coherent-sounding code, not correct-for-your-system code. Even a one-paragraph spec meaningfully narrows the model's output distribution toward what you actually need.

2. Treating one-shot output as production-ready. Every generated unit is a first draft. Review it before using it as the foundation for the next unit. The compounding cost of skipped reviews is non-linear — it doesn't add, it multiplies.

3. Accumulating AI tools without consolidating. In 2025, GitLab found that teams using more than 5 AI tools faced roughly 7 hours per team member per week in compliance and integration overhead (GitLab, 2025 Global DevSecOps Survey, n=3,266). Pick the minimum set of tools that covers each phase of this workflow. Resist the pull to add more.

4. Letting the context layer go stale. A stale context file doesn't fail loudly. It silently anchors the model to patterns that no longer match your codebase, producing output that looks right and fits wrong. Treat context layer updates as a first-class engineering task.

Frequently Asked Questions

Which AI coding tool is best for this workflow?

Any tool with file-level context access works: Claude Code, Cursor, and GitHub Copilot each have different strengths but support the same six-step structure. In 2026, JetBrains found Copilot, Cursor, and Claude Code each at 18–29% professional at-work adoption. Start with whatever your team already uses and apply the workflow structure around it.

Does this workflow work for solo developers, or only for teams?

It scales to both. The spec step and context layer are arguably more valuable for a solo developer — with no team to catch context drift, systematic structure does that job instead. The AI review step replaces some (not all) of the value of a second pair of human eyes.

How long does the context layer setup take for an existing project?

Roughly 45–90 minutes for an existing project, less for a greenfield one. The initial investment compounds: every subsequent session benefits from not having to re-explain your architecture. Most developers who do it once don't go back to cold-session prompting.

What if AI-generated tests pass but the feature is still broken?

This almost always means tests were generated from the implementation rather than the spec. Tests that reflect what the code does — including its bugs — don't catch regressions. Revisit Step 5: generate test cases from the spec's behavioral descriptions first, then implement them separately.

How do I use this workflow with proprietary or sensitive code?

Check your tool's data handling policy. Claude Code and enterprise tiers of Copilot and Cursor don't train on your code. For highly sensitive codebases, a self-hosted model (Llama 3, CodeLlama, or similar via Ollama) works with the same workflow — the structure doesn't depend on a specific model or provider.

Closing

The six steps here aren't complex: persistent context, spec before code, bounded generation, AI review before human review, spec-grounded test generation, structured debugging.

What separates the teams in the Jellyfish data — the ones who doubled PR velocity — from everyone else isn't which tools they used. It's that AI was built into the structure of their workflow at every phase, not treated as a shortcut they reached for when stuck.

Try one step this sprint. The context layer is the highest-leverage starting point. Everything downstream gets better once the model has something stable to anchor to.

AI Engineering