AI Engineering

Caveman: How Stone-Age Grammar Cuts AI Agent Token Costs by 65%

May 20, 2026·8 min read

Developer ToolsLLMs

You're mid-sprint. Your AI coding agent is explaining a React re-render bug with caveats, background context, and prose scaffolding you didn't ask for. The fix is three lines of code. The response is 1,180 tokens.

This is the output verbosity problem. And a viral open-source tool called Caveman is solving it in the strangest possible way: by forcing your AI agent to talk like a prehistoric human.

The Token Math Gets Worse Than You Think

Claude Sonnet 4.6 output tokens cost $15.00 per million. Opus 4.6 runs $25.00. Per-token, that sounds manageable — until you account for what agentic workflows actually do to the equation.

A single chatbot exchange produces one response. A five-step agentic loop produces 3.2x the tokens of the same task in direct chat mode. A 200-step autonomous build loop can hit 100x the token intensity. The median developer running AI coding tools is now spending $480/month on API calls. The 99th percentile is above $4,200.

And here's the part worth sitting with: a meaningful chunk of that spend isn't reasoning or tool use. It's the agent narrating its own thinking process, adding hedges nobody asked for, and building explanations for decisions you already trust it to make. That's pure output verbosity. It's money spent on words, not work.

Most token optimization tools attack the input side — compressing prompts, trimming context windows, deduplicating memory files. Caveman is one of the few tools designed specifically for the output side, which is harder to control but higher-leverage in tight coding loops.

What Caveman Actually Does

The JuliusBrussee/caveman repository (MIT license, 62,300+ stars as of May 2026) is a Claude Code skill and multi-agent plugin that installs a persona constraint directly into your coding agent's system context. That constraint is blunt: respond like a caveman.

No articles. No prepositions where unnecessary. No hedging clauses. No "of course, I'd be happy to help you with that." Just the answer.

The project's tagline demonstrates the philosophy in its own syntax: "why use many token when few token do trick."

Here's what that looks like in practice on a real debug task:

Default agent output — React re-render issue:

"I can see that the component is re-rendering unnecessarily because the useCallback hook is missing its dependency array. This means every time the parent re-renders, a new function reference is created, causing child components to re-render as well. To fix this, you'll want to..." (1,180 tokens)

Caveman output:

"useCallback miss dep array. new ref on parent render = child rerender. add [stableId] to deps." (159 tokens — 87% reduction)

The technical information is identical. What's stripped is the narrative scaffolding around it.

This works because large language models respond to persona constraints more reliably than vague instructions. Telling an agent to "be concise" produces inconsistent results — the model reverts to verbose defaults under generation pressure. Telling it to be a caveman activates a stable behavioral pattern the model has internalized from training data. The persona acts as a hard constraint, not a suggestion.

What the Benchmarks Show

The Caveman README published compression data across 10 representative development tasks. The average is 65% output token reduction, with a range from 22% on open-ended discussion tasks to 87% on structured debug tasks.

Task	Default	Caveman	Reduction
React re-render debug	1,180	159	87%
PostgreSQL setup	2,347	380	84%
Auth middleware fix	704	121	83%
Git rebase explanation	~290	~122	58%
PR security review	678	398	41%

Source: JuliusBrussee/caveman README, May 2026

The variance is itself informative. Tasks with the lowest compression (41%) are judgment-heavy reviews where nuanced reasoning genuinely adds value. Tasks with the highest compression (87%) are cases where a deterministic answer was wrapped in explanatory prose that served no technical purpose.

If you start running Caveman and see consistent 80%+ compression across your sessions, that's a diagnostic signal — the default verbosity wasn't serving you. It was padding output for its own sake.

At $15/million output tokens, a 65% daily reduction on 100,000 output tokens saves roughly $975/month. That's before accounting for compounding: shorter responses return faster, which means fewer tokens re-sent as context on subsequent loop iterations.

Installing It Takes 60 Seconds

There's no config file to write. No API key to provision. No dependency chain to resolve.

macOS / Linux / WSL:

bash

curl -fsSL https://juliusbrussee.github.io/caveman/install.sh | bash

Windows (PowerShell):

powershell

irm https://juliusbrussee.github.io/caveman/install.ps1 | iex

The installer detects your active agent environment and registers the skill automatically. Three commands become available across all 30+ supported agents:

/caveman — activates caveman grammar mode for the current session
/caveman-stats — shows lifetime token savings with a USD estimate
/caveman-compress — compresses memory files, cutting input tokens by ~46%

Compatibility covers Claude Code, Codex CLI, Gemini CLI, Cursor, Windsurf, Cline, GitHub Copilot, Aider, and 22 more — with identical commands across all of them.

The hidden upside is /caveman-compress. Memory files in persistent agent setups grow significantly over weeks of daily use. Compressing them reduces the input tokens sent on every subsequent agent call. It's a one-time operation that compounds across every future session — and it works on the input side, stacking on top of the output savings from caveman mode itself.

How It Compares to the Alternatives

Token compression isn't a new idea, but the existing solutions sit in different parts of the stack.

Microsoft LLMLingua (~5,800 stars) is a research-grade Python library that compresses input prompts using a smaller LM to score token importance — achieving up to 20x compression in research conditions. Technically impressive, but it requires significant setup, targets input tokens rather than output, and isn't integrated into any IDE plugin workflow. It's built for researchers, not for developers who want to reduce their daily bill.

wilpel/caveman-compression (~947 stars) is a Python library implementing three compression strategies on context windows — LLM-based, MLM-based, and NLP-based — achieving around 40% average compression with 100% factual preservation across tested outputs. Solid approach, no agent integration.

Anthropic's built-in context compaction is reactive. It triggers auto-summarization when you hit the context window limit. There's no proactive savings and no visibility into spend — it's a safety net, not an optimization layer.

What makes Caveman's position interesting is the combination: it targets the right surface (output tokens, interactive workflows), uses a mechanism that's reliable in practice (persona constraint beats vague instruction every time), and has near-zero setup friction. It's not the most technically sophisticated approach in the field. It's the most practical one for a developer's actual daily workflow.

The Bigger Signal

Caveman is a useful tool. It's also a signal about a structural problem.

The fact that an agent explaining a three-line bug fix produces 1,180 tokens by default isn't a quirk — it's a consequence of how these models were trained. RLHF reward models tend to favor detailed, thorough-sounding responses. That preference gets baked into weights. For a chatbot, that's often the right call. For a coding agent in a tight loop, it's expensive noise.

The persona constraint approach is a clever workaround, not a root fix. The root fix is fine-tuning — which is exactly what cavegemma, the fourth tool in JuliusBrussee's ecosystem, is building toward. A fine-tuned model would produce compressed output by default, with no persona prompt overhead and no behavioral shaping required. That's a materially different reliability profile.

JuliusBrussee is building a four-tool ecosystem around this philosophy: caveman for output compression, cavemem for cross-agent persistent memory (TypeScript + SQLite with FTS5), cavekit for spec-driven parallel build loops, and cavegemma as the eventual model layer. Whether this matures into a serious platform depends on sustained adoption — and with 62,300+ stars in under two months, the demand signal is there.

The Bottom Line

Until low-verbosity fine-tuned models are widely available and agent-integrated, behavioral constraints like Caveman are the practical path forward. The install is low-risk. The math is real. A 65% average output token reduction compounding over daily agent usage adds up to hundreds of dollars in savings and a snappier development loop.

The engineers running the tightest AI coding workflows in 2026 aren't just picking better models or writing better prompts. They're thinking carefully about where tokens go and cutting the ones that don't do work. Caveman makes that easy.

AI Engineering