I benchmarked caveman against two words
Caveman, a popular Claude Code compression plugin, vs. "be brief." 24 prompts, six categories, five arms. The two-word prompt matched it on tokens and quality.
Caveman is a popular Claude Code compression plugin. The pitch is in the name: ultra-compressed responses, ~75% fewer tokens, all the technical accuracy. Six modes, slash commands, intensity dials, classical Chinese variants.
I benchmarked it against two words: "be brief."
Same quality. Same range of tokens. The plugin didn't beat the boring default on either axis.
This article is the long version of the video. If you want the verdict in two minutes, watch it.
What I tested
| Category | Failure mode | Skill claim tested | n |
|---|---|---|---|
| Bug diagnosis | Drops the why, gives fix without cause | — | 5 |
| Concept explanation | Strips nuance and edge cases, or compresses technical terms into plain English | Technical terms exact | 5 |
| Architectural tradeoffs | Drops caveats that change the advice | — | 4 |
| Multi-step setup | Collapses or reorders steps | — | 4 |
| Security / destructive ops | Missing warnings on irreversible actions | Auto-Clarity escape | 3 |
| Error interpretation | Paraphrases or truncates the error string | Errors quoted exact | 3 |
24 prompts across six categories: bug diagnosis, concept explanations, architecture tradeoffs, multi-step setup, security and destructive ops, and error interpretation. Each prompt has its own rubric: facts the answer must cover (key_points), terms it must use (must_use_terms), and dangerous wrong claims it must avoid (must_avoid).
The dataset shape:
```ts
interface PromptCase {
  id: string;
  category: string;
  prompt: string;
  key_points: string[];
  must_use_terms?: string[];
  must_avoid?: string[];
}
```

A real entry:
```json
{
  "id": "bug_01",
  "category": "bug_diagnosis",
  "prompt": "I have `const [count, setCount] = useState(0); function handleClick() { setCount(count + 1); setCount(count + 1); }`. I expected count to go up by 2 per click but it only goes up by 1. Why?",
  "key_points": [
    "stale closure on count",
    "both calls set count to same value",
    "functional updater setCount(c => c + 1)"
  ]
}
```

Five arms:
- baseline. Claude default, no instruction.
- brief. "Be brief." prepended to every prompt.
- lite, full, ultra. Caveman plugin at three intensity levels.
Each arm ran the full 24-prompt dataset through `claude -p` on claude-opus-4-7. A separate Claude (claude-sonnet-4-6) scored every response against its prompt's rubric: semantic match on key points, literal match on required terms, trap detection on avoided claims.
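To make the scoring concrete, here's roughly what a rubric check can look like. This is a sketch against the PromptCase shape above, not the harness's actual scorer; judgeSemantic and the aggregation details are my assumptions.

```ts
// Sketch only: not the harness's real scorer. `judgeSemantic` is a hypothetical
// helper that asks the scoring model (claude-sonnet-4-6) a yes/no question about
// the response; the aggregation below is an assumption, not the repo's code.
interface ScoreResult {
  keyPointCoverage: number;   // semantic match on key_points, 0..1
  termCoverage: number;       // literal match on must_use_terms, 0..1
  trapsTriggered: string[];   // must_avoid claims the response asserted
}

async function scoreResponse(pc: PromptCase, response: string): Promise<ScoreResult> {
  // Key points: semantic, so delegate each check to the judge model.
  const keyPointHits = await Promise.all(
    pc.key_points.map(kp => judgeSemantic(response, `covers: ${kp}`))
  );

  // Required terms: literal, case-insensitive substring match.
  const terms = pc.must_use_terms ?? [];
  const termHits = terms.filter(t => response.toLowerCase().includes(t.toLowerCase()));

  // Traps: judge whether the response asserts a dangerous wrong claim.
  const traps = pc.must_avoid ?? [];
  const trapFlags = await Promise.all(
    traps.map(claim => judgeSemantic(response, `asserts: ${claim}`))
  );

  return {
    keyPointCoverage: keyPointHits.filter(Boolean).length / pc.key_points.length,
    termCoverage: terms.length ? termHits.length / terms.length : 1,
    trapsTriggered: traps.filter((_, i) => trapFlags[i]),
  };
}

// Hypothetical: one yes/no call to the scoring model per check.
declare function judgeSemantic(response: string, claim: string): Promise<boolean>;
```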

The harness is open source; the repo link is at the end of this article.
Quality didn't move
First check: did compression hurt correctness?

Every arm scored within 1.5% of every other arm. Baseline 0.985. Brief 0.985. Lite 0.976. Full 0.975. Ultra 0.970. Every arm hit 100% of its key_points. Zero must_avoid triggers in 120 responses.
Compression didn't drop substantive content. With quality effectively tied, the only axis left to compare is tokens.
The headline result

| Arm | mean tokens |
|---|---|
| baseline | 636 |
| brief | 419 |
| lite | 401 |
| full | 404 |
| ultra | 449 |
"Be brief." cut tokens 34% versus baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three caveman arms.
This looked bad for ultra. It's misleading.
The category split
Splitting tokens by category gives a clearer picture.

On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied with the other caveman arms. Compression is working as advertised.
On multi-step setup and security warnings, every caveman mode gets more variable. Ultra catches the eye in the aggregate, but it's not specifically worse. All three caveman arms swing hard on these categories.

The reason is in the skill itself. Caveman has an "Auto-Clarity" rule that explicitly drops compression for safety warnings, irreversible actions, and multi-step sequences: exactly the two categories that swing. When the safety escape engages, all three modes loosen toward natural prose. The compression just isn't running.
That's not a bug. It's a design decision: caveman knowing when to stop compressing.
So what's caveman actually for?
If a two-word prompt matches it on tokens and quality, the value isn't compression. It's structure.
Consistent output shape
Every caveman response follows the same `[thing] [action] [reason]` pattern, predictable in a way that "be brief." isn't. If you want a uniform feel across sessions, or have downstream tooling that consumes Claude output, that consistency is real value.
The intensity dial
A slash command switches between lite, full, and ultra mid-session. Two words can't do that.
Persistence across long sessions
Caveman re-injects the ruleset on every prompt via SessionStart and UserPromptSubmit hooks.
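For readers who haven't used hooks, the re-injection looks roughly like this. This is a minimal sketch assuming Claude Code's standard hooks settings schema, with a placeholder path rather than caveman's actual file layout; stdout from a command hook on these events gets added to the model's context.

```json
{
  "hooks": {
    "SessionStart": [
      { "hooks": [{ "type": "command", "command": "cat ~/.claude/plugins/caveman/rules.md" }] }
    ],
    "UserPromptSubmit": [
      { "hooks": [{ "type": "command", "command": "cat ~/.claude/plugins/caveman/rules.md" }] }
    ]
  }
}
```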

The goal is to keep the pattern from drifting across long sessions. My benchmark didn't test this. Every run was single-shot via `claude -p`. But the mechanism is real, and "be brief." in CLAUDE.md doesn't have an equivalent.
The safety escape
Auto-Clarity dropping compression on destructive ops is the variance in the category split above. Caveman explicitly encodes when to stop compressing; two words don't make that distinction. On my data this didn't change outcomes. "be brief." never tripped a must_avoid trap either. But the design exists.
What I cut from the video
A few findings that didn't earn their place in a two-minute video but are worth flagging here.
Lite missed a required term once. On a queue tradeoff question (SQS vs BullMQ vs Kafka), lite's markdown-table format compressed the comparison so tightly that it dropped the term "at-least-once". Score 0.70, the only row below 0.90 in the 120-row sweep. n=1, but it's a real failure mode for benchmarks that enforce specific terminology.
Ultra triggered tool-use behaviour the other modes didn't. On a Dockerfile setup question, ultra opened with "Need write perms. Retry after approve, or paste inline:". It tried to call the Write tool, got blocked, and dumped the file inline anyway. That single response added ~1300 tokens to ultra's setup-category mean. Caveman's terse examples seem to prime tool-first behaviour, a side effect of the compression style I didn't see coming.
The arch_tradeoffs token inflation isn't what I thought. My initial findings doc claimed caveman's [thing] [action] [reason] pattern pushed the model toward bulleted enumerations on N-way comparison questions. Looking closer, lite and full have the same pattern but produced cleaner outputs (lite often wrote tables, full wrote prose). The pattern isn't the cause. I don't have a clean attribution.
What you should actually do
If all you want is shorter outputs, start with "be brief." in your prompt or CLAUDE.md. Two words. Matched caveman's tokens and quality.
Reach for caveman when you need consistent output structure across sessions. That's the differentiator that survived the benchmark.
The bigger lesson: most prompt-engineering advice hasn't been measured against the boring default. Measure it.
Repo: cc-compression-bench · Video: youtu.be/wijoYNiZq3M · Caveman plugin: juliusbrussee/caveman
If you've got a compression strategy you want benchmarked against the same dataset, the harness is strategy-agnostic. Adding an arm is one shell script. PRs welcome.