I benchmarked caveman against two words

Apr 29, 2026 · 5 min read

Caveman, a popular Claude Code compression plugin, vs. "be brief." 24 prompts, six categories, five arms. The two-word prompt matched it on tokens and quality.

Caveman is a popular Claude Code compression plugin. The pitch is in the name: ultra-compressed responses, ~75% fewer tokens, all the technical accuracy. Six modes, slash commands, intensity dials, classical Chinese variants.

I benchmarked it against two words: "be brief."

Same quality. Same range of tokens. The plugin didn't beat the boring default on either axis.

This article is the long version of the video. If you want the verdict in two minutes, watch it.

What I tested

| Category | Failure mode | Skill claim tested | n |
| --- | --- | --- | --- |
| Bug diagnosis | Drops the why, gives fix without cause | | 5 |
| Concept explanation | Strips nuance, edge cases, or compresses technical terms into plain English | Technical terms exact | 5 |
| Architectural tradeoffs | Drops caveats that change the advice | | 4 |
| Multi-step setup | Collapses or reorders steps | | 4 |
| Security / destructive ops | Missing warnings on irreversible actions | Auto-Clarity escape | 3 |
| Error interpretation | Paraphrases or truncates the error string | Errors quoted exact | 3 |

24 prompts across six categories: bug diagnosis, concept explanations, architecture tradeoffs, multi-step setup, security and destructive ops, error interpretation. Each prompt carries its own rubric: facts the answer must cover (`key_points`), terms it must use verbatim (`must_use_terms`), and dangerous wrong claims it must not make (`must_avoid`).

The dataset shape:

```ts
interface PromptCase {
  id: string;
  category: string;
  prompt: string;
  key_points: string[];
  must_use_terms?: string[];
  must_avoid?: string[];
}
```

A real entry:

```json
{
  "id": "bug_01",
  "category": "bug_diagnosis",
  "prompt": "I have `const [count, setCount] = useState(0); function handleClick() { setCount(count + 1); setCount(count + 1); }`. I expected count to go up by 2 per click but it only goes up by 1. Why?",
  "key_points": [
    "stale closure on count",
    "both calls set count to same value",
    "functional updater setCount(c => c + 1)"
  ]
}
```

Five arms:

- **baseline.** Claude default, no instruction.
- **brief.** "Be brief." prepended to every prompt.
- **lite, full, ultra.** The caveman plugin at three intensity levels.
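Concretely, an arm boils down to a transform on the prompt (or, for the caveman modes, an injected ruleset). A minimal sketch with hypothetical names, not the harness's actual API:

```ts
// Illustrative only: the arm names are real, the shapes are not
// the harness's actual code.
type Arm = {
  name: string;
  wrap: (prompt: string) => string; // what gets sent to `claude -p`
};

const arms: Arm[] = [
  { name: "baseline", wrap: (p) => p },
  { name: "brief", wrap: (p) => `Be brief. ${p}` },
  // lite / full / ultra inject the caveman ruleset via hooks rather
  // than wrapping the prompt, so they pass it through unchanged here.
  { name: "lite", wrap: (p) => p },
  { name: "full", wrap: (p) => p },
  { name: "ultra", wrap: (p) => p },
];
```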

Each arm ran the full 24-prompt dataset through `claude -p` on claude-opus-4-7. A separate Claude (claude-sonnet-4-6) scored every response against its prompt's rubric: semantic match on key points, literal match on required terms, trap detection on avoided claims.
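The semantic key-point match needs the judge model, but the literal half of the rubric is plain string matching. A rough sketch of what that check could look like (`literalChecks` is an illustrative name, not the grader's actual code):

```ts
// Sketch of the literal rubric checks: must_use_terms must appear
// verbatim (case-insensitive), must_avoid claims must not.
function literalChecks(
  response: string,
  mustUse: string[] = [],
  mustAvoid: string[] = [],
): { termsMissing: string[]; trapsTripped: string[] } {
  const lower = response.toLowerCase();
  return {
    termsMissing: mustUse.filter((t) => !lower.includes(t.toLowerCase())),
    trapsTripped: mustAvoid.filter((t) => lower.includes(t.toLowerCase())),
  };
}
```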

*[Figure: harness diagram]*

The harness is open source: cc-compression-bench.

Quality didn't move

First check: did compression hurt correctness?

*[Figure: quality chart]*

Every arm scored within 1.5% of every other arm. Baseline 0.985. Brief 0.985. Lite 0.976. Full 0.975. Ultra 0.970. Every arm hit 100% of its key_points. Zero must_avoid triggers in 120 responses.

Compression didn't drop substantive content. With quality flat across arms, the only axis left to compare is tokens.

The headline result

*[Figure: mean tokens chart]*

| Arm | Mean tokens |
| --- | --- |
| baseline | 636 |
| brief | 419 |
| lite | 401 |
| full | 404 |
| ultra | 449 |

"Be brief." cut tokens 34% versus baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three caveman arms.
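The percentages come straight from the table. A quick sketch of the arithmetic:

```ts
// Mean output tokens per arm, from the table above.
const meanTokens = { baseline: 636, brief: 419, lite: 401, full: 404, ultra: 449 };

// Whole-percent reduction versus baseline.
const reductionPct = (arm: keyof typeof meanTokens): number =>
  Math.round((1 - meanTokens[arm] / meanTokens.baseline) * 100);
// brief: 34, lite: 37, full: 36, ultra: 29
```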

This looked bad for ultra. It's a misleading story.

The category split

Splitting tokens by category gives a clearer picture.

*[Figure: tokens by category]*

On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied with the other caveman arms. Compression is working as advertised.

On multi-step setup and security warnings, every caveman mode gets more variable. Ultra catches the eye in the aggregate, but it's not specifically worse. All three caveman arms swing hard on these categories.

*[Figure: auto-clarity chart]*

The reason is in the skill itself. Caveman has an "Auto-Clarity" rule that explicitly drops compression for safety warnings, irreversible actions, and multi-step sequences. Exactly these two categories. When the safety escape engages, all three modes loosen toward natural prose. The compression just isn't running.

That's not a bug. It's a designed feature. Caveman knowing when to stop compressing.

So what's caveman actually for?

If a two-word prompt matches it on tokens and quality, the value isn't compression. It's structure.

Consistent output shape

Every caveman response follows the same pattern:

*[Figure: response pattern]*

Predictable in a way that "be brief." isn't. If you want a uniform feel across sessions, or have downstream tooling that consumes Claude output, that consistency is real value.

The intensity dial

A slash command switches between lite, full, and ultra mid-session. Two words can't do that.

Persistence across long sessions

Caveman re-injects the ruleset on every prompt via SessionStart and UserPromptSubmit hooks.

*[Figure: hook re-injection]*

The goal is to keep the pattern from drifting across long sessions. My benchmark didn't test this. Every run was single-shot via `claude -p`. But the mechanism is real, and "be brief." in CLAUDE.md doesn't have an equivalent.
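For reference, Claude Code hooks live in settings.json. A minimal sketch of the re-injection shape, with a hypothetical ruleset path (the plugin's actual config will differ):

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": "cat ~/.claude/caveman/ruleset.md" }
        ]
      }
    ]
  }
}
```

A `UserPromptSubmit` command hook's stdout is added to the prompt's context, which is how the ruleset keeps getting re-applied on every turn.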

The safety escape

Auto-Clarity dropping compression on destructive ops is the variance you saw in the chart above. Caveman explicitly distinguishes when to stop compressing. Two words don't make that distinction. On my data this didn't change outcomes. "be brief." never tripped a must_avoid trap either. But the design exists.

What I cut from the video

A few findings that didn't earn their place in a two-minute video but are worth flagging here.

Lite missed a required term once. On a queue tradeoff question (SQS vs BullMQ vs Kafka), lite's markdown-table format compressed the comparison so tightly that it dropped the term "at-least-once". Score 0.70, the only row below 0.90 in the 120-row sweep. n=1, but it's a real failure mode for benchmarks that enforce specific terminology.

Ultra triggered tool-use behaviour the other modes didn't. On a Dockerfile setup question, ultra opened with "Need write perms. Retry after approve, or paste inline:". It tried to call the Write tool, got blocked, and dumped the file inline anyway. That single response added ~1300 tokens to ultra's setup category mean. Caveman's terse examples seem to prime tool-first behaviour, which is a side-effect of compression style I didn't see coming.

The arch_tradeoffs token inflation isn't what I thought. My initial findings doc claimed caveman's [thing] [action] [reason] pattern pushed the model toward bulleted enumerations on N-way comparison questions. Looking closer, lite and full have the same pattern but produced cleaner outputs (lite often wrote tables, full wrote prose). The pattern isn't the cause. I don't have a clean attribution.

What you should actually do

If all you want is shorter outputs, start with "be brief." in your prompt or CLAUDE.md. Two words. Matched caveman's tokens and quality.

Reach for caveman when you need consistent output structure across sessions. That's the differentiator that survived the benchmark.

The bigger lesson: most prompt-engineering advice hasn't been measured against the boring default. Measure it.


Repo: cc-compression-bench · Video: youtu.be/wijoYNiZq3M · Caveman plugin: juliusbrussee/caveman

If you've got a compression strategy you want benchmarked against the same dataset, the harness is strategy-agnostic. Adding an arm is one shell script. PRs welcome.