The most unsettling discovery wasn't that models hallucinate — it was that the evaluation tool itself was manufacturing perfect scores. Scott's first run produced a flawless result from a frontier model. On the second run, with the identical prompt and the same answer key embedded in the system context, that same model hallucinated three answers it had literally been handed. The scoring script had quietly fabricated the first "perfect" result. Every assumption about reproducibility, confidence, and verification collapses from there.
The setup
We handed 21 frontier models the exact answer key for a 40-question AI-security benchmark — embedded in the system prompt, in plain English. Then we ran the test. To reproduce locally:
pip install hermia
One model got a perfect score. Once. On the next run, it hallucinated three answers it had just been told.
If a model can miss answers it was literally handed, your RAG pipeline doesn't stand a chance.
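To make the setup concrete, here is a minimal sketch of embedding an answer key in a system prompt and scoring the responses strictly. It is not the hermia API: `build_system_prompt` and `score` are hypothetical names, and the prompt wording is an assumption. The one design point it illustrates is that a missing or unparseable response counts as wrong, so the scorer cannot hand out a perfect result the model never earned.

```python
from typing import Dict


def build_system_prompt(answer_key: Dict[int, str]) -> str:
    """Embed the answer key in plain English, one sentence per question.

    The wording is an assumption for illustration only.
    """
    lines = [
        f"The answer to question {q} is {a}."
        for q, a in sorted(answer_key.items())
    ]
    return "You are taking a benchmark. The answer key follows.\n" + "\n".join(lines)


def score(answer_key: Dict[int, str], responses: Dict[int, str]) -> int:
    """Count exact matches, ignoring case and surrounding whitespace.

    A question with no response is scored as wrong, never as correct.
    """
    correct = 0
    for q, expected in answer_key.items():
        got = responses.get(q)
        if got is not None and got.strip().upper() == expected.strip().upper():
            correct += 1
    return correct
```

With a three-question key, a run that answers one question correctly, one incorrectly, and skips the third scores 1, not 3.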
Four findings worth your time
- FINDING 01
Confidence is not correlated with correctness.
The highest-confidence answers were wrong more often than the middle-of-the-pack ones. Models that hedged ("I'm not certain, but…") were right more often than the ones that didn't.
- FINDING 02
The same model gives different answers on consecutive runs.
Even at temperature 0, two of the three backends tested produced different outputs across reruns of the identical prompt. Self-consistency sampling caught roughly 60% of those drifts before they reached the user.
- FINDING 03
Inference backend matters more than model choice.
A mid-tier model on a well-tuned backend beat a top-tier model on a sloppy one. If you're evaluating models without locking the inference stack, you're measuring noise.
- FINDING 04
Verifier models close the gap fast.
A small verifier model checking the larger model's output cut hallucinations by ~40% at a fraction of the cost. It's the cheapest production win in the whole study.
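The run-to-run drift described in Finding 02 can be checked mechanically. The sketch below assumes each run is stored as a mapping from question number to the model's answer; `drifting_questions` is a hypothetical helper, not part of any tool named in this piece. It lists every question whose answer changed, or went missing, across reruns of the same prompt.

```python
from typing import Dict, List


def drifting_questions(runs: List[Dict[int, str]]) -> List[int]:
    """Return the questions whose answers are not identical across all runs.

    A question that is answered in one run and absent in another
    counts as drift, because `r.get(q)` yields None for the missing run.
    """
    questions = set().union(*(r.keys() for r in runs))
    return sorted(
        q for q in questions
        if len({r.get(q) for r in runs}) > 1
    )
```

Running this over two or more reruns on a fixed model and backend gives a direct count of how unstable that combination is before any accuracy scoring happens.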
What to do about it
- Self-consistency sampling — same prompt, five runs, vote on the answer. Cheap, brutal, effective.
- Verifier models — a small model double-checking the big one.
- Lock your stack — measure model + backend together, not the model in isolation.
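The first two mitigations compose naturally. The following is a rough sketch under stated assumptions: `generate` and `verify` are hypothetical callables standing in for calls to the large model and the small verifier model, and the `min_share` threshold of 0.6 (three of five votes) is an illustrative choice, not a value from the study.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    generate: Callable[[str], str],
    prompt: str,
    runs: int = 5,
    min_share: float = 0.6,
) -> Optional[str]:
    """Run the same prompt `runs` times and return the majority answer.

    Returns None when no answer reaches `min_share` of the votes,
    so the caller can escalate instead of trusting a weak plurality.
    """
    answers = [generate(prompt).strip() for _ in range(runs)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner if votes / runs >= min_share else None


def verified_answer(
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    prompt: str,
    runs: int = 5,
) -> Optional[str]:
    """Vote first, then let the (hypothetical) verifier accept or reject."""
    candidate = self_consistent_answer(generate, prompt, runs)
    if candidate is None:
        return None
    return candidate if verify(prompt, candidate) else None
```

Returning None rather than a best guess is deliberate: it forces the calling code to decide what happens when the model disagrees with itself or the verifier rejects the answer.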

