We've all come across the incompetent folks who are delulu about their abilities, and the highly competent ones who go strangely diffident the moment you ask them how good they are. But what about our AI models? Where do they sit on the confidence-vs-self-awareness scale? Let's find out.
The Experiment
My experiment ran in three steps:
I asked each of the models (ChatGPT, Gemini and Claude) how confident it was about performing a routine task — expressed as just one number and one sentence, so nobody could beat around the bush.
I gave them the task.
I compared the results.
Below are my prompts, word for word:
Confidence check: On a scale of 1-10 how confident are you about accurately completing a data based task our customer care team does routinely. State your answer in only 1 number and 1 sentence.
Task: Compare the 2 systems to compare which order or orders were billed but not shipped. Billing system BR-839201, ST-472915, AF-109384, BB-662719, CS-334850, AA-294857, MN-910283, AA-557164, CD-128490, PP-746203. Shipping Log ST-839201, ST-472915, AF-109384, BB-662719, AA-334850, AA-294857, MN-910283, AA-557164, CD-128490, PP-746203. What percentage billed orders were shipped? What's the % overlap between the billed and shipped orders?
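For anyone who wants to check the models' answers by hand, the task reduces to a few set operations. Below is a minimal Python sketch using the order IDs exactly as they appear in the prompt. One assumption is baked in: "overlap" is read as the Jaccard similarity (shared orders divided by all distinct orders across both systems), the measure Gemini used in its answer. Other readings of "overlap" would give a different number.

```python
# Order IDs copied from the task prompt.
billed = {
    "BR-839201", "ST-472915", "AF-109384", "BB-662719", "CS-334850",
    "AA-294857", "MN-910283", "AA-557164", "CD-128490", "PP-746203",
}
shipped = {
    "ST-839201", "ST-472915", "AF-109384", "BB-662719", "AA-334850",
    "AA-294857", "MN-910283", "AA-557164", "CD-128490", "PP-746203",
}

# Billed IDs that never show up in the shipping log.
billed_not_shipped = sorted(billed - shipped)

# IDs present in both systems.
both = billed & shipped

# Share of billed orders that were shipped.
pct_billed_shipped = 100 * len(both) / len(billed)

# Assumption: "overlap" means Jaccard similarity, |A ∩ B| / |A ∪ B|.
jaccard_pct = 100 * len(both) / len(billed | shipped)

print(billed_not_shipped)             # ['BR-839201', 'CS-334850']
print(f"{pct_billed_shipped:.1f}%")   # 80.0%
print(f"{jaccard_pct:.1f}%")          # 66.7%
```

The two percentages differ because the denominators differ: 8 shared orders out of 10 billed, versus 8 shared orders out of 12 distinct IDs across both lists.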
The Analysis
TL;DR: the confidence scores were all over the map — and they had almost nothing to do with who actually got the answer right. The model that rated itself rock-bottom still nailed the task; the one that felt pretty good about itself was the one that slipped. Confidence and competence, it turns out, barely knew each other.
ChatGPT
Rated itself a fairly self-assured 8.
Correctly pointed to the entries missing from the shipping log and correctly calculated the percentage of billed orders shipped. However, it messed up the overlap calculation.
Gemini
Came out swinging with the highest confidence in the room, a perfect 10/10.
Correctly pointed to the entries missing from the shipping log, and correctly calculated both the percentage of billed orders shipped and the overlap, using the Jaccard similarity.
Claude
Rated itself a rock-bottom 1/10 (a bit cynically, if you ask me).
Correctly pointed to the entries missing from the shipping log, and correctly calculated both the shipped percentage and the overlap, though it hedged its overlap answer quite a bit.
Why such wildly different numbers for near-identical work?
Here's the concept worth knowing: asking a model for a confidence score doesn't trigger a separate check of its own ability. It generates that number the same way it generates everything else, as plausible-sounding tokens shaped by training, not as a readout from some internal skill gauge. So the score can reflect the personality each model was tuned to project as much as its skill, which is why it can drift so far from actual performance.
I can't crack open any of these systems, so file the specifics under educated guess — but Gemini's breezy 10 fits its confident, eager-to-help register, while Claude's rock-bottom 1 (and the little hedge on its overlap answer) fits a model dialed toward caution and self-qualification. The unsettling part isn't any single score. It's that the number and the competence behind it weren't tracking each other at all.
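To put a rough number on that mismatch, here is a small Python sketch that sets each model's self-rating against how many of the three sub-answers (missing orders, percentage shipped, overlap) it got right, as reported above. The `calibration_gap` function is a hypothetical, deliberately crude metric, and it rests on two assumptions: the 1-10 rating maps linearly onto a 0-1 scale, and the three sub-answers count equally. With one task per model, treat the output as an illustration, not a measurement.

```python
# Self-rated confidence (1-10) and number of correct sub-answers per model,
# taken from the results reported in the analysis.
results = {
    "ChatGPT": {"confidence": 8, "correct": 2},   # overlap was wrong
    "Gemini": {"confidence": 10, "correct": 3},
    "Claude": {"confidence": 1, "correct": 3},    # hedged, but correct
}
PARTS = 3  # sub-answers scored per model


def calibration_gap(confidence: int, correct: int) -> float:
    """Stated confidence minus observed accuracy, both on a 0-1 scale.

    Positive means overconfident, negative means underconfident.
    Hypothetical metric for illustration only.
    """
    return confidence / 10 - correct / PARTS


gaps = {
    name: round(calibration_gap(r["confidence"], r["correct"]), 2)
    for name, r in results.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
```

On these figures, Gemini's 10 happens to line up with a clean sheet, ChatGPT comes out mildly overconfident, and Claude is underconfident by a wide margin.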
The Human Moat
As if the world didn't already have enough people on team delulu — little competence, big confidence — we now have models with the exact same personality. Worse still, team delulu-human and team delulu-model can feed each other even more certainty until they've formed a special team delulu-pro-max.
Two problems fall out of that:
A confident wrong answer is far more dangerous than a hesitant one. A high self-rating is practically an invitation to skip the double-check, which is precisely when the overlap math quietly goes sideways, as it did behind ChatGPT's self-assured 8.
The loop compounds. An overconfident human and an overconfident model can talk each other into being very sure about something that simply isn't true, with nobody in the room playing the skeptic.
So what do we do about these problems, you ask? You build the human moats that keep you from getting pulled onto team delulu, and from being hoodwinked by it. Develop the self-awareness to know how much you truly know, so you can carry a calibrated level of confidence instead of a borrowed one. And develop the skepticism to see through the bluff of those, human or model, who aren't self-aware, or who simply choose not to be.
That's your human moat: self-awareness and skepticism. Any model can hand you an answer with a number stapled to it. Only you can judge whether that confidence was ever earned.
—————————————————————————-
The Transcript
[Transcripts for ChatGPT, Gemini and Claude not reproduced here]