#003 — Calibration
Tags: cognition, llm-behavior, human-ai-interaction, alignment

Calibration is the match between expressed confidence and actual correctness. A well-calibrated agent feels about 70 percent sure when it is right about 70 percent of the time, and most agents, human and machine alike, are not well-calibrated. What matters is not that they fail. What matters is that the failure mode of the human-AI team is worse than either side produces alone, and that one of the most widely deployed engineering interventions of the last three years actively makes it worse.
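To make the definition concrete, here is a minimal sketch of expected calibration error (ECE), one common way to score the gap between stated confidence and observed accuracy. The function name, the equal-width binning, and the toy data are illustrative assumptions, not something this entry prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: ten answers, each given at 70% confidence.
calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)
overconfident = expected_calibration_error([0.7] * 10, [True] * 4 + [False] * 6)
```

In the first toy case the agent is right 7 times out of 10 at 70 percent confidence, so the error is approximately zero. In the second it is right only 4 times out of 10, and the error is about 0.3.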

The most dangerous moment in human-AI collaboration is the moment the AI is most confident. Not because confident AI is more often wrong, but because human metacognition shuts off in the presence of fluent confidence, and fluency is what RLHF trains.


🧠 In humans

When people do not know something, they often do not know that they do not know it. Kruger & Dunning (1999) gave the canonical demonstration: subjects in the bottom quartile of skill at a task rated themselves above average. The deficit that produced the error was the same deficit that blocked recognition of the error. This is a structural failure, not a failure of modesty. People who cannot grade their own writing also cannot grade their grading.

Rozenblit & Keil (2002) sharpened the same point on everyday objects. Asked how a zipper, a flush toilet, or a helicopter works, subjects rated their understanding highly. Asked to actually explain it, the rating collapsed on contact with the explanation. The miscalibration was not a stable belief about competence. It was a default state of unexamined confidence, available for puncturing as soon as the explanation was demanded.

A third pattern: human calibration shows a hard-easy asymmetry (Lichtenstein, Fischhoff & Phillips, 1982). People are reasonably well-calibrated, and sometimes slightly underconfident, on easy questions, and badly overconfident on hard ones. The boundary between easy and hard is where consequential failures live, and the agent has no way to locate that boundary from inside.
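The asymmetry can be made visible by splitting answers by difficulty and comparing mean confidence with accuracy in each group. The sketch below assumes each answer has already been labeled "easy" or "hard" by some external criterion. The labels, the function name, and the numbers are hypothetical and are not data from the cited study.

```python
from collections import defaultdict


def overconfidence_by_group(records):
    """records: iterable of (group, confidence, correct).

    Returns {group: mean confidence - accuracy}. Positive means overconfident.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])  # confidence sum, correct count, n
    for group, conf, ok in records:
        entry = sums[group]
        entry[0] += conf
        entry[1] += 1 if ok else 0
        entry[2] += 1
    return {g: s[0] / s[2] - s[1] / s[2] for g, s in sums.items()}


# Hypothetical toy data: confidence barely drops on hard items, accuracy does.
records = (
    [("easy", 0.9, True)] * 9 + [("easy", 0.9, False)] * 1
    + [("hard", 0.8, True)] * 4 + [("hard", 0.8, False)] * 6
)
gaps = overconfidence_by_group(records)
```

With these toy numbers the gap is roughly zero on easy items and roughly 0.4 on hard ones. The point of the split is that an aggregate score would average the two and hide where the miscalibration lives.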

What survives the three findings is a single shape. Human metacognition is patchy. It works where the cost of being wrong is low, and breaks where the cost is high.

Canonical experiments: Kruger & Dunning (1999); Rozenblit & Keil (2002); Lichtenstein, Fischhoff & Phillips (1982).


🤖 In machines

The empirical picture cuts against the intuitive one.

Pretrained base models, the version of a large language model before instruction-tuning, are calibrated to a useful degree on multiple-choice tasks. Kadavath et al. (2022), in “Language Models (Mostly) Know What They Know,” showed that a base model’s reported probability of being correct tracks its actual accuracy with reasonable fidelity. Ask such a model how confident it is, and the answer tracks how often it is actually right.
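For a multiple-choice question, one common way to read off a model's confidence is to take a softmax over the logits it assigns to the answer options. The sketch below assumes those per-option logits are available. The values are made up, and this is not necessarily the exact procedure used in the cited paper.

```python
import math


def option_probabilities(logits):
    """Softmax over per-option logits, shifted by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical logits for answer options A-D on a single question.
logits = {"A": 2.0, "B": 0.5, "C": 0.1, "D": -1.0}
probs = dict(zip(logits, option_probabilities(list(logits.values()))))

answer = max(probs, key=probs.get)  # the option the model would pick
confidence = probs[answer]          # its probability, read as confidence
```

Collecting `confidence` and correctness over many questions gives the pairs needed to check whether, say, answers given at 0.7 are right about 70 percent of the time.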

Then RLHF is applied. The model is fine-tuned on human preferences, thumbs-up and thumbs-down. The hedging answer gets thumbs-down because it sounds evasive. The confident answer gets thumbs-up because it sounds helpful. The gradient deletes hedging at the production layer.

The result: RLHF-tuned chat models sound more confident regardless of correctness. In the engineering sense, calibration is the alignment of expressed confidence with actual accuracy, and by that definition it is quite literally trained out. The model that survives the gradient is the model that sounds sure. Lin, Hilton & Evans (2022) on TruthfulQA and Tian et al. (2023) in “Just Ask for Calibration” report the same direction of effect under different probes.
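A toy illustration of the direction of the effect: if stated confidence is pushed toward certainty while the answers themselves stay the same, accuracy is unchanged and the gap between confidence and accuracy opens up. The sketch uses temperature sharpening purely as a stand-in for confidence inflation. It is an assumption for illustration, not a model of what RLHF does internally, and the data are hypothetical.

```python
def sharpen(p, temperature):
    """Rescale a binary confidence p; temperature < 1 pushes it toward 0 or 1."""
    a = p ** (1.0 / temperature)
    b = (1.0 - p) ** (1.0 / temperature)
    return a / (a + b)


# Hypothetical, initially calibrated answers: (confidence, was_correct).
answers = (
    [(0.6, True)] * 6 + [(0.6, False)] * 4
    + [(0.9, True)] * 9 + [(0.9, False)] * 1
)

# Accuracy depends only on the answers, which sharpening does not change.
accuracy = sum(ok for _, ok in answers) / len(answers)
mean_conf_before = sum(p for p, _ in answers) / len(answers)
mean_conf_after = sum(sharpen(p, 0.5) for p, _ in answers) / len(answers)
```

Before sharpening, mean confidence and accuracy are both 0.75. After sharpening with a temperature of 0.5, mean confidence rises to about 0.84 while accuracy stays at 0.75.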

This is the counterintuitive part. RLHF was supposed to make models safer and more helpful, and it does both by reasonable measures. But “helpful” measured by user preference includes “sounds sure of itself,” and certainty without correctness is exactly what calibration is supposed to prevent.

Canonical papers: Kadavath et al. (2022); Lin, Hilton & Evans (2022); Tian et al. (2023).


🤝 In hybrid systems — the emergent miscalibration loop

The two failure modes meet, and the meeting is not additive. It compounds.

Automation bias (Parasuraman & Manzey, 2010) is the documented tendency of humans to defer to automated systems even when their own judgment would have been correct. The mechanism is cognitive economy. If the machine has thought, the human does not need to. The literature predates LLMs by twenty years and is robust across cockpit automation, medical decision support, and clerical workflows.

LLMs add a new vector: fluency. A confidently written wrong answer is harder to override than a hesitant correct one, and humans treat eloquent prose as a signal of competence, not consciously but reliably. Vasconcelos et al. (2023) and adjacent work on AI-assisted programming document the effect: participants accepted more buggy code when the assistant phrased its suggestions with high confidence, and rated themselves as more competent on the post-task survey, even when their actual work was worse.

The compound: human deference plus machine overconfidence yields worse calibration than either side would produce alone. The deferring human, working with the confidence-trained model, lands in a regime where the AI’s miscalibration is fluently expressed and the human’s metacognition is shut off before it engages.

A second-order effect is now appearing in the literature on the expertise illusion. People who solved a problem with an LLM recall their own contribution as larger than it was and feel more knowledgeable about the underlying topic. Re-tested on the topic later, they perform no better than controls. The hybrid system has degraded not only the answer but the user’s calibration about whose answer it was.

Canonical studies: Parasuraman & Manzey (2010); Vasconcelos et al. (2023); ongoing work on the expertise illusion in LLM-assisted tasks.


↔ Where they converge

  • All three are systematic, not random.
  • All three are worst-calibrated on hard, novel, or out-of-distribution problems.
  • All three are inflated by surface fluency.
  • All three are invisible to the agent producing them without external audit.

⤨ Where they diverge

  • Human miscalibration is partly correctable through awareness, training, and effort. Premortems, red teams, and explicit calibration training reliably shift the distribution.
  • Machines have privileged access to their own logit distribution as a calibration substrate. Humans have no analogous internal signal. In deployed chat models, however, that substrate is bypassed at the layer the user sees, because RLHF rewrites the verbalization independently of the underlying probability.
  • The hybrid case is the only one in which miscalibration is emergent: worse than either component would produce alone. The other two are bounded by the agent’s own competence. The team can fail in ways neither half can fail alone.

The consequence is structural. Trust calibration is not a property of the AI. It is a property of the human-AI system. Benchmarks that measure model accuracy in isolation miss the failure mode. Benchmarks that measure model calibration in isolation also miss it. The thing to evaluate is whether the user’s confidence is well-calibrated while using the system, and the answer, empirically, is no.
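One way to evaluate the system rather than the model is to log, for each trial, whether the AI was right and whether the user accepted its answer, and then report how often wrong answers were accepted and correct ones rejected. The sketch below is a minimal version of that bookkeeping. The function name, the record shape, and the trial data are hypothetical.

```python
def reliance_rates(trials):
    """trials: list of (ai_correct, user_accepted) booleans.

    Returns (over_reliance, under_reliance):
      over_reliance  = share of wrong AI answers the user accepted
      under_reliance = share of correct AI answers the user rejected
    """
    accepted_when_wrong = [acc for ai_ok, acc in trials if not ai_ok]
    accepted_when_right = [acc for ai_ok, acc in trials if ai_ok]
    over = (
        sum(accepted_when_wrong) / len(accepted_when_wrong)
        if accepted_when_wrong else 0.0
    )
    under = (
        sum(not acc for acc in accepted_when_right) / len(accepted_when_right)
        if accepted_when_right else 0.0
    )
    return over, under


# Hypothetical session log: 10 correct AI answers, 4 wrong ones.
trials = (
    [(True, True)] * 8 + [(True, False)] * 2
    + [(False, True)] * 3 + [(False, False)] * 1
)
over_reliance, under_reliance = reliance_rates(trials)
```

In this toy log the user accepts 3 of the 4 wrong answers (0.75) and rejects 2 of the 10 correct ones (0.2). Neither number can be read off the model's accuracy or calibration alone, because both depend on what the user did with the output.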

This shifts what RLHF should optimize for. “User likes the answer” was a defensible early target. It is no longer obviously right. A model that says “I don’t know” in cases where the user cannot tell whether the model knows is more useful to the user than a model that confidently answers when the user cannot tell. But “I don’t know” gets thumbs-down. The economics of preference learning oppose the engineering of safe deployment.


🌀 Open question

Why is hybrid-system miscalibration worse than either part? Two hypotheses are alive in the literature, neither cleanly tested.

  1. Deference-on-weakness. Humans defer to AI specifically on questions where their own judgment is weakest, so the AI’s output gets weighted at exactly the points where its mistakes are most consequential. The failure is allocative. The fix would be to flag those points.

  2. Confidence override. AI fluency cues short-circuit human metacognition before it engages. The deferring is happening below the level at which the human has a choice about whether to defer. The fix would be to slow the fluency cue, not flag the question.

The two are not mutually exclusive, and no one has separated them. Whether the failure is allocative or perceptual matters for what an intervention looks like. (Open as of mid-2026.)


📡 Recent entries (auto-fed)

Week 2026-W21

2026-05-18 — A new arXiv submission introduces Calibration with Semantic Reward (CSR), which rewards semantic agreement among correct rollouts and discourages spurious consistency among incorrect ones, reporting lower ECE and higher AUROC than verbalized-confidence baselines on HotpotQA, TriviaQA, MSMARCO, and NQ-Open (arXiv:2605.15588).