Honest by default
Why a calibrated I am not sure can be worth more than a confident wrong answer, and how we treat honesty as a design constraint rather than an afterthought.
Most of the cost of a wrong answer is not the error itself. It is the confidence that came with it. A model that states something false in the same calm, fluent voice it uses for everything else asks you to do the checking it should have done. Over a day of work, that adds up.
We think the more useful default is honesty: a model that tells you when it is sure, tells you when it is not, and does not dress one up as the other.
Confident and wrong
A fluent answer with no signal of doubt. It reads as authoritative, so it gets trusted, so the error travels further before anyone catches it.
Calibrated and clear
An answer that carries its own confidence. When the model is unsure, it says so, names what it would need to be sure, and offers a way forward.
A calibrated "I am not sure, here is what I would check" is not a weaker answer. For real work it is often the stronger one, because it hands back something you can act on instead of something you have to verify.
Honesty is not the absence of mistakes. No model is free of them. Honesty is whether the model is straight with you about which parts to trust.
Honesty as a design constraint
It is tempting to treat this as a finishing touch, a tone applied at the end. We treat it as a constraint that shapes earlier decisions, because by the time a model has learned to sound certain about everything, that habit is hard to unlearn.
Reward the honest answer
During evaluation, a well-placed "I am not sure" should score better than a confident guess that happens to be wrong, not worse.
Make uncertainty legible
The model should be able to separate what it knows from what it is inferring, and to say which is which in plain language.
Ask before assuming
When a request is ambiguous, a short clarifying question is usually better than a confident answer to the wrong reading of it.
How we approach it
Calibration is not a single switch. It shows up across how a model is shaped, measured, and used, and each stage can undo the last if they are not aligned.
In the signal
We prefer training and tuning signals that do not punish appropriate hedging, so the model is not taught that confidence is always the safer bet.
In evaluation
We look at how often the model is confident and wrong, not only at how often it is right. Those are different questions, and the second one matters more for trust.
In the product
The interface should give uncertainty somewhere to live, so a careful answer is not flattened into the same shape as a certain one.
This is a direction, not a finished result. Calibration is hard, it varies by task and by language, and it can regress as a model changes. We measure it, we expect to keep getting it wrong in places, and we would rather say that plainly than imply a model that is never unsure.
The short version
We would rather build a model that is honest about its limits than one that hides them well. For work that matters, knowing where to look twice is most of the value.