Decision Calibration

Forecast reliability · diagnostics

Illustrative scenario

Sample analyst · real engine · real prices. A fictional sample analyst’s decisions, scored by the same calibration engine against real market prices — not a real analyst track record. Notice her highest-conviction calls resolved far below what she claimed (~87% claimed → ~29% realized): overconfidence that stays invisible until the engine scores it. This is what calibration looks like in Goldeneye — point it at your desk and it scores your analysts the same way.

How well do your convictions calibrate?

Reliability diagram across your logged journal entries. Perfect calibration sits on the diagonal: a 70% conviction band should resolve as hits 70% of the time. Bands below the diagonal are over-confident, bands above are under-confident.

Headline

All conviction buckets calibrate within 5 percentage points of their claimed level. Either your decision quality is on-pace or the sample size is still too small to detect drift.

Sample

Total entries8

Resolved6

Unresolved2

Reliability diagram1 buckets · diagonal = perfect calibration

Per-bucket detail

Bucket	Claimed mean	Total n	Resolved	Hits	Hit rate
0-20%	—	0	0	0	n=0 (need 3+)
20-40%	—	0	0	0	n=0 (need 3+)
40-60%	51%	4	4	2	50%
60-80%	63%	2	2	1	n=2 (need 3+)
80-100%	93%	2	0	0	n=0 (need 3+)

Model Health · how each model fails

Loading…

Calib err (reliability) is how far stated confidence sits from realized hit-rate — lower is better; Sharpness (resolution) is how much the model discriminates across its confidence levels — higher is better; Dir gap flags a one-sided edge. Descriptive, in-sample over the backtest window — not a forward forecast.

Desk Calibration · skill vs. luck

Loading…

Verdict is the test that separates skill from luck — and correctly refuses to call noise skill. We take each desk's directional hit-rate and ask whether its 95% confidence interval clears a coin flip (50%): Skill = the lower bound beats chance on this sample; Luck = not yet distinguishable from chance (a hot streak isn't proof); Insufficient = fewer than 10 resolved calls. The blind random desk lands on Luck by design — that's the test working. Calibration (Brier on stated conviction) separately measures whether confidence is reliable. Descriptive decision-quality diagnostics, not advice. How we validate →