Leaderboard
Generated 5/19/2026, 5:11:03 PM · 1 model · 27 questions · 7 hidden (partial coverage)
Per-model aggregate
Mean Cramér-log is the ranking metric: rows are sorted by it by default, and lower is better. Cramér-log is shape-sensitive and is scored against the verified ground-truth distribution. Only models that produced a valid score on all 27 bench questions appear here, so the comparison is apples-to-apples by construction. Med |bias| is shown as a forensic column: it tells you whether errors are systematically high or low, not how big they are. Click any column header to re-sort.
| Model | Search | Mean Cramér ↑ | Med \|bias\| | Output tokens |
|---|---|---|---|---|
| gpt-5.1-2025-11-13 | off | 0.2679 | 0.350 | 216,144 |
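
The ranking rule described above the table (keep only models with a valid score on every question, average the per-question scores, sort ascending) can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual code: the function name `rank_models`, the `scores` mapping, and the use of `None` to mark a missing or invalid score are all assumptions made for the example.

```python
from statistics import mean


def rank_models(scores, question_ids):
    """Rank models by mean per-question score, lowest (best) first.

    `scores` is a hypothetical mapping: model name -> {question id -> score}.
    A missing key or a value of None stands for "no valid score" (an assumption).
    """
    rows = []
    for model, per_question in scores.items():
        values = [per_question.get(q) for q in question_ids]
        # Drop any model without full coverage, so every mean is taken
        # over the same set of questions.
        if any(v is None for v in values):
            continue
        rows.append((model, mean(values)))
    # Lower is better, so sort ascending by the mean.
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    demo = {
        "model-a": {"q1": 0.25, "q2": 0.75},
        "model-b": {"q1": 0.05},  # missing q2, so it is excluded
        "model-c": {"q1": 0.0, "q2": 0.5},
    }
    for model, score in rank_models(demo, ["q1", "q2"]):
        print(f"{model}: {score:.4f}")
```

Excluding partially covered models, rather than averaging over whatever questions they answered, is what keeps the means comparable across rows.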
Per-question Cramér-log heatmap
Each cell is one model's Cramér-log on that question. Rows sorted alphabetically; columns in the leaderboard order above. Click a cell to open the per-question detail view.
< 0.05 tight · < 0.2 clean · < 0.7 productive · < 2 different interpretation · ≥ 2 suspect
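
As a rough illustration of how the legend maps a cell value to a band, the thresholds above can be written as a small lookup. The function name `cramer_log_band` is hypothetical, and treating each threshold as a strict upper bound is an assumption read off the `<` and `≥` signs in the legend.

```python
def cramer_log_band(score):
    """Map a per-question Cramér-log value to its legend band.

    Thresholds are copied from the heatmap legend; each band is assumed
    to cover values strictly below its threshold.
    """
    if score < 0.05:
        return "tight"
    if score < 0.2:
        return "clean"
    if score < 0.7:
        return "productive"
    if score < 2:
        return "different interpretation"
    return "suspect"


if __name__ == "__main__":
    for value in (0.01, 0.1, 0.3, 1.5, 2.0):
        print(value, cramer_log_band(value))
```

Under these thresholds a cell value of 0.3 falls in the "productive" band, and a value of exactly 2 is already "suspect".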