Sentinel · sentinel-fermi-bench

Leaderboard

Generated 5/19/2026, 5:11:03 PM · 1 model · 27 questions · 7 hidden (partial coverage)

Question set:

Per-model aggregate

Mean Cramér-log is the ranking metric: rows are sorted by Mean Cramér by default, and lower is better. Cramér is shape-sensitive and scored against the verified ground-truth distribution. Only models that produced a valid score on all 27 bench questions appear here, so the comparison is apples-to-apples by construction. Med |bias| is shown as a forensic column: it tells you whether errors are systematically high or low, not how big they are. Click any column header to re-sort.
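
For readers unfamiliar with the metric, here is a minimal sketch of one common form of the Cramér distance, computed on log10-transformed values. The page does not publish its exact formula, so the function name `cramer_log_distance`, the use of empirical samples, and the log10 transform are all assumptions for illustration, not the benchmark's actual scoring code.

```python
import math
from bisect import bisect_right


def cramer_log_distance(pred_samples, truth_samples):
    """Integral of the squared difference between two empirical CDFs,
    taken over log10(x). ASSUMED definition; the bench may differ."""
    # Move both sample sets into log10 space (inputs must be positive).
    p = sorted(math.log10(x) for x in pred_samples)
    t = sorted(math.log10(x) for x in truth_samples)

    # Empirical CDFs only change at sample points, so integrate
    # piecewise between consecutive distinct points.
    grid = sorted(set(p) | set(t))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        f = bisect_right(p, left) / len(p)  # predicted CDF on [left, right)
        g = bisect_right(t, left) / len(t)  # ground-truth CDF on [left, right)
        total += (f - g) ** 2 * (right - left)
    return total
```

Under this definition, identical sample sets score 0, and the score grows both when the predicted distribution sits orders of magnitude away from the truth and when its spread differs, which is what "shape-sensitive" suggests.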

Model               Search  Mean Cramér  Med |bias|  Output tokens
gpt-5.1-2025-11-13  off     0.2679       0.3502      16,144

Per-question Cramér-log heatmap

Each cell is one model's Cramér-log on that question. Rows sorted alphabetically; columns in the leaderboard order above. Click a cell to open the per-question detail view.

< 0.05 tight · < 0.2 clean · < 0.7 productive · < 2 different interpretation · ≥ 2 suspect
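
The legend thresholds can be read as a simple bucketing rule. The sketch below assumes each band starts where the previous one ends (for example, "clean" covers 0.05 up to but not including 0.2). The function name `cramer_bucket` is hypothetical and not part of the bench.

```python
def cramer_bucket(score):
    """Map a Cramér-log score to its heatmap legend label.
    Thresholds are taken from the legend; band edges are assumed
    to be half-open, i.e. [lower, upper)."""
    if score < 0.05:
        return "tight"
    if score < 0.2:
        return "clean"
    if score < 0.7:
        return "productive"
    if score < 2:
        return "different interpretation"
    return "suspect"
```

The legend is defined for per-question cells, but the same rule applied to an aggregate such as a mean gives a rough sense of where a model sits overall.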