Leaderboard

Models ranked on the fully open question set. Prompts and ground truth are public, so anyone can reproduce the runs.


Editions over time

v1 · 22q · 7/7/2026 · current

Edition v1 · cut 7/7/2026 · 17 models · 22 questions · 22 cells imputed (worst-of-others)

Per-model aggregate

Mean Cramér-log is the ranking metric (lower is better), scored against the verified ground-truth distribution. The Answered column shows how many of the 22 questions a model really answered, followed by the number of imputed cells ("imp"); a malformed or unscorable answer counts as unanswered.

#	Model	Mean Cramér	Answered	Ran
1	deepseek/deepseek-v4-pro	0.1663	22/22	7/7/2026
2	anthropic/claude-opus-4.8	0.2558	22/22	7/7/2026
3	google/gemini-3.1-pro-preview	0.2850	22/22	7/7/2026
4	openai/gpt-5.5	0.3064	22/22	7/7/2026
5	openai/gpt-5-mini	0.3701	21/22 · 1 imp	7/7/2026
6	z-ai/glm-5.2	0.3725	21/22 · 1 imp	7/7/2026
7	deepseek/deepseek-v3.2	0.3837	21/22 · 1 imp	7/7/2026
8	openai/o3	0.4386	21/22 · 1 imp	7/7/2026
9	anthropic/claude-sonnet-4.6	0.4593	22/22	7/7/2026
10	anthropic/claude-sonnet-5	0.6049	22/22	7/7/2026
11	openai/o3-mini	0.6374	22/22	7/7/2026
12	x-ai/grok-4.3	0.6519	22/22	7/7/2026
13	anthropic/claude-fable-5	0.6848	21/22 · 1 imp	7/8/2026
14	~anthropic/claude-fable-latest	0.7295	20/22 · 2 imp	7/8/2026
15	qwen/qwen3-235b-a22b-thinking-2507	0.8038	21/22 · 1 imp	7/7/2026
16	openai/gpt-5.5-pro	1.2231	9/22 · 13 imp	7/9/2026
17	meta-llama/llama-4-maverick	1.7806	21/22 · 1 imp	7/7/2026
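
The aggregate can be sketched as follows, assuming (the page does not state this explicitly) that the mean is a plain arithmetic mean over all 22 questions, imputed cells included. The names `mean_cramer` and `rank` are hypothetical. The input is the deepseek-v4-pro column of the per-question heatmap, which is rounded to three decimals, so the result only agrees with the published 0.1663 to about that precision.

```python
from statistics import fmean

# deepseek/deepseek-v4-pro per-question scores, copied from the heatmap
# (values there are rounded to three decimals).
DEEPSEEK_V4_PRO = [
    0.090, 0.446, 0.610, 0.032, 0.022, 0.139, 0.064, 0.539, 0.040, 0.348, 0.027,
    0.032, 0.043, 0.373, 0.056, 0.157, 0.156, 0.132, 0.003, 0.063, 0.034, 0.255,
]


def mean_cramer(scores):
    """Arithmetic mean over all questions (assumed aggregation rule)."""
    return fmean(scores)


def rank(means_by_model):
    """Return (model, mean) pairs sorted ascending, since lower is better."""
    return sorted(means_by_model.items(), key=lambda item: item[1])
```

Under that assumption, `mean_cramer(DEEPSEEK_V4_PRO)` lands within rounding distance of the first row of the table.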

Per-question Cramér-log heatmap

Each cell is one model's Cramér-log on that question. Imputed cells (the model didn't validly answer) show the worst-of-others penalty, marked †.

Question	deepseek-v4-pro	claude-opus-4.8	gemini-3.1-pro-preview	gpt-5.5	gpt-5-mini	glm-5.2	deepseek-v3.2	o3	claude-sonnet-4.6	claude-sonnet-5	o3-mini	grok-4.3	claude-fable-5	claude-fable-latest	qwen3-235b-a22b-thinking-2507	gpt-5.5-pro	llama-4-maverick
annual-concrete-grand-canyon-fraction	0.090	0.053	0.024	0.083	0.078	0.062	0.043	0.224	1.170	0.053	0.107	0.089	0.021🏆	0.032	0.050	0.055	0.742
bangladesh-tshirt-area-vs-factories	0.446	0.223	0.040	0.102	0.044	0.432	0.220	0.130	0.020🏆	0.099	0.647	1.032	0.400	0.211	0.118	1.032†	0.912
california-almond-water-per-dollar	0.610	0.051🏆	0.279	0.266	0.580	0.502	1.451†	0.157	1.179	0.257	1.451	0.395	0.243	0.329	0.531	1.451†	1.031
chaco-soybean-crossover	0.032	0.081	1.011	0.357	0.306	0.052	0.706	1.188	0.475	0.576	0.149	0.056	0.779	1.188†	0.008🏆	1.188†	0.881
china-concrete-three-gorges	0.022	0.056	0.013	0.140	0.067	0.219	0.048	0.741†	0.022	0.029	0.163	0.004🏆	0.066	0.096	0.138	0.087	0.741
electricity-vs-soy-per-worker	0.139	0.034🏆	0.765	0.573	1.337	0.181	0.775	0.153	1.423	0.991	0.969	0.283	0.857	0.846	0.735	0.325	0.054
germany-coal-to-solar-land-fraction	0.064🏆	1.446	0.561	0.263	0.762	2.917†	0.084	0.260	1.388	0.591	0.378	0.343	0.385	0.935	2.917	2.917†	2.741
honduras-costa-rica-homicide-gap	0.539🏆	0.890	0.727	0.561	1.163	1.083	1.029	0.957	1.093	0.597	1.273	0.813	0.938	0.934	0.572	0.619	1.273†
india-solid-fuel-to-lpg-import-years	0.040	0.245	0.034	0.124	0.037	0.244	0.020	0.059	0.086	0.434	1.503	0.059	0.039	0.001🏆	0.378	0.068	7.599
nigeria-korea-cereal-yields	0.348	0.298	0.025🏆	0.157	0.212	0.051	0.084	0.073	0.132	0.262	0.053	0.052	0.576	0.393	0.962	0.962†	0.173
ocean-fish-biomass	0.027🏆	0.205	0.901	1.106	0.066	0.028	0.205	0.753	0.581	4.820	0.154	1.943	8.238†	8.238†	8.238†	0.798	8.238
paraguay-electricity-per-co2	0.032	0.024	0.047	0.026	0.348	0.005🏆	0.062	0.080	0.053	0.111	0.899	0.087	0.022	0.025	0.059	1.532†	1.532
paraguay-itaipu-brazil-fraction	0.043	0.202	0.022	0.012🏆	0.119	0.065	0.756	0.566	0.092	0.690	1.020	0.841	0.871	0.749	0.245	1.020†	0.221
philippines-remittances-vs-timor-leste	0.373	0.093	0.183	0.080	0.159	0.141	0.119	0.080	0.100	0.054	0.161	0.120	0.119	0.113	0.076	0.048🏆	1.948
poland-air-quality-coal-miner-wage-years	0.056	0.050	0.099	0.264	0.140	0.041🏆	0.229	1.426	0.237	0.966	0.724	0.066	0.412	0.471	0.105	1.426†	0.850
saudi-desal-per-oil-barrel	0.157	0.171	0.095	0.117	0.532	0.447	0.174	0.532	0.551	0.287	0.721	0.424	0.056	0.045	0.025🏆	0.721†	0.319
saudi-oil-solar-replacement-multiple	0.156	0.029🏆	0.813	1.454	0.490	0.033	0.058	0.309	0.391	0.247	0.352	7.110	0.123	0.277	0.197	1.388	0.237
ssa-solar-ethiopia-gdp-years	0.132	1.003	0.148	0.015🏆	0.077	1.164	1.109	0.718	0.037	0.926	1.880	0.033	0.027	0.266	1.272	1.880†	1.109
tokyo-car-commute-lane-km	0.003🏆	0.274	0.010	0.735	1.063	0.260	1.140	0.896	0.396	1.263	0.315	0.499	0.453	0.483	0.108	1.263†	0.763
uruguay-share-coal-units-shutdown	0.063	0.028	0.298	0.054	0.035	0.129	0.035	0.005🏆	0.154	0.013	0.307	0.018	0.032	0.040	0.543	7.670†	7.670
vietnam-vs-myanmar-rice-per-worker	0.034	0.137	0.039	0.075	0.141	0.100	0.023	0.032	0.366	0.027	0.703	0.068	0.198	0.153	0.021🏆	0.073	0.130
world-paved-roads	0.255	0.035	0.136	0.174	0.385†	0.041	0.071	0.309	0.159	0.014	0.095	0.008🏆	0.209	0.222	0.385	0.385†	0.012
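
A minimal sketch of the worst-of-others imputation, assuming it means "replace each missing cell with the highest (worst) real score any other model got on the same question". That reading is consistent with the rows above, but the function name `impute_worst_of_others` and the use of `None` for an unanswered cell are illustrative assumptions, not the site's actual code.

```python
def impute_worst_of_others(row):
    """Fill missing cells (None) with the worst real score in the row.

    Returns (filled_row, imputed_models). Raises if nobody answered.
    """
    real = [score for score in row.values() if score is not None]
    if not real:
        raise ValueError("no real answers to impute from")
    penalty = max(real)  # higher is worse, so the maximum is the worst score
    filled = {m: (penalty if s is None else s) for m, s in row.items()}
    imputed = [m for m, s in row.items() if s is None]
    return filled, imputed


# Subset of the world-paved-roads row; None marks cells shown with a dagger.
row = {
    "deepseek-v4-pro": 0.255,
    "gpt-5-mini": None,
    "qwen3-235b-a22b-thinking-2507": 0.385,
    "gpt-5.5-pro": None,
    "llama-4-maverick": 0.012,
}
filled, imputed = impute_worst_of_others(row)
```

Here both missing cells receive 0.385, the worst real score in the subset, which matches the daggered values in that row.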

< 0.05 tight · < 0.2 clean · < 0.7 productive · < 2 different interpretation · ≥ 2 suspect · † imputed (worst-of-others; not a real answer)
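
The legend's thresholds can be read as a simple lookup. The sketch below assumes half-open bins, so a score exactly on a boundary falls into the higher bucket, as the strict "<" signs suggest. The helper name `bucket` is hypothetical.

```python
from bisect import bisect_right

# Upper bounds of the first four bins; anything at or above 2.0 is "suspect".
_UPPER_BOUNDS = [0.05, 0.2, 0.7, 2.0]
_LABELS = ["tight", "clean", "productive", "different interpretation", "suspect"]


def bucket(score):
    """Map a Cramér-log score to its legend label."""
    return _LABELS[bisect_right(_UPPER_BOUNDS, score)]
```

For example, `bucket(0.003)` gives "tight" and `bucket(8.238)` gives "suspect". Imputed cells would need a separate flag, since the dagger marks where a value came from rather than how large it is.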