Transfer eval

Science Olympiad Fermi — order-of-magnitude eval

We evaluate modern LLMs on a curated subset of the Open Science Olympiad Fermi questions. Each answer is scored only on its order of magnitude (the power of 10 nearest the true value), the way the Science Olympiad event scores it. We compare against TextQL's May-2025 eval.

Generated 5/22/2026, 12:55:08 AM · Open Science Olympiad Fermi (curated subset)

Practice site · Repo (open-scioly-fermi) · TextQL blog (May 2025)

How the metrics work

Strict — exact order-of-magnitude match.
Practical — within ±1 order of magnitude.
Reliable — within ±3 (i.e. didn't hallucinate wildly).
“OOM err” = model's exponent minus the gold exponent.
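
The definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the eval's actual code: the function names `oom_err` and `tiers` are hypothetical, exponents are assumed to be integers already, and a parse failure (represented here as `None`) is assumed to count as a miss on every tier.

```python
from typing import Dict, Optional


def oom_err(model_exp: Optional[int], gold_exp: int) -> Optional[int]:
    """Signed order-of-magnitude error: model exponent minus gold exponent.

    Returns None when the model gave no parseable answer (assumption).
    """
    if model_exp is None:
        return None
    return model_exp - gold_exp


def tiers(err: Optional[int]) -> Dict[str, bool]:
    """Map a signed OOM error onto the three tiers (Strict, Practical, Reliable)."""
    if err is None:
        # Assumption: an unparseable answer fails all three tiers.
        return {"strict": False, "practical": False, "reliable": False}
    magnitude = abs(err)
    return {
        "strict": magnitude == 0,     # exact order-of-magnitude match
        "practical": magnitude <= 1,  # within ±1
        "reliable": magnitude <= 3,   # within ±3
    }


# Example: gold is 10^6, model answered 10^8, so the OOM err is +2.
err = oom_err(8, 6)
result = tiers(err)
```

The tiers are nested: any answer that passes Strict also passes Practical and Reliable, which is why the three percentages can only increase from left to right in the results table.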

⚠ Comparison caveat & how to review

Our subset is randomly sampled from the noisy full set; TextQL hand-curated theirs. So raw numbers aren't perfectly comparable: some misses are bad questions, not model errors. Expand a row and tag it: model wrong / gold wrong / ambiguous / compute-error. Always write a note; the notes are what we mine for the new benchmark.

Results vs TextQL (May 2025)

Model	Strict	Practical (±1)	Reliable (±3)	n
gpt-5.5-2026-04-23 (ours)	60.0%	88.5%	95.5%	200
Claude 3.7 Sonnet (TextQL)	61.3%	93.7%	98.1%	158
GPT-4o (TextQL)	56.5%	86.9%	95.6%	158

mean signed OOM error +0.231 (≈0 = unbiased) · 1 parse failure · error bands: exact:120 · ±1:57 · ±2–3:14 · >3 (hallucination):8 · no answer:1
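
As a cross-check, the three headline percentages for our run can be recomputed from the error-band counts. The sketch below assumes the single unanswered question stays in the denominator as a miss, which is the reading consistent with the reported numbers; the dictionary keys and the function name `headline_rates` are made up for illustration.

```python
from typing import Dict


def headline_rates(bands: Dict[str, int]) -> Dict[str, float]:
    """Turn error-band counts into cumulative Strict/Practical/Reliable percentages."""
    n = sum(bands.values())  # includes unanswered questions (assumption)
    strict = bands["exact"]
    practical = strict + bands["within_1"]
    reliable = practical + bands["within_2_3"]
    return {
        "n": n,
        "strict": 100 * strict / n,
        "practical": 100 * practical / n,
        "reliable": 100 * reliable / n,
    }


# Band counts as reported for gpt-5.5-2026-04-23 (ours).
bands = {"exact": 120, "within_1": 57, "within_2_3": 14, "beyond_3": 8, "no_answer": 1}
rates = headline_rates(bands)
# strict = 120/200, practical = 177/200, reliable = 191/200
```

With these counts the bands sum to 200 and reproduce the 60.0% / 88.5% / 95.5% row in the table.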

Different subsets (ours randomly sampled, TextQL hand-curated) — read as ballpark, not a head-to-head. See the review caveat above.
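
One way to see why the table is only a ballpark comparison is to put a sampling interval around our own Strict score. The sketch below uses the standard Wilson score interval for a binomial proportion; it assumes the 200 questions behave like independent draws, and it does nothing about the larger issue that the two subsets differ. The function name `wilson_interval` is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Our Strict score: 120 exact matches out of 200 questions.
low, high = wilson_interval(120, 200)
```

Under those assumptions the interval spans roughly 53% to 67%, so both TextQL Strict figures (61.3% and 56.5%) fall inside the sampling noise of our own number.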

Reviewer · 0 reviewed

Verdicts save automatically in this browser. When done, set your name, click Download, and send the file to Jorge to merge.

200 questions

qid	question	gold 10ⁿ	model 10ⁿ	OOM err	tier