Sentinel · sentinel-fermi-bench

Science Olympiad Fermi — order-of-magnitude eval

We evaluate modern LLMs on a curated subset of the Open Science Olympiad Fermi questions. Each answer is scored only on its order of magnitude (the power of 10 nearest the true value), the way the Science Olympiad event scores it. We compare against TextQL's May-2025 eval.

Generated 5/22/2026, 12:55:08 AM · Open Science Olympiad Fermi (curated subset)

How the metrics work

  • Strict — exact order-of-magnitude match.
  • Practical — within ±1 order of magnitude.
  • Reliable — within ±3 (i.e. didn't hallucinate wildly).
  • “OOM err” = model's exponent minus the gold exponent.
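
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: the function names are hypothetical, and it assumes "nearest power of 10" means rounding the base-10 logarithm (nearest in log space), which the page does not spell out.

```python
import math


def order_of_magnitude(value: float) -> int:
    """Exponent n such that 10**n is the power of 10 nearest to value.

    Assumption: "nearest" is measured in log space, i.e. round(log10(value)).
    """
    if value <= 0:
        raise ValueError("Fermi answers are assumed to be positive")
    return round(math.log10(value))


def oom_error(model_exp: int, gold_exp: int) -> int:
    """Signed OOM error: model's exponent minus the gold exponent."""
    return model_exp - gold_exp


def tiers(err: int) -> dict:
    """Which of the three metrics a single answer satisfies."""
    size = abs(err)
    return {
        "strict": size == 0,     # exact order-of-magnitude match
        "practical": size <= 1,  # within ±1
        "reliable": size <= 3,   # within ±3
    }


# Example: gold value 3e8 (exponent 8), model answer 2e10 (exponent 10).
err = oom_error(order_of_magnitude(2e10), order_of_magnitude(3e8))
print(err, tiers(err))
```

Under these assumptions the example is off by +2, so it counts as Reliable but neither Strict nor Practical. The tiers are nested: every Strict hit is also Practical and Reliable.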

⚠ Comparison caveat & how to review

Our subset is randomly sampled from the noisy full set; TextQL hand-curated theirs. So raw numbers aren't perfectly comparable: some misses are bad questions, not model errors. Expand a row and tag it: model wrong / gold wrong / ambiguous / compute-error. Always write a note; the notes are what we mine for the new benchmark.

Results vs TextQL (May 2025)

Model                        Strict   Practical (±1)   Reliable (±3)   n
gpt-5.5-2026-04-23 (ours)    60.0%    88.5%            95.5%           200
Claude 3.7 Sonnet (TextQL)   61.3%    93.7%            98.1%           158
GPT-4o (TextQL)              56.5%    86.9%            95.6%           158

mean signed OOM error +0.231 (≈0 = unbiased) · 1 parse failure · error bands: exact:120 · ±1:57 · ±2–3:14 · >3 (hallucination):8 · no answer:1
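
As a sanity check, the headline percentages can be rebuilt from the error-band counts in the summary line. The sketch below assumes the counts belong to the n = 200 run (gpt-5.5-2026-04-23) and that the single "no answer" stays in the denominator as a miss; both assumptions are inferred from the totals rather than stated on the page.

```python
# Error-band counts copied from the summary line (assumed to describe the n = 200 run).
bands = {"exact": 120, "±1": 57, "±2–3": 14, ">3": 8, "no answer": 1}

# Assumption: every question, including the unanswered one, counts toward n.
n = sum(bands.values())

# The three metrics are cumulative: each wider band includes the narrower ones.
strict = bands["exact"]
practical = strict + bands["±1"]
reliable = practical + bands["±2–3"]


def pct(count: int, total: int) -> str:
    """Format a count as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


summary = {
    "Strict": pct(strict, n),
    "Practical (±1)": pct(practical, n),
    "Reliable (±3)": pct(reliable, n),
}
print(n, summary)
```

With these counts the script gives n = 200 and 60.0% / 88.5% / 95.5%, matching the gpt-5.5-2026-04-23 row of the results table.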

Different subsets (ours randomly sampled, TextQL hand-curated) — read as ballpark, not a head-to-head. See the review caveat above.

0 reviewed

Verdicts save automatically in this browser. When done, set your name, click Download, and send the file to Jorge to merge.

200 questions
qid · question · gold 10ⁿ · model 10ⁿ · OOM err · tier