REALFP — modern LLMs on the 2021 Fermi benchmark
We give a frontier LLM each question from AllenAI's 2021 REALFP benchmark (a set of real-world Fermi estimation problems) and score how close its single-number answer lands to the dataset's “gold” answer. The original paper used a fine-tuned T5 model with no prompt; we instead zero-shot prompt a modern model and ask it to show its reasoning.
Generated 5/21/2026, 8:22:17 PM · AllenAI REALFP test split (Kalyan et al. 2021, arXiv:2110.14207)
How the score works
fp_score = max(0, 1 - (1/3)|log10(A'/A)|), where A' is the model's answer and A is the gold answer
- 1.000 = exact match
- loses 1/3 for every 10× off (one order of magnitude)
- 0.000 = off by 1000× or more
- “OOM err” = orders of magnitude the model is above (+) or below (−) the gold
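The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the report's actual scoring code: the function names `oom_err` and `fp_score` and the argument names `pred` and `gold` are assumptions, and treating a missing or non-positive answer as 0.000 follows the "no usable answer" rule stated under Overall score.

```python
import math

def oom_err(pred, gold):
    """Signed orders of magnitude: positive if pred is above gold, negative if below."""
    return math.log10(pred / gold)

def fp_score(pred, gold):
    """fp_score = max(0, 1 - (1/3)|log10(pred/gold)|)."""
    # Assumption: no usable (missing or non-positive) answer scores 0.
    if pred is None or pred <= 0 or gold <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(oom_err(pred, gold)) / 3.0)

exact = fp_score(100, 100)        # exact match
one_oom = fp_score(1000, 100)     # 10x too high: loses 1/3
far_off = fp_score(1e6, 100)      # 10,000x too high: clamped to 0
below = oom_err(10, 1000)         # model is 2 orders of magnitude below gold
```

Because the error is taken in log space, being 10× too high and 10× too low cost the same.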
⚠ How to review (read this first)
A low score does not always mean the model failed. REALFP's gold answers are sometimes wrong: we have already found cases where the model is right and the gold is broken (e.g. a physics question solved with the wrong formula).
For each low-scoring row, expand it and decide:
- Model wrong: its reasoning has a real error.
- Gold wrong: the gold answer or its facts are implausible (check the gold decomposition shown on the right).
- Question ambiguous: under-specified, with multiple defensible answers. These are candidates to rewrite for our bench.
- Compute/structure error: the approach is sound but the result is off due to arithmetic, decomposition structure, or a formatting/unit mismatch (model and gold are basically the same value but one is mis-formatted, so the score looks wrong). Also the catch-all for any failure the others don't capture; such cases must be explained in a note.
⚠ Always write a short note explaining anything interesting or needed — what went wrong, why a gold looks broken, why a question is ambiguous. The notes are the point: they're what we mine to build the new benchmark.
Overall score
The model's average score across all 557 questions. Each question is scored 0–1: 1.000 = exact answer, 0.000 = off by 1000× or more (or no usable answer).
median 0.737 — the middle question (half score higher, half lower). If it's much higher than the average, a few badly-scored questions are dragging the average down.
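To illustrate why the median can sit well above the average, here is a small sketch. The scores are hypothetical and not taken from the REALFP run: five good answers plus two complete misses.

```python
from statistics import mean, median

# Hypothetical per-question scores: mostly good, with two total misses.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.0, 0.0]

avg = mean(scores)    # pulled down by the two zeros
mid = median(scores)  # the middle score, unaffected by how bad the misses are
```

Here the median is 0.80 while the average is roughly 0.61, which is the pattern the text describes.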
557 of 557 questions scored.
For scale: on this exact task in 2021, the best fine-tuned model (T5) scored 0.21, a fixed constant guess 0.22, and a regression model 0.32. So anything well above ~0.3 already beats everything from 2021.
By difficulty
Same score, grouped by how many steps the gold solution uses (more steps = harder).
- deep (5+): 0.602 (n=49)
- medium (3-4): 0.599 (n=122)
- shallow (<=2): 0.608 (n=386)
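The grouping above can be sketched as follows. This is an assumed reconstruction: `depth_bucket` and `mean_by_bucket` are hypothetical names, the step count is taken to be the number of steps in the gold solution, and the sample rows are invented for illustration.

```python
from collections import defaultdict

def depth_bucket(n_steps):
    """Map the number of gold-solution steps to a difficulty label."""
    if n_steps <= 2:
        return "shallow (<=2)"
    if n_steps <= 4:
        return "medium (3-4)"
    return "deep (5+)"

def mean_by_bucket(rows):
    """rows: iterable of (n_steps, score) pairs; returns the mean score per bucket."""
    groups = defaultdict(list)
    for n_steps, score in rows:
        groups[depth_bucket(n_steps)].append(score)
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}

# Hypothetical rows, not real benchmark data.
rows = [(1, 0.9), (2, 0.5), (3, 0.6), (6, 0.3)]
by_depth = mean_by_bucket(rows)
```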
By gold magnitude
Same score, grouped by how big the correct answer is (e.g. tiny fractions vs. huge counts).
- 1e0–1e3: 0.671 (n=156)
- 1e3–1e6: 0.659 (n=146)
- 1e6–1e9: 0.618 (n=97)
- 1e9+: 0.530 (n=105)
- <1 (sub-unit): 0.391 (n=53)
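As a sanity check, the bucket means above can be recombined into an overall average by weighting each by its question count. The sketch below also shows one plausible way to assign a gold answer to a bucket. The name `magnitude_bucket` is hypothetical, and putting a boundary value such as exactly 1e3 in the upper bucket is an assumption, since the report does not say which side it falls on.

```python
def magnitude_bucket(gold):
    """Assign a gold answer to a magnitude bucket (boundary handling is assumed)."""
    if gold < 1:
        return "<1 (sub-unit)"
    if gold < 1e3:
        return "1e0–1e3"
    if gold < 1e6:
        return "1e3–1e6"
    if gold < 1e9:
        return "1e6–1e9"
    return "1e9+"

# (mean score, question count) per bucket, as reported above.
reported = {
    "1e0–1e3": (0.671, 156),
    "1e3–1e6": (0.659, 146),
    "1e6–1e9": (0.618, 97),
    "1e9+": (0.530, 105),
    "<1 (sub-unit)": (0.391, 53),
}

total_n = sum(n for _, n in reported.values())
# Count-weighted mean of the bucket means; about 0.605, approximate
# because the bucket means are rounded to three decimals.
weighted_mean = sum(score * n for score, n in reported.values()) / total_n
```

The counts add up to 557, matching the number of questions scored.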
Verdicts save automatically in this browser. When done, set your name, click Download, and send the file to Jorge to merge.