Transfer eval

REALFP — modern LLMs on the 2021 Fermi benchmark

We give a frontier LLM each question from AllenAI's 2021 REALFP benchmark (a set of real-world Fermi estimation problems) and score how close its single-number answer lands to the dataset's “gold” answer. The original paper used a fine-tuned T5 model with no prompt; we instead zero-shot prompt a modern model and ask it to show its reasoning.

Generated 5/21/2026, 8:22:17 PM · AllenAI REALFP test split (Kalyan et al. 2021, arXiv:2110.14207)

Paper (arXiv:2110.14207) · Original repo (allenai/fermi) · Test split (test_realfp.json) · r/estimation

How the score works

fp_score = max(0, 1 - (1/3)|log10(A'/A)|), where A' is the model's answer and A is the gold answer

1.000 = exact match
loses 1/3 for every 10× off (one order of magnitude)
0.000 = off by 1000× or more
“OOM err” = orders of magnitude the model is above (+) or below (−) the gold
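
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the report's actual scoring code: the function names `fp_score` and `oom_err` and the handling of non-positive inputs are assumptions made for the example.

```python
import math


def oom_err(pred: float, gold: float) -> float:
    """Signed orders of magnitude: positive if the model is above the gold."""
    return math.log10(pred / gold)


def fp_score(pred: float, gold: float) -> float:
    """max(0, 1 - (1/3) * |log10(pred / gold)|)."""
    if pred <= 0 or gold <= 0:
        # Assumption: a non-positive value counts as "no usable answer".
        return 0.0
    return max(0.0, 1.0 - abs(oom_err(pred, gold)) / 3.0)


if __name__ == "__main__":
    # Hypothetical values: gold = 100, model answers at various distances.
    for pred in (100, 1_000, 10, 100_000):
        print(pred, round(fp_score(pred, 100), 3), round(oom_err(pred, 100), 2))
```

Because the score uses the absolute value of the log ratio, overshooting and undershooting by the same factor are penalized equally; only the “OOM err” column keeps the sign.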

⚠ How to review (read this first)

A low score does not always mean the model failed. REALFP's gold answers are sometimes wrong: we have already found cases where the model is right and the gold is broken (e.g. a physics question solved with the wrong formula).

For each low-scoring row, expand it and decide:

Model wrong — its reasoning has a real error.
Gold wrong — the gold answer / its facts are implausible (check the gold decomposition shown on the right).
Question ambiguous — under-specified, multiple defensible answers. These are candidates to rewrite for our bench.
Compute/structure error — approach is sound but the result is off due to arithmetic, decomposition structure, or a formatting / unit mismatch (model and gold are basically the same value but one is mis-formatted, so the score looks wrong). This is also the catch-all for any failure the other verdicts don't capture, and it must be explained in a note.

⚠ Always write a short note explaining anything interesting or needed — what went wrong, why a gold looks broken, why a question is ambiguous. The notes are the point: they're what we mine to build the new benchmark.

Overall score

0.606 average

The model's average score across all 557 questions. Each question is scored 0–1: 1.000 = exact answer, 0.000 = off by 1000× or more (or no usable answer).

median 0.737 — the middle question (half score higher, half lower). If it's much higher than the average, a few badly-scored questions are dragging the average down.

557 of 557 questions scored.
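
To see why a median well above the average points to a few very low scores, here is a small sketch using Python's standard `statistics` module. The score list is hypothetical and is not taken from this run; `mean_median_gap` is a name introduced only for the example.

```python
from statistics import mean, median


def mean_median_gap(scores):
    """Return (mean, median, median - mean) for a list of 0-1 scores."""
    avg = mean(scores)
    mid = median(scores)
    return avg, mid, mid - avg


if __name__ == "__main__":
    # Hypothetical scores: four decent answers and two complete misses.
    scores = [0.90, 0.85, 0.80, 0.75, 0.00, 0.00]
    avg, mid, gap = mean_median_gap(scores)
    print(f"mean={avg:.3f} median={mid:.3f} gap={gap:.3f}")
```

In this report the gap is 0.737 - 0.606 = 0.131, which is the same pattern on a smaller scale: most questions score above the average, and a tail of near-zero scores pulls the average down.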

For scale: on this exact task in 2021, the best fine-tuned model (T5) scored 0.21, a fixed constant guess 0.22, and a regression model 0.32. So anything well above ~0.3 already beats everything from 2021.

By difficulty

Same score, grouped by how many steps the gold solution uses (more steps = harder).

deep (5+): 0.602 (n=49)
medium (3-4): 0.599 (n=122)
shallow (<=2): 0.608 (n=386)
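
A grouping like the one above could be computed roughly as follows. This is a sketch under assumptions: the input is taken to be `(n_steps, score)` pairs, where `n_steps` is the number of steps in the gold solution, and the helper names are hypothetical. Only the bucket labels and thresholds come from the report.

```python
from collections import defaultdict
from statistics import mean


def depth_bucket(n_steps: int) -> str:
    """Map the gold solution's step count to the report's difficulty labels."""
    if n_steps <= 2:
        return "shallow (<=2)"
    if n_steps <= 4:
        return "medium (3-4)"
    return "deep (5+)"


def mean_score_by_depth(rows):
    """rows: iterable of (n_steps, score). Returns {label: (mean_score, n)}."""
    groups = defaultdict(list)
    for n_steps, score in rows:
        groups[depth_bucket(n_steps)].append(score)
    return {label: (mean(vals), len(vals)) for label, vals in groups.items()}


if __name__ == "__main__":
    # Hypothetical rows, not taken from the REALFP run.
    rows = [(1, 0.9), (2, 0.5), (3, 0.6), (6, 0.3)]
    print(mean_score_by_depth(rows))
```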

By gold magnitude

Same score, grouped by how big the correct answer is (e.g. tiny fractions vs. huge counts).

<1 (sub-unit): 0.391 (n=53)
1e0–1e3: 0.671 (n=156)
1e3–1e6: 0.659 (n=146)
1e6–1e9: 0.618 (n=97)
1e9+: 0.530 (n=105)
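
The magnitude buckets can be assigned with a simple threshold check on the gold value. The sketch below assumes each bucket includes its lower bound and excludes its upper bound; the report does not state how boundary values are handled, and `magnitude_bucket` is a hypothetical helper name.

```python
def magnitude_bucket(gold: float) -> str:
    """Map a gold answer to the report's magnitude labels.

    Assumption: lower bounds are inclusive, upper bounds exclusive.
    """
    if gold < 1:
        return "<1 (sub-unit)"
    if gold < 1e3:
        return "1e0–1e3"
    if gold < 1e6:
        return "1e3–1e6"
    if gold < 1e9:
        return "1e6–1e9"
    return "1e9+"


if __name__ == "__main__":
    # Hypothetical gold values, one per bucket.
    for gold in (0.02, 40, 5_000, 2.5e7, 3e11):
        print(gold, "->", magnitude_bucket(gold))
```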

Reviewer · 0 reviewed

Verdicts save automatically in this browser. When done, set your name, click Download, and send the file to Jorge to merge.

557 questions

qid | question | gold | model | OOM err | score