Project Peak
Varies: trials per step (depth), 2 levels

gpt-5-nano · sampling depth

Same model and test, two sampling depths. A worked example of how the score sharpens (a tighter confidence interval) when you run more trials per step, and of how the confidence tiers and promotion work.

Score by level

Each level holds the model and the test fixed. Only trials per step (depth) changes. A higher score is better, and a tighter confidence interval is better still.

#1 · depth 16 (standard) · ceiling found
Score: 33.6/100 (CI 29.6 to 36.6)
Ladder: H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains H6 · 8K · breaks H8 · 16K · moderate

#2 · depth 8 (sketch) · ceiling found
Score: 29.6/100 (CI 27.6 to 33.6)
Ladder: H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains H6 · 8K · breaks H8 · 16K · moderate

Analysis generated by anthropic/claude-sonnet-4.6 · v1

Does running more trials change what we learned about gpt-5-nano?

Short version: not really. Going from 8 trials per step to 16 moved gpt-5-nano's score from 29.6 to 33.6. That is a 4-point change, and both point estimates fall inside the region where the two runs' 95% confidence intervals overlap (29.6 to 33.6). The limit itself stayed put. Both depths found the same last passing step (H6 at 8K) and the same first failing step (H8 at 16K).

What changed between the two runs

The only thing we varied was the number of trials per step: 8 in the "sketch" run, 16 in the "standard" run. The model (gpt-5-nano), the task, the ladder, and the scoring all stayed the same. So this is a clean test of whether sampling harder changes the answer.

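| Run      | Trials per step | Score | 95% CI       | Last passing step | First failing step |
|----------|-----------------|-------|--------------|-------------------|--------------------|
| Sketch   | 8               | 29.6  | 27.6 to 33.6 | H6 at 8K          | H8 at 16K          |
| Standard | 16              | 33.6  | 29.6 to 36.6 | H6 at 8K          | H8 at 16K          |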

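The point estimates are 4 points apart, but the intervals overlap heavily: the shared region runs from 29.6 to 33.6. The sketch run's upper bound (33.6) lands exactly on the standard run's point estimate, and the standard run's lower bound (29.6) lands exactly on the sketch run's point estimate. We can't call the difference real. Note also that the standard run's interval is not narrower in this pair: it spans 7.0 points against the sketch run's 6.0.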
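The overlap reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the scores and intervals reported in the table; the function name `interval_overlap` is made up for illustration, and treating overlapping intervals as "no clear difference" is a rough heuristic, not a formal significance test.

```python
def interval_overlap(a, b):
    """Return the (low, high) region shared by two closed intervals, or None."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


# Values copied from the table: point estimate and 95% CI per run.
runs = {
    "sketch": {"score": 29.6, "ci": (27.6, 33.6)},
    "standard": {"score": 33.6, "ci": (29.6, 36.6)},
}

overlap = interval_overlap(runs["sketch"]["ci"], runs["standard"]["ci"])
gap = round(runs["standard"]["score"] - runs["sketch"]["score"], 1)
widths = {name: round(run["ci"][1] - run["ci"][0], 1) for name, run in runs.items()}

# True when every point estimate lies inside the shared region.
both_inside = all(overlap[0] <= run["score"] <= overlap[1] for run in runs.values())

print("overlap:", overlap)
print("gap between point estimates:", gap)
print("interval widths:", widths)
print("both estimates inside overlap:", both_inside)
```

Both point estimates sit on the edges of the shared region, which is why the 4-point gap cannot be separated from sampling noise on this evidence alone.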

What the extra trials actually bought

At the H8 step (16K), the pass rate went from 0.75 with 8 trials to 0.875 with 16. Neither clears the 0.90 bar, so the failing step is the same either way, but the extra data gave a clearer read on how close the model is to passing it. At the H10 step (32K), the pass rate dropped from 0.375 to 0.1875, which is what you expect on a step where the model sits well below the bar. More trials pulled the estimate toward a lower and probably more honest number.
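To make "near the decision line" concrete, here is a hedged sketch that puts an interval around each step-level pass rate. Two things are assumptions and not taken from the study: the pass counts (6 of 8, 14 of 16, 3 of 8, 3 of 16) are inferred from the reported rates and trial counts, and the Wilson score interval is a stand-in, since the source does not say how its intervals are computed.

```python
import math

PASS_BAR = 0.90  # bar quoted in the analysis


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


# (step, run) -> (passes, trials). Counts are INFERRED from the reported
# pass rates (0.75, 0.875, 0.375, 0.1875); the source does not list them.
counts = {
    ("H8", "sketch"): (6, 8),
    ("H8", "standard"): (14, 16),
    ("H10", "sketch"): (3, 8),
    ("H10", "standard"): (3, 16),
}

results = {}
for (step, run), (passes, trials) in counts.items():
    low, high = wilson_interval(passes, trials)
    results[(step, run)] = (low, high)
    verdict = "straddles bar" if low < PASS_BAR < high else "clear of bar"
    print(f"{step} {run:8s} rate={passes / trials:.4f} "
          f"CI=({low:.3f}, {high:.3f}) {verdict}")
```

Under these assumptions, the H8 interval contains 0.90 at both depths, while the H10 interval sits entirely below it at both depths. That matches the reading above: H8 is the step where extra trials are worth paying for, and H10 is already decided.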

So what

If you just need a fast, cheap read on where a model stops, the sketch run already gets the structure right at half the cost. Spend the extra trials when a step is sitting near the decision line and you need a tighter interval to call it, which is the case for H8 here. Either way the verdict was the same: the limit is real and it is low.

Method note

Scores come from a ladder where each step raises the number of reasoning hops and, on most steps, the context length as well (H4 and H6 share 8K in the ladder listed for each level). Tasks are synthetic and built to resist contamination. This study only looks at how sampling depth affects the precision of the estimate, not whether the limit would move under a different task or a different ladder.
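As an illustration of how a last passing step and a first failing step fall out of a ladder like this one, here is a minimal sketch. The walk-and-stop rule, the name `find_ceiling`, and the choice to count a rate of exactly 0.90 as passing are assumptions; the source does not describe its procedure. The pass rates for H4 and H6 are not reported, so the 1.0 values below are placeholders that only encode "passed".

```python
PASS_BAR = 0.90  # bar quoted in the analysis

# (hops, context size in K) in the order the ladder is listed on the page.
LADDER = [(4, 8), (6, 8), (8, 16), (10, 32), (40, 524), (52, 786), (64, 1049)]


def find_ceiling(rates, bar=PASS_BAR):
    """Walk the ladder in order and return (last_passing, first_failing).

    Assumed rule: a step passes when its rate is >= bar; the walk stops at
    the first failing step or at the first step with no measurement.
    """
    last_passing = None
    for step in LADDER:
        rate = rates.get(step)
        if rate is None:
            return last_passing, None
        if rate < bar:
            return last_passing, step
        last_passing = step
    return last_passing, None


def with_placeholders(h8_rate, h10_rate):
    """Build a rate table; H4 and H6 use PLACEHOLDER 1.0 (not reported)."""
    return {(4, 8): 1.0, (6, 8): 1.0, (8, 16): h8_rate, (10, 32): h10_rate}


sketch_rates = with_placeholders(0.75, 0.375)      # reported for 8 trials
standard_rates = with_placeholders(0.875, 0.1875)  # reported for 16 trials

for name, rates in [("sketch", sketch_rates), ("standard", standard_rates)]:
    sustains, breaks = find_ceiling(rates)
    print(f"{name}: sustains H{sustains[0]} at {sustains[1]}K, "
          f"breaks H{breaks[0]} at {breaks[1]}K")
```

With the reported H8 and H10 rates, both depths land on the same pair of steps, which is the structural result the study reports.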