Varies: trials per step (depth), 2 levels

gpt-5-nano · sampling depth

Same model and test, two sampling depths. A worked example of what running more trials per step does to the score and its confidence interval, and of how the confidence tiers and promotion work.

Score by level

Each level holds the model and the test fixed. Only trials per step (depth) changes. A higher score is better, and a tighter confidence interval is better still.

depth 16 (standard)

Ceiling found. Score 33.6/100, CI 29.6 to 36.6.

Ladder axis: 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K

Sustains H6 · 8K. Breaks H8 · 16K. Confidence: moderate.

depth 8 (sketch)

Ceiling found. Score 29.6/100, CI 27.6 to 33.6.

Ladder axis: 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K

Sustains H6 · 8K. Breaks H8 · 16K. Confidence: moderate.

Analysis generated by anthropic/claude-sonnet-4.6 · v1

Does running more trials change what we learned about gpt-5-nano?

Short version: not really. Going from 8 trials per step to 16 moved gpt-5-nano's score from 29.6 to 33.6. That is a 4-point change, but each run's point estimate falls on the edge of the other run's confidence interval, so the two scores are hard to tell apart. The limit itself stayed put: both depths found the same last passing step (H6 at 8K) and the same first failing step (H8 at 16K).

What changed between the two runs

The only thing we varied was the number of trials per step: 8 in the "sketch" run, 16 in the "standard" run. The model (gpt-5-nano), the task, the ladder, and the scoring all stayed the same. So this is a clean test of whether sampling harder changes the answer.

Run	Trials per step	Score	95% CI	Last passing step	First failing step
Sketch	8	29.6	27.6 to 33.6	H6 at 8K	H8 at 16K
Standard	16	33.6	29.6 to 36.6	H6 at 8K	H8 at 16K

The point estimates are 4 points apart, but the intervals overlap from 29.6 to 33.6. The sketch run's upper bound (33.6) lands exactly on the standard run's point estimate, and the standard run's lower bound (29.6) lands exactly on the sketch run's. We can't call the difference real.
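The overlap argument can be checked mechanically. The sketch below is illustrative only: the helper names `interval_overlap` and `contains` are made up for this example, and the numbers are copied from the table above.

```python
def interval_overlap(a, b):
    """Return the overlapping part of two (low, high) intervals, or None."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


def contains(interval, x):
    """True if x lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= x <= high


# Values copied from the table above.
sketch = {"score": 29.6, "ci": (27.6, 33.6)}     # 8 trials per step
standard = {"score": 33.6, "ci": (29.6, 36.6)}   # 16 trials per step

overlap = interval_overlap(sketch["ci"], standard["ci"])

# Does each point estimate fall inside the other run's interval?
sketch_in_standard = contains(standard["ci"], sketch["score"])
standard_in_sketch = contains(sketch["ci"], standard["score"])

# Interval widths, rounded to the one decimal the report uses.
sketch_width = round(sketch["ci"][1] - sketch["ci"][0], 1)
standard_width = round(standard["ci"][1] - standard["ci"][0], 1)

print("overlap:", overlap)
print("sketch score inside standard CI:", sketch_in_standard)
print("standard score inside sketch CI:", standard_in_sketch)
print("widths (sketch, standard):", sketch_width, standard_width)
```

On the reported numbers each point estimate sits inside the other run's interval. The check also shows the depth-16 interval is 7.0 points wide against 6.0 at depth 8, so the score interval itself did not narrow in this pair of runs.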

What the extra trials actually bought

At H8 at 16K the pass rate went from 0.75 (6 of 8) to 0.875 (14 of 16) with more trials. Neither clears the 0.90 bar, so the failing step is the same either way, but the extra data gave a clearer read on how close the model is to passing it. At H10 at 32K the pass rate dropped from 0.375 (3 of 8) to 0.1875 (3 of 16), which is the kind of movement you expect on a step well beyond the model's limit. More trials pulled the estimate toward a lower and probably more honest number.
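To see why H8 is the step worth extra trials, it helps to put an interval around each pass rate. The sketch below uses a Wilson score interval, which is an assumption made for illustration: the report does not say how its intervals are computed. Pass counts are recovered from the reported rates and trial counts, and `wilson_interval` is a helper written for this example.

```python
import math

PASS_BAR = 0.90  # pass threshold quoted in the text


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# (step, trials per step, reported pass rate), all taken from the text.
observations = [
    ("H8 at 16K", 8, 0.75),
    ("H8 at 16K", 16, 0.875),
    ("H10 at 32K", 8, 0.375),
    ("H10 at 32K", 16, 0.1875),
]

results = []
for step, trials, rate in observations:
    passes = round(rate * trials)           # e.g. 0.75 * 8 = 6 passes
    low, high = wilson_interval(passes, trials)
    straddles_bar = low < PASS_BAR < high   # is the bar inside the interval?
    results.append((step, trials, passes, low, high, straddles_bar))
    print(f"{step}, {trials} trials: {passes}/{trials}, "
          f"interval {low:.3f} to {high:.3f}, straddles bar: {straddles_bar}")
```

Under this assumed interval, H8 straddles the 0.90 bar at both depths, though the interval is narrower at 16 trials, while H10 sits below the bar at both depths. That matches the reading above: H8 is the step near the decision line, and H10 is not.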

So what

If you just need a fast, cheap read on where a model stops, the sketch run already gets the structure right at half the cost. Spend the extra trials when a step is sitting near the decision line and you need a tighter interval to call it, which is the case for H8 here. Either way the verdict was the same: the limit is real and it is low.

Method note

Scores come from a ladder where each step raises both the number of reasoning hops and the context length at once. Tasks are synthetic and built to resist contamination. This study only looks at how sampling depth affects the precision of the estimate, not whether the limit would move under a different task or a different ladder.
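The way a ladder turns pass rates into a last passing step and a first failing step can be sketched as follows. This is not the study's implementation: `Step` and `find_ceiling` are hypothetical names, treating a rate equal to the bar as a pass is an assumption, and the H6 pass rate is a placeholder, because the report only says that H6 passed and gives no rate for it. The H8 and H10 rates are the reported ones.

```python
from dataclasses import dataclass

PASS_BAR = 0.90  # pass threshold quoted in the analysis


@dataclass(frozen=True)
class Step:
    hops: int        # number of reasoning hops
    context_k: int   # context length in thousands of tokens

    def label(self):
        return f"H{self.hops} at {self.context_k}K"


def find_ceiling(results, bar=PASS_BAR):
    """Walk an ordered ladder of (Step, pass_rate) pairs.

    Returns (last_passing, first_failing). Assumes a rate equal to the
    bar counts as a pass; the report does not specify this.
    """
    last_passing = None
    for step, rate in results:
        if rate >= bar:
            last_passing = step
        else:
            return last_passing, step
    return last_passing, None


H6_RATE_PLACEHOLDER = 1.0  # hypothetical: the report gives no rate for H6

sketch_run = [
    (Step(6, 8), H6_RATE_PLACEHOLDER),
    (Step(8, 16), 0.75),
    (Step(10, 32), 0.375),
]
standard_run = [
    (Step(6, 8), H6_RATE_PLACEHOLDER),
    (Step(8, 16), 0.875),
    (Step(10, 32), 0.1875),
]

for name, run in (("sketch", sketch_run), ("standard", standard_run)):
    passing, failing = find_ceiling(run)
    print(f"{name}: sustains {passing.label()}, breaks {failing.label()}")
```

With any passing rate for H6, both runs give the same pair of steps, which is the structural result the sketch run already got right.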