gpt-5-nano · sampling depth
Same model and test, two sampling depths. A worked example of how the score sharpens (a tighter confidence interval) when you run more trials per step, and of how the confidence tiers and promotion work.
Score by level
Each level holds the model and the test fixed. Only trials per step (depth) changes. A higher score is better, and a tighter confidence interval is better still.
Does running more trials change what we learned about gpt-5-nano?
Short version: not really. Going from 8 trials per step to 16 moved gpt-5-nano's score from 29.6 to 33.6. That 4-point change falls within the range where the two runs' 95% confidence intervals overlap (29.6 to 33.6), so on this evidence it cannot be separated from sampling noise. The limit itself stayed put: both depths found the same last passing step (H6 at 8K) and the same first failing step (H8 at 16K).
What changed between the two runs
The only thing we varied was the number of trials per step: 8 in the "sketch" run, 16 in the "standard" run. The model (gpt-5-nano), the task, the ladder, and the scoring all stayed the same. So this is a clean test of whether sampling harder changes the answer.
| Run | Trials per step | Score | 95% CI | Last passing step | First failing step |
|---|---|---|---|---|---|
| Sketch | 8 | 29.6 | 27.6 to 33.6 | H6 at 8K | H8 at 16K |
| Standard | 16 | 33.6 | 29.6 to 36.6 | H6 at 8K | H8 at 16K |
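The point estimates are 4 points apart, but the intervals share the range from 29.6 to 33.6. The sketch run's upper bound (33.6) lands exactly on the standard run's point estimate, and the standard run's lower bound (29.6) lands exactly on the sketch run's point estimate. With that much overlap, we can't call the difference real.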
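The overlap reasoning above can be checked mechanically. The snippet below is a minimal sketch using only the numbers in the table; the `Run` type and the helper names are illustrative and not part of any scoring tool. Overlapping intervals are a rough screen, not a formal significance test.

```python
from typing import NamedTuple, Optional, Tuple


class Run(NamedTuple):
    name: str
    score: float
    ci_low: float
    ci_high: float


def overlap(a: Run, b: Run) -> Optional[Tuple[float, float]]:
    """Return the range shared by both intervals, or None if they are disjoint."""
    low = max(a.ci_low, b.ci_low)
    high = min(a.ci_high, b.ci_high)
    return (low, high) if low <= high else None


def inside(score: float, run: Run) -> bool:
    """True if a point estimate falls within another run's interval (bounds included)."""
    return run.ci_low <= score <= run.ci_high


# Values copied from the table above.
sketch = Run("Sketch", 29.6, 27.6, 33.6)
standard = Run("Standard", 33.6, 29.6, 36.6)

shared = overlap(sketch, standard)
print("shared range:", shared)
print("standard score inside sketch CI:", inside(standard.score, sketch))
print("sketch score inside standard CI:", inside(sketch.score, standard))
print("score gap:", round(standard.score - sketch.score, 1))
```

Each point estimate sits on the boundary of the other run's interval, which is why the check counts the bounds as inside.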
What the extra trials actually bought
At the H8 at 16K step, the pass rate went from 0.75 (6 of 8 trials) to 0.875 (14 of 16) with more trials. Neither clears the 0.90 bar, so the failing step is the same either way, but the extra data gave a clearer read on how close the model is to passing it. At the H10 at 32K step, the pass rate dropped from 0.375 (3 of 8) to 0.1875 (3 of 16), which is what you expect on a step where the model is well below the bar. More trials pulled the estimate toward a lower and probably more honest number.
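To see how much a per-step pass rate can be trusted at each depth, one option is a binomial interval around it. The sketch below uses the Wilson score interval, which is an assumption made here for illustration: the report does not say how its own intervals are computed. The pass counts are the ones implied by the reported rates and trial counts, and `wilson_interval` and `classify` are hypothetical helper names.

```python
import math
from typing import Tuple

BAR = 0.90  # pass bar quoted in the text


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def classify(low: float, high: float, bar: float = BAR) -> str:
    """Say where an interval sits relative to the pass bar."""
    if low >= bar:
        return "above the bar"
    if high < bar:
        return "below the bar"
    return "straddles the bar"


# (passes, trials) pairs implied by the reported pass rates.
steps = {
    "H8 at 16K": [(6, 8), (14, 16)],
    "H10 at 32K": [(3, 8), (3, 16)],
}

for step, runs in steps.items():
    for passes, trials in runs:
        low, high = wilson_interval(passes, trials)
        print(f"{step}: {passes}/{trials} = {passes / trials:.4f}, "
              f"interval {low:.2f} to {high:.2f}, {classify(low, high)}")
```

Under this assumption, the H8 interval still straddles 0.90 at both depths but narrows with 16 trials, while the H10 interval sits entirely below the bar at both depths.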
So what
If you just need a fast, cheap read on where a model stops, the sketch run already gets the structure right at half the cost. Spend the extra trials when a step is sitting near the decision line and you need a tighter interval to call it, which is the case for H8 here. Either way the verdict was the same: the limit is real and it is low.
Method note
Scores come from a ladder where each step raises both the number of reasoning hops and the context length at once. Tasks are synthetic and built to resist contamination. This study only looks at how sampling depth affects the precision of the estimate, not whether the limit would move under a different task or a different ladder.
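As a rough illustration of how a limit can be read off such a ladder, the sketch below walks the steps in order and stops at the first one whose pass rate misses the 0.90 bar. Several things here are assumptions: that "clearing the bar" means a pass rate of at least 0.90, that the scan stops at the first failure, and that the ladder has only the three steps named in this report. The report does not give a pass rate for H6 at 8K, so the value used for it is a placeholder that merely clears the bar.

```python
from typing import Dict, List, Optional, Tuple

PASS_BAR = 0.90  # bar quoted in the report

# Only the steps named in the report; the real ladder may have more.
LADDER: List[str] = ["H6 at 8K", "H8 at 16K", "H10 at 32K"]


def find_limit(
    ladder: List[str], pass_rates: Dict[str, float], bar: float = PASS_BAR
) -> Tuple[Optional[str], Optional[str]]:
    """Return (last passing step, first failing step), scanning the ladder in order."""
    last_passing: Optional[str] = None
    for step in ladder:
        if pass_rates[step] >= bar:
            last_passing = step
        else:
            return last_passing, step
    return last_passing, None  # no step failed


# H6 value is a PLACEHOLDER (not reported); the other rates come from the report.
sketch_rates = {"H6 at 8K": 1.0, "H8 at 16K": 0.75, "H10 at 32K": 0.375}
standard_rates = {"H6 at 8K": 1.0, "H8 at 16K": 0.875, "H10 at 32K": 0.1875}

print("sketch:", find_limit(LADDER, sketch_rates))
print("standard:", find_limit(LADDER, standard_rates))
```

Both depths return the same pair of steps, which matches the report's observation that the limit did not move.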