#1
gpt-5-nano
ceiling found
33.6 /100
95% CI 29.6 to 36.6
H4
8K
H6
8K
H8
16K
H10
32K
H40
524K
H52
786K
H64
1049K
- Sustains: H6 · 8K
- Breaks at: H8 · 16K
- Decline begins: 33.6
- Falloff: moderate
We run each model up a ladder on which the task gets harder and the context gets longer at the same time. A model's score is the hardest step it still gets right about 90% of the time. Every model is graded relative to its own performance on the easy steps, so the number reflects the model rather than how hard we made the test.
Each model is scored by the hardest step it holds on the ladder. The bars show how often the model passed at each step we tested, with the easy steps on the left and the hard ones on the right. The two dashed lines mark 90% and 80% reliability. A higher score is better, and a slower drop past the top step is better still.
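The rule of "hardest step still held at about 90%, graded against the easy steps" can be sketched in a few lines. This is an illustration under stated assumptions, not the leaderboard's actual scoring code: the function name `find_ceiling`, the choice of the first two steps as the "easy" baseline, and the division of each pass rate by that baseline are all hypothetical. The pass rates in the example are made up for demonstration and are not measured results for any model. The sketch does not attempt to reproduce the 0 to 100 score itself.

```python
from typing import Optional, Sequence, Tuple


def find_ceiling(
    steps: Sequence[str],
    pass_rates: Sequence[float],
    n_easy: int = 2,          # assumption: the first n_easy steps define the baseline
    threshold: float = 0.90,  # the 90% reliability line
) -> Tuple[Optional[str], Optional[str]]:
    """Return (hardest step sustained, first step that breaks).

    Steps must be ordered from easiest to hardest. Each pass rate is
    divided by the mean pass rate on the easy steps, so a model is
    judged relative to its own easy-step performance (an assumed
    normalization, not a documented one).
    """
    if len(steps) != len(pass_rates):
        raise ValueError("steps and pass_rates must have the same length")
    if not 0 < n_easy <= len(steps):
        raise ValueError("n_easy must be between 1 and the number of steps")

    baseline = sum(pass_rates[:n_easy]) / n_easy
    if baseline <= 0:
        # The model fails even the easy steps, so nothing is sustained.
        return None, steps[0]

    sustained: Optional[str] = None
    for step, rate in zip(steps, pass_rates):
        if rate / baseline >= threshold:
            sustained = step
        else:
            # First step below the reliability line: the ceiling is found.
            return sustained, step
    # Every tested step held, so no ceiling was found on this ladder.
    return sustained, None


# Step labels taken from the card; pass rates are HYPOTHETICAL.
steps = ["H4 · 8K", "H6 · 8K", "H8 · 16K", "H10 · 32K",
         "H40 · 524K", "H52 · 786K", "H64 · 1049K"]
rates = [0.98, 0.96, 0.80, 0.70, 0.40, 0.30, 0.20]

sustained, breaks_at = find_ceiling(steps, rates)
print(f"Sustains: {sustained}; breaks at: {breaks_at}")
```

With these invented rates the sketch reports H6 · 8K as the last sustained step and H8 · 16K as the break, which mirrors the shape of the summary above. Stopping at the first failing step is a design choice here: a later step that happens to pass again would not raise the ceiling.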