Project Peak
Project Peak model rankings

How well a model holds up as its context fills with code.

We run each model up a ladder where the task gets harder and the context gets longer at the same time. Its score is the hardest step it still gets right about 90% of the time. Every model is graded against how it does on the easy steps, so the number says something about the model and not about how hard we made the test.
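As a rough illustration of the scoring idea above, here is a minimal sketch of picking the hardest step a model still holds. It assumes that "holds" means every step up to and including that one passes at or above the threshold. The page does not spell out the exact rule, so treat this as one plausible reading. The function name `hardest_step_held` and the pass rates are hypothetical and are not measured results.

```python
from typing import Optional, Sequence, Tuple

# One rung of the ladder: (step label, context length, pass rate in [0, 1]).
Rung = Tuple[str, str, float]


def hardest_step_held(
    ladder: Sequence[Rung], threshold: float = 0.90
) -> Optional[Rung]:
    """Return the last rung before the first one that drops below threshold.

    Assumes `ladder` is ordered from easiest to hardest. Returns None when
    even the first rung is below the threshold.
    """
    held = None
    for rung in ladder:
        if rung[2] < threshold:
            break  # first failure ends the climb
        held = rung
    return held


# Illustrative pass rates only; these are NOT results from the leaderboard.
example = [
    ("H4", "8K", 0.98),
    ("H6", "8K", 0.92),
    ("H8", "16K", 0.71),
    ("H10", "32K", 0.40),
]

print(hardest_step_held(example))
```

Stopping at the first failure, instead of taking the hardest step that happens to clear the bar, keeps a lucky pass on a late step from inflating the result.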

gpt-5-nano · score 33.6 /100 (CI 29.6 to 36.6) · hardest step it holds: H6 · 8K

Model rankings

Each model is scored by the hardest step it holds on the ladder. The bars show how often it passed at every step we tested, with the easy ones on the left and the hard ones on the right. The two dashed lines mark 90% and 80% reliability. A higher score is better, and a slower drop past the top step is better still.
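To make the chart description concrete, the sketch below sorts each step's pass rate into a band relative to the two dashed lines and prints a text bar for it. The band names (`holds`, `marginal`, `fails`), the helper names, and the pass rates are assumptions chosen for illustration. The page itself only defines the 90% and 80% lines.

```python
from typing import List, Sequence, Tuple


def band(pass_rate: float, high: float = 0.90, low: float = 0.80) -> str:
    """Place a pass rate relative to the 90% and 80% reference lines."""
    if pass_rate >= high:
        return "holds"
    if pass_rate >= low:
        return "marginal"
    return "fails"


def render(ladder: Sequence[Tuple[str, float]], width: int = 20) -> List[str]:
    """Build one text bar per step, easiest first."""
    lines = []
    for label, rate in ladder:
        bar = "#" * round(rate * width)
        lines.append(f"{label:<4} {bar:<{width}} {rate:>4.0%} {band(rate)}")
    return lines


# Illustrative pass rates only; these are NOT results from the leaderboard.
for line in render([("H4", 0.98), ("H6", 0.92), ("H8", 0.85), ("H10", 0.40)]):
    print(line)
```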

#1 · gpt-5-nano · ceiling found · 33.6 /100 (95% CI 29.6 to 36.6)
[Bar chart: pass rate at each step tested, easy to hard] H4 · 8K, H6 · 8K, H8 · 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K
Sustains: H6 · 8K
Breaks at: H8 · 16K
Decline begins: 33.6
Falloff: moderate