Project Peak model rankings

How well a model holds up as its context fills with code.

Name: Project Peak: model capability ceiling for long-context code reasoning
Creator: Project Peak
License: https://creativecommons.org/licenses/by/4.0/

We run each model up a ladder where the task gets harder and the context gets longer at the same time. A model's score is the hardest step it still gets right about 90% of the time. Every model is graded against how it does on the easy steps, so the number says something about the model and not about how hard we made the test.
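As a rough illustration of the scoring rule above, here is a minimal sketch in Python. It assumes that "graded against the easy steps" means dividing each step's pass rate by the mean pass rate of the first few steps, and that the ceiling is the last step before the first one that falls under 90%. The function name `find_ceiling`, the `n_easy` parameter, and the pass rates in the example are all hypothetical and not taken from Project Peak.

```python
from typing import Optional, Sequence, Tuple

Step = Tuple[str, float]  # (step label, raw pass rate in [0, 1])


def find_ceiling(
    steps: Sequence[Step],
    n_easy: int = 2,          # assumption: how many leading steps count as "easy"
    threshold: float = 0.90,  # the "about 90% of the time" line
) -> Tuple[Optional[str], Optional[str]]:
    """Return (sustains, breaks_at) for a ladder ordered from easy to hard.

    Each pass rate is divided by the model's own easy-step baseline, so the
    result reflects the drop relative to the easy steps.
    """
    if not steps or n_easy < 1:
        raise ValueError("need at least one step and n_easy >= 1")

    easy_rates = [rate for _, rate in steps[:n_easy]]
    baseline = sum(easy_rates) / len(easy_rates)
    if baseline <= 0:
        # The model fails even the easy steps, so it holds nothing.
        return None, steps[0][0]

    sustains: Optional[str] = None
    for label, rate in steps:
        if rate / baseline >= threshold:
            sustains = label
        else:
            return sustains, label  # first step under the threshold
    return sustains, None  # never broke within the tested range


# Made-up pass rates, for illustration only.
example = [
    ("H2 · 2K", 1.00),
    ("H4 · 4K", 0.98),
    ("H6 · 8K", 0.95),
    ("H8 · 16K", 0.70),
    ("H10 · 32K", 0.40),
]
sustains, breaks_at = find_ceiling(example)
print(f"Sustains: {sustains}")
print(f"Breaks at: {breaks_at}")
```

Stopping at the first failing step, instead of taking the hardest passing step anywhere on the ladder, is a design choice in this sketch: it keeps a lucky pass on a late step from inflating the ceiling.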

gpt-5-nano · score 33.6 /100 (CI 29.6 to 36.6)

Hardest step it holds: H6 · 8K

Leaderboard

Model rankings

Each model is scored by the hardest step it holds on the ladder. The bars show how often it passed at every step we tested, with the easy ones on the left and the hard ones on the right. The two dashed lines mark 90% and 80% reliability. A higher score is better, and a slower drop past the top step is better still.
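The way the bars read against the two dashed lines can be sketched in a few lines of Python. This is only an illustration: `render_bar`, `band`, and the sample pass rates are hypothetical, and the real chart may be drawn differently.

```python
def render_bar(rate: float, width: int = 20) -> str:
    """Draw a pass rate in [0, 1] as a fixed-width text bar."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    filled = round(rate * width)
    return "#" * filled + "." * (width - filled)


def band(rate: float, high: float = 0.90, low: float = 0.80) -> str:
    """Say where a pass rate sits relative to the 90% and 80% lines."""
    if rate >= high:
        return "at or above 90%"
    if rate >= low:
        return "between 80% and 90%"
    return "below 80%"


# Made-up pass rates, easy steps first, for illustration only.
sample = [("H6 · 8K", 0.95), ("H8 · 16K", 0.85), ("H10 · 32K", 0.40)]
for label, rate in sample:
    print(f"{label:<10} {render_bar(rate)} {band(rate)}")
```

A model with a slow falloff would show several steps in the middle band before dropping below 80%, while a sharp falloff would jump straight from the top band to the bottom one.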

gpt-5-nano · ceiling found · 33.6 /100 · 95% CI 29.6 to 36.6

[Chart: pass rate at each ladder step, easy to hard. Axis labels: 16K, H10 · 32K, H40 · 524K, H52 · 786K, H64 · 1049K.]

Sustains: H6 · 8K
Breaks at: H8 · 16K
Decline begins: 33.6
Falloff: moderate