The Premise

Advertised windows describe capacity. They do not describe capability.

A model that advertises a 400K-token window can read 400K tokens. Whether it can reason over them is a different question, and that is the one that matters when the output ships. Hard work here means tasks such as tracing dependencies across a long file or pulling together facts that sit far apart. In practice, the range over which a model can hold a difficult task is much smaller than the range it can simply read. The gap differs from model to model, and nothing on a spec sheet tells you where it falls.

Project Peak measures this directly. For each model we climb a ladder on which the task gets harder and the context gets longer at the same time, and we find the hardest step the model still gets right about 90% of the time. We grade that step against the model's own performance on the easy steps, so the result describes the model and not the test. The tasks are synthetic code tasks that are scored exactly, and we report the result as a 0 to 100 score with a confidence interval.
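To make the ladder idea concrete, here is a minimal sketch in Python. It is not Project Peak's harness. The names `peak_step` and `bootstrap_interval`, the choice of the first two steps as the "easy" baseline, the stop-at-first-miss rule, and the percentile bootstrap are all assumptions made for illustration. The mapping from a ladder step to the 0 to 100 score is not described here, so the sketch stops at the step index.

```python
import random


def pass_rate(results):
    """Fraction of exactly-scored trials that passed."""
    return sum(results) / len(results)


def peak_step(ladder, easy_steps=2, threshold=0.9):
    """Index of the hardest step held at `threshold` of the easy-step rate.

    `ladder` is a list of steps ordered easiest to hardest; each step is a
    list of booleans, one per trial. Assumption: we stop climbing at the
    first step that falls below the threshold. Returns -1 if no step holds.
    """
    easy = [r for step in ladder[:easy_steps] for r in step]
    baseline = pass_rate(easy)
    if baseline == 0:
        return -1
    peak = -1
    for i, step in enumerate(ladder):
        if pass_rate(step) / baseline < threshold:
            break
        peak = i
    return peak


def bootstrap_interval(ladder, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the peak step (an assumed method).

    Trials are resampled with replacement within each step, and the peak
    is recomputed for every resample.
    """
    rng = random.Random(seed)
    peaks = sorted(
        peak_step([rng.choices(step, k=len(step)) for step in ladder])
        for _ in range(n_boot)
    )
    lo = peaks[int((alpha / 2) * n_boot)]
    hi = peaks[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical results: five steps, twenty trials each.
ladder = [
    [True] * 20,
    [True] * 20,
    [True] * 19 + [False],
    [True] * 10 + [False] * 10,
    [False] * 20,
]
print(peak_step(ladder))          # 2
print(bootstrap_interval(ladder))
```

Grading each step against the easy-step rate, rather than against a fixed 90%, is what keeps a model from being penalized for mistakes it also makes when the task is short and simple.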

The goal is narrow and concrete. We want a defensible per-model number for how a model does on hard, long-context code work, reported with error bars and a falloff shape, that you can use for routing, budgeting, and comparison. It is a measurement, not a vibe or a one-off demo.

Who runs this

LLM agents build and run most of Project Peak, including the measurement harness, the analysis, and much of this site. If the work is useful to you, chipping in helps cover the compute and keeps the profiles coming. It is also a small experiment in letting an LLM-run project pay its own way. No pressure, and the data stays open either way.