Project Peak
Learn

Reading the score

Project Peak reports one headline number per model: how hard a code task it still handles as the context grows. This page explains what that number means and how to use it. It covers how to read the results, not how the tasks are built.

The score (0 to 100)

The score is the hardest step a model still gets right about 90% of the time, put on a 0 to 100 scale. A higher score means the model holds up on a harder task. Each model is graded relative to its own performance on the easy steps, so the score reflects the model rather than how hard the test is in the abstract.
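As a rough sketch of that definition (not the actual scoring code), the Python below takes per-step pass rates, easiest first, divides each by the model's own easy-step average, and finds the hardest step that still clears 90% of that baseline. The function name sketch_score, the use of the first two steps as the easy baseline, and the linear mapping of step position onto 0 to 100 are all assumptions made for illustration. This page does not say how Project Peak normalizes or scales the score.

```python
def sketch_score(pass_rates, easy_steps=2, threshold=0.9):
    """Toy score from per-step pass rates, easiest step first.

    Assumptions (not from Project Peak): the baseline is the mean of the
    first `easy_steps` rates, and step position maps linearly onto 0..100.
    """
    if len(pass_rates) < 2:
        raise ValueError("need at least two ladder steps")
    easy = pass_rates[:easy_steps]
    baseline = sum(easy) / len(easy)
    if baseline == 0:
        return 0.0

    hardest = -1
    for i, rate in enumerate(pass_rates):
        if rate / baseline >= threshold:
            hardest = i
        else:
            break  # stop at the first step that falls below the threshold
    if hardest < 0:
        return 0.0
    return 100.0 * hardest / (len(pass_rates) - 1)


# Made-up pass rates for a six-step ladder.
print(sketch_score([1.0, 1.0, 0.96, 0.92, 0.6, 0.2]))  # 60.0 under these assumptions
```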

The coupled ladder

Difficulty rises along a ladder where two things climb at once: the number of reasoning steps the task needs and the context length it lives in (for example, from 6 steps at 8K tokens up to 64 steps at 1M tokens). Coupling them is deliberate. Real long-context work is hard because you have to reason across more material, not just read more of it.
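To make the coupling concrete, here is a small sketch that builds a ladder between the two example endpoints above. The geometric spacing, the number of rungs, and the reading of 8K and 1M as 8,000 and 1,000,000 tokens are assumptions for illustration. The real ladder may be spaced differently.

```python
def coupled_ladder(rungs=6, steps=(6, 64), tokens=(8_000, 1_000_000)):
    """Return (reasoning_steps, context_tokens) pairs, easiest rung first.

    Both quantities grow together along an assumed geometric schedule.
    """
    if rungs < 2:
        raise ValueError("a ladder needs at least two rungs")
    ladder = []
    for i in range(rungs):
        t = i / (rungs - 1)  # 0.0 at the first rung, 1.0 at the last
        n_steps = round(steps[0] * (steps[1] / steps[0]) ** t)
        n_tokens = round(tokens[0] * (tokens[1] / tokens[0]) ** t)
        ladder.append((n_steps, n_tokens))
    return ladder


for n_steps, n_tokens in coupled_ladder():
    print(f"{n_steps:>3} steps at {n_tokens:>9,} tokens")
```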

The confidence interval

Every score carries a bootstrap confidence interval: the band from the 10th to the 90th percentile around it, which makes it an 80% interval. A wide band means we should sample more. A tight one means the estimate has settled. That is the lever behind the three tiers: moving from sketch to standard to rigorous buys a tighter band by running more trials per step.
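The sketch below shows a generic percentile bootstrap of this kind, applied to a single pass rate rather than to the full score. The function name bootstrap_band and the trial data are made up for illustration, and Project Peak's own resampling may differ in detail.

```python
import random


def bootstrap_band(outcomes, statistic, resamples=1000, lo=10, hi=90, seed=0):
    """Percentile bootstrap: resample outcomes with replacement, recompute
    the statistic each time, and read off the lo-th and hi-th percentiles."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        statistic([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(resamples)
    )

    def pct(p):
        return stats[min(resamples - 1, int(p / 100 * resamples))]

    return pct(lo), pct(hi)


def pass_rate(xs):
    return sum(xs) / len(xs)


# Made-up trial outcomes: 1 = pass, 0 = fail, both at a 90% pass rate.
few_trials = [1] * 45 + [0] * 5
many_trials = few_trials * 10

print("50 trials: ", bootstrap_band(few_trials, pass_rate))
print("500 trials:", bootstrap_band(many_trials, pass_rate))
```

Running it with ten times as many trials at the same pass rate gives a visibly narrower band, which is the trade the tiers make.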

Falloff shape

Two models with the same score can fail differently. Three measures describe how. "Decline begins" marks the point where reliability first dips below the model's own plateau. "Half-life" marks where it has fallen halfway. The slope says whether the drop is a sharp cliff or a gentle decay. Use the shape to size your margin.
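One way to read these three measures off a reliability curve is sketched below. The exact definitions are assumptions for illustration: the plateau is taken as the mean of the first two steps, a dip counts once it exceeds a small tolerance, halfway is taken to mean half of the plateau value, and the slope is the average change per step from the last plateau step to the half-life step. The function name falloff_shape is hypothetical.

```python
def falloff_shape(reliability, plateau_steps=2, tolerance=0.05):
    """Summarize a per-step reliability curve (easiest step first)."""
    head = reliability[:plateau_steps]
    plateau = sum(head) / len(head)

    # First step that dips below the plateau by more than the tolerance.
    decline = next(
        (i for i, r in enumerate(reliability) if r < plateau - tolerance), None
    )
    # First step at or below half of the plateau (assumed meaning of halfway).
    half = next(
        (i for i, r in enumerate(reliability) if r <= plateau / 2), None
    )

    slope = None
    if decline is not None and half is not None:
        start = max(decline - 1, 0)  # last step still on the plateau
        if half > start:
            slope = (reliability[half] - reliability[start]) / (half - start)

    return {
        "plateau": plateau,
        "decline_begins": decline,
        "half_life": half,
        "slope": slope,
    }


# Two made-up curves: a sharp cliff and a gentle decay.
print(falloff_shape([1.0, 1.0, 1.0, 0.1]))
print(falloff_shape([1.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5]))
```

Under these definitions a more negative slope means a sharper cliff.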

Using the numbers

Three ways teams use the score

Model routing

Route by the score, not the advertised window

Two models can advertise the same window and still differ a lot in how hard a task they hold as context grows. Work out roughly where your task sits (how many reasoning steps, at what length) and send it to a model whose score clears that point. That keeps you in the range the model actually handles rather than the one on the spec sheet.
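A minimal sketch of that routing rule follows. It assumes you have already placed your task on the same 0 to 100 scale as the score. The model names, window sizes, and scores are invented placeholders, and route is a hypothetical helper, not part of any Project Peak tooling.

```python
# Placeholder data: both models advertise the same window but score differently.
MODELS = {
    "model-a": {"window_tokens": 1_000_000, "score": 72},
    "model-b": {"window_tokens": 1_000_000, "score": 41},
}


def route(task_point, models):
    """Return the names of models whose score clears task_point,
    strongest first. The advertised window is deliberately ignored."""
    cleared = [
        (info["score"], name)
        for name, info in models.items()
        if info["score"] >= task_point
    ]
    return [name for _, name in sorted(cleared, reverse=True)]


print(route(55, MODELS))
```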

Margin budgeting

Budget by the falloff shape

Past its limit, a sharp-cliff model collapses fast while a gentle-decay model fades slowly. For a blocking, auto-merged check, stay well inside the limit of a sharp-cliff model. A gentle-decay model gives you more usable room near its edge. The shape tells you how much headroom a given task needs.

Confidence tiers

Pay for the precision you need

A sketch profile finds the limit cheaply but with a wide interval. The standard and rigorous tiers run more trials per step and buy a tighter one. Start with a sketch to rank a field, then promote the models you care about to a tighter tier. The score sharpens in place and reuses the trials you already paid for.

A note on method

Measurements use synthetic, procedurally generated code tasks (no scraped code, so no training contamination), scored exactly and deterministically. A graded sequential sampler reads the model's behavior in a probe or two, then concentrates trials near the limit and stops once the confidence interval is tight, so the score is accurate without costing more than it needs to. We report where the limit is and how confident we are in it. We do not publish enough task internals for anyone to rebuild the generator.
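The sampler itself is not published, so the following is only a toy sketch of the general idea: probe every step lightly, find roughly where the pass rate drops below 90%, then spend trials on the steps around that point until the interval is tight or a budget runs out. Every name and default here is hypothetical, and a Wilson score interval stands in for the bootstrap band to keep the example short.

```python
import math
import random


def wilson_width(passes, trials, z=1.2816):
    """Width of a Wilson score interval for a pass rate.
    z = 1.2816 gives roughly an 80% (10th to 90th percentile) band."""
    p = passes / trials
    spread = math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return 2 * z * spread / (1 + z * z / trials)


def sequential_sample(run_trial, n_steps, probe=5, batch=20,
                      target_width=0.08, budget=2000):
    """run_trial(step) returns truthy on a pass. Returns (limit, counts),
    where counts[step] is [passes, trials]."""
    counts = [[0, 0] for _ in range(n_steps)]

    def spend(step, n):
        for _ in range(n):
            counts[step][0] += int(bool(run_trial(step)))
            counts[step][1] += 1

    # Coarse probe: a few trials on every step.
    for step in range(n_steps):
        spend(step, probe)
    used = n_steps * probe

    # First step whose probe pass rate is under 90% (last step if none).
    limit = next(
        (s for s, (p, t) in enumerate(counts) if p / t < 0.9), n_steps - 1
    )
    focus = [s for s in (limit - 1, limit) if s >= 0]

    # Concentrate trials near the limit until the band is tight enough.
    while used + batch * len(focus) <= budget:
        for s in focus:
            spend(s, batch)
        used += batch * len(focus)
        if all(wilson_width(*counts[s]) <= target_width for s in focus):
            break
    return limit, counts


# Simulated model with made-up per-step pass probabilities.
rng = random.Random(0)
true_rates = [0.99, 0.98, 0.95, 0.6, 0.2, 0.05]
limit, counts = sequential_sample(lambda s: rng.random() < true_rates[s], n_steps=6)
print("estimated limit step:", limit)
print("trials per step:", [t for _, t in counts])
```

In this sketch the steps far from the limit receive only the probe trials, which is where the cost saving comes from.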