The score (0 to 100)
The score is the hardest step a model still gets right about 90% of the time, mapped onto a 0 to 100 scale. A higher score means the model stays reliable on harder tasks. Each model is graded relative to its own performance on the easy steps, so the score reflects the model rather than the abstract difficulty of the test.
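
To make the idea concrete, here is a minimal sketch of how such a score could be computed. It is an illustration only, not the actual scoring procedure. The function name `score`, the parameters `easy_levels` and `threshold`, the use of mean easy-step accuracy as the baseline, the stop-at-first-failure rule, and the linear mapping to 0 to 100 are all assumptions made for the example.

```python
from typing import Sequence


def score(accuracy_by_level: Sequence[float],
          easy_levels: int = 1,
          threshold: float = 0.9) -> float:
    """Hypothetical scoring sketch (all names and rules are assumptions).

    accuracy_by_level[i] is the fraction of correct answers at
    difficulty level i, ordered from easiest to hardest.
    """
    if not accuracy_by_level:
        raise ValueError("need at least one difficulty level")
    if not 1 <= easy_levels <= len(accuracy_by_level):
        raise ValueError("easy_levels must be between 1 and the number of levels")

    # Assumed baseline: the model's mean accuracy on the easiest levels.
    baseline = sum(accuracy_by_level[:easy_levels]) / easy_levels
    if baseline == 0:
        return 0.0  # the model fails even the easy steps

    # Walk from easy to hard; stop at the first level where accuracy,
    # relative to the baseline, drops below the threshold.
    hardest = 0
    for level, accuracy in enumerate(accuracy_by_level, start=1):
        if accuracy / baseline < threshold:
            break
        hardest = level

    # Assumed linear mapping of the hardest passed level onto 0 to 100.
    return 100.0 * hardest / len(accuracy_by_level)


if __name__ == "__main__":
    print(score([1.0, 0.95, 0.92, 0.7, 0.5]))  # 60.0
    print(score([0.8, 0.76, 0.6, 0.4]))        # 50.0
```

In the second call the model is only 80% accurate on the easiest level, but because each level is compared with that baseline, the second level (0.76 / 0.8 = 0.95) still counts as held. This is one way to read "graded relative to the easy steps"; the real procedure may differ.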