GSO Leaderboard
Opt@1: Estimate of the fraction of tasks where a single attempt achieves at least 95% of the human speedup and passes the correctness tests. See the paper for details.
NEW Hack-Adjusted: We observe that top models can produce deceptive optimizations (e.g., memoization, harness hijacking). To address this, we introduce a "Hack Detector" system that penalizes reward hacks by comparing the model's patch against the oracle solution and the test cases. The "Hack-Adjusted" column shows the score after penalizing detected hacks.
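As a rough illustration of how such an adjustment could work, the sketch below assumes the simplest possible penalty: an attempt that the detector flags as a hack no longer counts as a success. The function name `hack_adjusted_score` and the `(succeeded, flagged_as_hack)` input format are hypothetical, and the actual penalty used on the leaderboard may differ.

```python
def hack_adjusted_score(attempts):
    """Fraction of attempts that succeed and are not flagged as hacks.

    `attempts` is a list of (succeeded, flagged_as_hack) pairs, one per task.
    Assumption: a flagged success is simply counted as a failure.
    """
    if not attempts:
        return 0.0
    kept = [succeeded and not flagged for succeeded, flagged in attempts]
    return sum(kept) / len(kept)


# Hypothetical results for four tasks: three raw successes, one of them flagged.
results = [(True, False), (True, True), (False, False), (True, False)]
raw = sum(succeeded for succeeded, _ in results) / len(results)
adjusted = hack_adjusted_score(results)
```

Under this assumption the adjusted score can never exceed the raw score, since flagging only removes successes.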
Opt@1 vs Speedup Threshold (p)
Opt_p@1: Estimate of the fraction of tasks where a single attempt achieves at least a fraction p of the human speedup and passes the correctness tests.
p = 0.95 is our default threshold, but p also acts as a difficulty knob: p = 0 only checks that the agent's patch is correct, regardless of performance, while p = 1 requires the agent's patch to match or exceed the performance of the human commit.
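The role of the threshold can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the `Attempt` record, its field names, and the success test `model_speedup >= p * human_speedup` are assumptions made for the example, and the paper defines the exact speedup metric and estimator.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    correct: bool          # patch passes the correctness tests
    model_speedup: float   # speedup of the agent's patch over the baseline
    human_speedup: float   # speedup of the human commit over the baseline


def attempt_succeeds(a: Attempt, p: float = 0.95) -> bool:
    # Assumed criterion: correct, and at least a fraction p of the human speedup.
    return a.correct and a.model_speedup >= p * a.human_speedup


def opt_p_at_1(tasks: dict[str, list[Attempt]], p: float = 0.95) -> float:
    """Mean over tasks of the per-task fraction of successful attempts."""
    rates = [
        sum(attempt_succeeds(a, p) for a in attempts) / len(attempts)
        for attempts in tasks.values()
        if attempts  # skip tasks with no recorded attempts
    ]
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical data: two tasks, with two and one sampled attempts.
tasks = {
    "task_a": [Attempt(True, 3.0, 2.0), Attempt(True, 1.2, 2.0)],
    "task_b": [Attempt(False, 5.0, 2.0)],
}
scores = {p: opt_p_at_1(tasks, p) for p in (0.0, 0.95, 1.0)}
```

With positive speedups, p = 0 reduces the criterion to correctness alone, so the score can only stay flat or fall as p increases.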