GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
In each task, an agent is given a codebase and a performance test as a precise specification, and must improve runtime efficiency to match an expert developer's optimization.
Leaderboard
Opt@1: Estimator of fraction of tasks where a single attempt achieves ≥95% human speedup and passes correctness tests. See paper for details.
Scaffold: All models are run with the OpenHands scaffold unless otherwise specified.
Changelog:
2026-04-27: Improved elicitation: increased max_iterations (inference compute) to 200 (2x) for new runs, and using reasoning_effort for Claude models like Opus 4.6 for better comparison against newer models elicited with thinking.
2026-04-27: Upgraded the Hack Detector model to GPT-5.4 (xhigh).
2025-11-03: Introduced the Hack Detector: penalizes deceptive optimizations (e.g., memoization, harness hijacking) by comparing the model's patch against the oracle solution and test cases. The "Hack-Adjusted" column shows scores after this penalty. Learn more.
Opt@1 vs Speedup Threshold (p)
Optp@1: Estimate of fraction of tasks where a single attempt achieves ≥p% human speedup and passes correctness tests.
p=0.95 is our default threshold, but it can be a knob for difficulty for a task; i.e., p=0 evaluates if the agent's patch is correct, regardless of performance, while p=1 evaluates if the agent's patch is identical/better in performance to the human commit.
Compute Profile
Median wall-clock time and turns per task spent by the agent vs GSO score. Time measures how long the agent spent working from first to last event in a trajectory. Turns measures the number of steps taken by the agent in a trajectory.