GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica

In each task, an agent is given a codebase and a performance test as a precise specification, and must improve runtime efficiency to match an expert developer's optimization.

102 tasks · 10 codebases · 5 languages · Learn more →

Leaderboard


Opt@1: Estimated fraction of tasks where a single attempt achieves ≥95% of the human speedup and passes correctness tests. See paper for details.

Scaffold: All models are run with the OpenHands scaffold unless otherwise specified.

Changelog:

2026-04-27: Improved elicitation: increased max_iterations (inference compute) to 200 (2x) for new runs, and enabled reasoning_effort for Claude models such as Opus 4.6, for a fairer comparison against newer models elicited with thinking.

2026-04-27: Upgraded the Hack Detector model to GPT-5.4 (xhigh).

2025-11-03: Introduced the Hack Detector: penalizes deceptive optimizations (e.g., memoization, harness hijacking) by comparing the model's patch against the oracle solution and test cases. The "Hack-Adjusted" column shows scores after this penalty. Learn more.

Opt@1 vs Speedup Threshold (p)

Optp@1: Estimated fraction of tasks where a single attempt achieves at least a fraction p of the human speedup and passes correctness tests.

p=0.95 is our default threshold, but it serves as a per-task difficulty knob: p=0 evaluates whether the agent's patch is correct regardless of performance, while p=1 evaluates whether the agent's patch matches or exceeds the performance of the human commit.
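As a minimal sketch of how this metric could be computed: assuming each single-attempt task result records a pass/fail flag and the agent's and human's measured speedups (the field names below are hypothetical, not the benchmark's actual schema), Opt_p@1 is just the fraction of results clearing both bars.

```python
def opt_p_at_1(results, p=0.95):
    """Fraction of tasks where one attempt passes correctness tests
    and achieves at least a fraction p of the human speedup.

    `results` is a list of dicts with hypothetical keys:
      "passed" (bool), "agent_speedup" (float), "human_speedup" (float).
    """
    hits = sum(
        1 for r in results
        if r["passed"] and r["agent_speedup"] >= p * r["human_speedup"]
    )
    return hits / len(results)
```

With p=0 this reduces to the plain correctness rate over passing attempts, and with p=1 it requires matching or beating the human commit, mirroring the knob described above.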

Compute Profile

Median wall-clock time and turns per task spent by the agent vs. GSO score. Time measures how long the agent worked, from the first to the last event in a trajectory; Turns measures the number of steps the agent took in a trajectory.
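These two statistics can be sketched as follows, assuming (hypothetically) that each trajectory is recorded as a list of (timestamp, event) pairs; the actual trajectory format used by the benchmark may differ.

```python
from statistics import median


def compute_profile(trajectories):
    """Median wall-clock time and median turn count across trajectories.

    Each trajectory is assumed to be a non-empty, time-ordered list of
    (timestamp_seconds, event) pairs -- a hypothetical representation.
    Wall-clock time is last timestamp minus first; turns is the event count.
    """
    times = [traj[-1][0] - traj[0][0] for traj in trajectories]
    turns = [len(traj) for traj in trajectories]
    return median(times), median(turns)
```

Using the median rather than the mean keeps the profile robust to a few outlier tasks where an agent runs unusually long.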