GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica

In each task, an agent is given a codebase and a performance test as a precise specification, and must improve runtime efficiency to match an expert developer's optimization.

102 tasks · 10 codebases · 5 languages · Learn more →

Leaderboard

Loading leaderboard...

Opt@1: Estimator of fraction of tasks where a single attempt achieves ≥95% human speedup and passes correctness tests. See paper for details.

NEW Hack-Adjusted: We observe that top models can perform deceptive optimizations (e.g., memoization, harness hijacking, etc.). To tackle this we introduce a new "Hack Detector" system that penalizes reward hacks by comparing the model's patch with the oracle solution and the test cases. The "Hack-Adjusted" column shows the adjusted score after penalizing detected hacks. Learn more.

Opt@1 vs Speedup Threshold (p)

Opt_p@1: Estimate of fraction of tasks where a single attempt achieves ≥p% human speedup and passes correctness tests.

p=0.95 is our default threshold, but it can be a knob for difficulty for a task; i.e., p=0 evaluates if the agent's patch is correct, regardless of performance, while p=1 evaluates if the agent's patch is identical/better in performance to the human commit.

Compute Profile

Median wall-clock time and turns per task spent by the agent vs GSO score. Time measures how long the agent spent working from first to last event in a trajectory. Turns measures the number of steps taken by the agent in a trajectory.