SWE-bench Verified - OpenAI
Summary
OpenAI collaborated with software developers to improve the SWE-bench benchmark by identifying and filtering out problematic test samples. The resulting SWE-bench Verified provides a more reliable evaluation of AI models' software engineering skills.
Review
OpenAI's SWE-bench Verified represents a significant advancement in AI model evaluation for software engineering tasks. By screening 1,699 samples with 93 professional software developers, OpenAI identified critical issues in the original benchmark that could systematically underestimate AI models' capabilities. The key problems included underspecified issue descriptions, overly specific or unrelated unit tests, and unreliable development environment setups.
The research methodology involved a rigorous human annotation process in which each sample was labeled three times across multiple criteria, including problem specification clarity, test validity, and task difficulty. This process led to filtering out 68.3% of the annotated samples and produced a more robust 500-sample verified dataset. Notably, GPT-4o's performance improved from 16% on the original benchmark to 33.2% on the verified dataset, confirming that the original benchmark understated model performance. The work highlights the importance of continuous improvement in AI evaluation benchmarks and the need for careful, nuanced assessment of AI capabilities.
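The filtering step is easy to approximate in code. The sketch below is a minimal illustration under stated assumptions, not OpenAI's actual pipeline: it assumes each sample carries three annotator labels on 0-3 severity scales for issue underspecification and test validity (the field names `underspecified` and `invalid_tests`, the max-across-annotators aggregation, and the severity threshold of 2 are all hypothetical choices for this example).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Annotation:
    # Hypothetical per-annotator labels on a 0-3 severity scale:
    # 0 = no issue, 3 = severe issue.
    underspecified: int   # how underspecified the issue description is
    invalid_tests: int    # how likely the unit tests reject valid fixes

@dataclass
class Sample:
    instance_id: str
    annotations: List[Annotation]  # three independent annotations per sample

def keep_sample(sample: Sample, threshold: int = 2) -> bool:
    """Keep a sample only if no aggregated label reaches the severity threshold.

    Aggregation here takes the worst (max) label across annotators, a
    conservative choice; the actual ensembling rule may differ.
    """
    worst_underspecified = max(a.underspecified for a in sample.annotations)
    worst_invalid_tests = max(a.invalid_tests for a in sample.annotations)
    return worst_underspecified < threshold and worst_invalid_tests < threshold

def build_verified_candidates(samples: List[Sample]) -> List[Sample]:
    """Filter annotated samples down to candidates for the verified subset."""
    return [s for s in samples if keep_sample(s)]
```

Under this kind of rule, a sample flagged by even one annotator as severely underspecified or as having unfair tests is dropped, which is consistent with the large (68.3%) rejection rate reported for the annotated pool.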
Key Points
- Human-validated benchmark that addresses limitations in original SWE-bench dataset
- 68.3% of annotated samples filtered out due to underspecified issue descriptions or unreliable unit tests
- Performance gains (GPT-4o: 16% → 33.2%) show the original benchmark underestimated AI capabilities