OpenAI’s Preparedness Framework plays a crucial role in assessing the capabilities of AI models, particularly how autonomously they can complete complex tasks such as software engineering. Autonomous completion of such tasks is a key indicator within the Medium risk level of the Model Autonomy risk category. Evaluating these capabilities is inherently difficult, however: software tasks are complex, generated code is hard to assess, and real-world development scenarios are hard to simulate.
One of the primary tools in this evaluation is SWE-bench, a popular benchmark for assessing whether large language models (LLMs) can resolve real-world software issues sourced from GitHub. Given a code repository and an issue description, an agent must generate a patch that fixes the issue. Coding agents have made impressive progress, reaching scores of 20% on the full SWE-bench test set and 43% on SWE-bench Lite, but the evaluation process has also highlighted limitations in the benchmark itself.
Our analysis revealed that some SWE-bench tasks might be unsolvable, leading to an underestimation of models’ capabilities. To address this, we collaborated with SWE-bench’s creators to refine the benchmark, resulting in a more accurate assessment tool.
Understanding SWE-bench
SWE-bench evaluates AI models on GitHub issues drawn from 12 open-source Python repositories. Each issue is paired with the pull request (PR) that resolved it, which contains both the solution code and unit tests. The FAIL_TO_PASS tests fail before the solution is applied and pass afterwards, verifying that the issue is fixed. The PASS_TO_PASS tests ensure that unrelated functionality remains unaffected.
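Conceptually, grading a candidate patch looks something like the sketch below. This is illustrative only, not the official harness: the `run_tests` helper, the plain `pytest` invocation, and the function names are assumptions, since each repository is actually run with its own test configuration.

```python
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> dict[str, bool]:
    """Run the given tests with pytest and report pass/fail per test id.

    Hypothetical helper: real harnesses invoke each repo's own test runner
    with per-repo setup and configuration.
    """
    results = {}
    for test_id in test_ids:
        proc = subprocess.run(
            ["pytest", "-x", "-q", test_id],
            cwd=repo_dir,
            capture_output=True,
        )
        results[test_id] = proc.returncode == 0
    return results

def is_resolved(repo_dir: str,
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """A patch resolves an issue only if every FAIL_TO_PASS test now passes
    and every PASS_TO_PASS test still passes."""
    after = run_tests(repo_dir, fail_to_pass + pass_to_pass)
    return all(after[t] for t in fail_to_pass) and all(after[t] for t in pass_to_pass)
```

A sample counts as resolved only when both conditions hold, which is why overly strict tests (discussed below) can incorrectly reject valid patches.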
Enhancing Preparedness Evaluation with SWE-bench
Recognizing the potential of SWE-bench for OpenAI’s Preparedness Framework, we identified three key areas for improvement:
- Overly Specific Unit Tests: Some tests were too specific or irrelevant, causing correct solutions to be unfairly rejected (see the sketch after this list).
- Ambiguous Issue Descriptions: Many descriptions were underspecified, leading to confusion about the problem and its solution.
- Setup Challenges: Difficulties in setting up development environments led to erroneous test failures, misrepresenting valid solutions as incorrect.
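To make the first point concrete, here is a hypothetical example in the spirit of such tests; the function and test names are invented for illustration. A behaviorally correct fix that raises the same error with slightly different wording would still be rejected, because the test pins the exact message text:

```python
import pytest

def validate_degree(degree: int) -> None:
    # A correct fix might raise ValueError with different wording than the
    # reference PR, e.g. "degree must be a non-negative integer".
    if degree < 0:
        raise ValueError("degree should be >= 0")

def test_negative_degree_rejected():
    # Overly specific: asserting the exact message text rejects any
    # equivalent fix whose wording differs from the reference PR.
    with pytest.raises(ValueError, match="degree should be >= 0"):
        validate_degree(-1)
```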
SWE-bench Verified
To address these issues, we launched SWE-bench Verified, a refined version of the original test set. This new version, created in collaboration with the SWE-bench authors, includes 500 samples that have been thoroughly vetted by professional software developers. We also introduced a new evaluation harness that uses containerized Docker environments to make assessments more reliable and accurate.
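As a rough sketch of what a containerized evaluation run can look like, the snippet below shells out to `docker run` to apply a candidate patch and execute the tests inside an isolated image. The image name, mount path, and test command are placeholders, not the actual harness, which builds per-sample images with the repository and its dependencies pre-installed.

```python
import subprocess

def evaluate_in_container(image: str, patch_path: str, test_cmd: str,
                          timeout: int = 1800) -> bool:
    """Apply a candidate patch and run tests inside an isolated container.

    `image`, the mount point, and `test_cmd` are placeholders for this
    sketch; they are not the names used by the real harness.
    """
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_path}:/tmp/patch.diff:ro",
        image,
        "bash", "-lc",
        f"git apply /tmp/patch.diff && {test_cmd}",
    ]
    proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    return proc.returncode == 0
```

Running each sample in its own container keeps dependency and setup problems from leaking into the grading, which was the third issue identified above.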
Through these improvements, GPT-4o achieves a resolution rate of 33.2% on SWE-bench Verified, and the best-performing open-source scaffolds double their previous scores. This collaboration and refinement represent a significant step forward in accurately evaluating the autonomy of AI models in software engineering, furthering OpenAI’s mission to develop robust and reliable AI systems.