SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
Project Overview
SWT-Bench is a benchmark for generating tests that reproduce user-reported issues in real-world Python code repositories. It formalizes GitHub issues into test-generation tasks: given an issue description and the corresponding repository, Large Language Models (LLMs) and Code Agents are asked to produce tests that demonstrate the reported bug, and the generated tests are evaluated automatically against the real codebase. Key findings show that Code Agents surpass dedicated test-generation methods, reproducing and identifying software issues more reliably. Automating this part of the testing workflow can improve software quality, help validate proposed bug fixes, and boost developer productivity.
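To make the task concrete, the following is a minimal sketch of what an issue-reproducing test might look like. The module and function names (mylib.parser.parse_version) and the specific bug are hypothetical illustrations, not taken from SWT-Bench itself; the point is that the test encodes the behaviour described in the issue, fails on the buggy code, and passes once the issue is fixed.

```python
# Hypothetical issue: "parse_version('1.10') is ordered before '1.9'"
# because version components are compared as strings rather than numbers.
# A reproducing test encodes the expected behaviour from the issue text:
# it FAILS on the buggy code and PASSES once the bug is fixed.

from mylib.parser import parse_version  # hypothetical module under test


def test_issue_version_ordering():
    # Numeric comparison expected: 1.10 should sort after 1.9
    assert parse_version("1.10") > parse_version("1.9")
```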
Key Applications
SWT-Bench
Context: Software development and testing, targeting software engineers and developers.
Implementation: Built a dataset of real-world GitHub issues and their associated code repositories, and used it to evaluate LLMs and Code Agents at generating issue-reproducing tests.
Outcomes: Demonstrated improved test generation success rates and increased precision in identifying correct code fixes through generated tests.
Challenges: Limitations include the focus on Python, potential data contamination from pre-trained models, and challenges in generating contextually relevant tests.
Implementation Barriers
Technical
The lack of large-scale, diverse test-generation datasets specifically tailored for Python limits the effectiveness of automated test generation.
Proposed Solutions: Developing new datasets that encompass a wider range of software issues and testing scenarios.
Implementation
Generated tests may not always reproduce the described issues accurately, leading to unreliable outcomes.
Proposed Solutions: Implementing rigorous validation processes for generated tests to ensure they effectively capture and reproduce the relevant issues.
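One way such a validation process can be realized is a fail-to-pass check: a generated test is accepted only if it fails on the codebase as reported and passes once a reference fix is applied. The sketch below assumes a local repository checkout, a reference patch file, and pytest as the project's test runner; helper names such as run_pytest and apply_patch are illustrative, not part of SWT-Bench's published tooling.

```python
import subprocess


def run_pytest(repo_dir: str, test_file: str) -> bool:
    """Return True if the given test file passes in the repository checkout."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-q"],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def apply_patch(repo_dir: str, patch_file: str) -> None:
    """Apply the reference fix to the checkout (illustrative helper)."""
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)


def revert_patch(repo_dir: str, patch_file: str) -> None:
    """Undo the fix so the buggy state is restored for the next check."""
    subprocess.run(["git", "apply", "--reverse", patch_file], cwd=repo_dir, check=True)


def validates_issue(repo_dir: str, generated_test: str, fix_patch: str) -> bool:
    """Fail-to-pass check: the generated test must fail on the buggy code
    and pass after the reference fix is applied."""
    fails_before = not run_pytest(repo_dir, generated_test)
    apply_patch(repo_dir, fix_patch)
    try:
        passes_after = run_pytest(repo_dir, generated_test)
    finally:
        revert_patch(repo_dir, fix_patch)
    return fails_before and passes_after
```

Tests that pass this check are reliable witnesses of the reported issue and can then be used, for example, to judge whether a proposed code fix actually resolves it.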
Project Team
Niels Mündler
Researcher
Mark Niklas Müller
Researcher
Jingxuan He
Researcher
Martin Vechev
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI