
Benchmarking Educational Program Repair

Project Overview

This project examines the role of large language models (LLMs) in education, with a specific focus on programming instruction. It highlights the ability of LLMs to create learning materials, deliver personalized feedback, and assist in automated program repair, demonstrating their versatility as educational tools. The work underlines the need for standardized benchmarks in research on AI applications in education, advocating for a new benchmark specifically designed to assess program repair capabilities. It also describes the evaluation methods used to measure how effectively LLMs generate constructive feedback and fix student code, and addresses the difficulty of comparing results when studies rely on bespoke datasets. Overall, the findings suggest that integrating generative AI into educational settings can enhance learning outcomes and improve student engagement in programming courses.

Key Applications

Automated program repair using large language models (LLMs)

Context: Programming education, specifically for novice programmers in university courses.

Implementation: Two educational scenarios were established: one with prior data available and one without. LLMs were trained and evaluated on curated datasets such as FalconCode and a dataset from the National University of Singapore; a minimal evaluation sketch follows after this list.

Outcomes: Enhanced automated feedback for students, improved debugging support through program repairs, and the establishment of a benchmark for evaluating program repair techniques.

Challenges: Lack of standardized datasets for comparison, potential overfitting of models to specific datasets, and limitations in capturing the diversity of programming issues encountered by students.
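The benchmark setup described above can be approximated with a small evaluation harness: ask a model to repair a failing student submission, then run the instructor's unit tests to decide whether the repair counts. The sketch below is an illustration under stated assumptions, not the paper's actual harness; the dataset fields (code, problem, tests), the trivial request_repair baseline, and the reliance on pytest are all placeholders to be swapped for a real model call and real dataset loading.

```python
import subprocess
import tempfile
from pathlib import Path

def request_repair(buggy_code: str, problem_statement: str) -> str:
    # Placeholder for the model under evaluation. Returning the submission
    # unchanged gives a trivial lower-bound baseline; swap in an LLM call
    # (prompted or fine-tuned) to benchmark a real repair system.
    return buggy_code

def passes_tests(candidate_code: str, test_code: str) -> bool:
    # Run the instructor-provided unit tests against a candidate repair.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(candidate_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", "test_solution.py"],
            cwd=tmp, capture_output=True, timeout=60,
        )
        return result.returncode == 0

def repair_rate(submissions: list[dict]) -> float:
    # Fraction of buggy submissions repaired into test-passing programs.
    # Each item is assumed to carry 'code', 'problem', and 'tests' fields.
    repaired = sum(
        passes_tests(request_repair(s["code"], s["problem"]), s["tests"])
        for s in submissions
    )
    return repaired / len(submissions)
```

Functional correctness against the course tests is only one axis of evaluation; how closely a repair preserves the student's original code typically needs a separate measure.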

Implementation Barriers

Data Quality and Availability

Many studies utilize bespoke datasets, making it difficult to compare results across different research efforts.

Proposed Solutions: Adopt standardized benchmarks and publicly available datasets, such as FalconCode and the Singapore dataset, to enable fair comparisons across studies.

Model Performance Limitations

Fine-tuned models can overfit on specific datasets, affecting their generalization to new problems.

Proposed Solutions: Use diverse datasets and retrain models as more data becomes available to improve generalization across contexts; a sketch of a simple cross-dataset check appears below.
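One simple way to surface the overfitting concern above is to report the same repair metric on more than one dataset: a large gap between performance on the fine-tuning data and on a held-out benchmark is a warning sign. The snippet below is a hypothetical illustration that reuses the repair_rate function sketched earlier; the dataset names are placeholders, not official loaders for FalconCode or the Singapore dataset.

```python
def cross_dataset_report(evaluate, datasets: dict) -> dict:
    # Evaluate one repair model on several benchmarks, e.g. the dataset it
    # was fine-tuned on plus one it has never seen.
    return {name: evaluate(items) for name, items in datasets.items()}

# Hypothetical usage:
# scores = cross_dataset_report(repair_rate, {
#     "falconcode_subset": falconcode_items,   # in-distribution
#     "singapore_subset": singapore_items,     # held-out
# })
# gap = scores["falconcode_subset"] - scores["singapore_subset"]
```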

Project Team

Charles Koutcheme

Researcher

Nicola Dainese

Researcher

Sami Sarsa

Researcher

Juho Leinonen

Researcher

Arto Hellas

Researcher

Paul Denny

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Charles Koutcheme, Nicola Dainese, Sami Sarsa, Juho Leinonen, Arto Hellas, Paul Denny

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
