Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
Project Overview
This document examines how generative AI, particularly large language models such as ChatGPT and GPT-4, can be applied in programming education. It presents scenarios where these models act as digital tutors, assistants, or collaborative peers for students. A comparative study benchmarks the models against human tutors in multiple educational contexts, showing that GPT-4 not only surpasses ChatGPT but also approaches human effectiveness on several tasks, while still struggling in specific areas. The findings suggest that generative AI has significant potential to enhance programming education, but its limitations call for further research.
Key Applications
Program Assistance
Context: Assisting students with programming tasks by repairing code, providing hints, giving grading feedback, and generating contextual explanations.
Implementation: Used the AI models to generate hints for debugging, repair buggy programs based on problem descriptions, provide automated grading feedback against a rubric, generate detailed explanations for specific parts of correct programs, and create new programming tasks derived from student errors.
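As a sketch of how such assistance might be wired up, the snippet below builds a hint-generation request in the chat-message format used by LLM APIs. The prompt wording and the helper name are illustrative assumptions, not the authors' actual prompts:

```python
def build_hint_request(problem_description: str, buggy_program: str) -> list[dict]:
    """Assemble chat messages asking an LLM for a single debugging hint.

    The system prompt constrains the model to hint at the bug rather than
    reveal a full fix, mirroring the tutor-style assistance described above.
    Prompt wording is illustrative only.
    """
    system = (
        "You are a programming tutor. Given a problem description and a "
        "student's buggy program, give ONE concise hint that points the "
        "student toward the bug without revealing the corrected code."
    )
    user = (
        f"Problem description:\n{problem_description}\n\n"
        f"Student program:\n{buggy_program}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# The resulting messages could then be sent to a model such as GPT-4,
# e.g. via an OpenAI client: client.chat.completions.create(model=..., messages=messages)
messages = build_hint_request(
    "Return the sum of a list of integers.",
    "def total(xs):\n    s = 0\n    for x in xs:\n        s = x\n    return s",
)
```

Keeping prompt construction in a pure function like this makes the tutoring policy (hint, don't solve) easy to inspect and test independently of any API call.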
Outcomes: GPT-4 achieved 84% overall effectiveness in providing contextualized explanations, 88% correctness in program repair, 66% in hint generation, and 64% in pair-programming assistance. However, it struggled with grading feedback (16% match with human grading) and task synthesis (22%). ChatGPT performed worse across these tasks, with significant gaps relative to human tutors.
Challenges: AI faced challenges such as inaccuracies in grading feedback, variability in hint quality, excessive edits leading to loss of context in pair programming, and difficulties in capturing the essence of student errors for task synthesis.
Implementation Barriers
Technical Limitations
Current models like GPT-4 struggle with complex scenarios such as grading feedback, task synthesis, and maintaining context when editing student code.
Proposed Solutions: Future work should focus on improving model capabilities, integrating more educational context, and enhancing training data to include more examples of context-sensitive programming tasks.
Quality of Output
Hints and feedback provided by AI can vary in quality, sometimes being incorrect or unhelpful.
Proposed Solutions: Implementing stricter evaluation metrics and human oversight for output quality.
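One concrete form such oversight could take is gating AI-proposed repairs behind instructor-written test cases, so that only verified output reaches the student and everything else is routed to human review. This is a minimal sketch under that assumption; the function and parameter names are hypothetical:

```python
def passes_instructor_tests(candidate_src: str, test_cases, func_name: str) -> bool:
    """Check an LLM-proposed repair against instructor test cases.

    Returns True only if the candidate source defines `func_name` and
    produces the expected output on every case; any error or mismatch
    returns False, flagging the output for human review instead.
    Sandboxing of the executed code is omitted in this sketch.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


# A correct repair passes; a wrong one is held back for a human tutor.
cases = [(([1, 2, 3],), 6), (([],), 0)]
ok = passes_instructor_tests("def total(xs):\n    return sum(xs)", cases, "total")
bad = passes_instructor_tests("def total(xs):\n    return 0", cases, "total")
```

Automated checks like this cannot judge pedagogical quality, so they complement rather than replace human spot-checking of hints and feedback.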
Project Team
Tung Phung
Researcher
Victor-Alexandru Pădurean
Researcher
José Cambronero
Researcher
Sumit Gulwani
Researcher
Tobias Kohn
Researcher
Rupak Majumdar
Researcher
Adish Singla
Researcher
Gustavo Soares
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI