
Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

Project Overview

This paper explores the application of generative AI, particularly large language models such as ChatGPT and GPT-4, in programming education. It presents scenarios where these models act as digital tutors, assistants, or collaborative peers for students. A comparative study benchmarks the models against human tutors across multiple educational tasks, showing that GPT-4 not only surpasses ChatGPT but also comes close to human-tutor effectiveness on several tasks, while still struggling in specific areas. The findings indicate that generative AI holds significant potential for enhancing programming education, though its limitations call for further research.

Key Applications

Program Assistance

Context: Assisting students with programming tasks by repairing code, providing hints, giving grading feedback, and generating contextual explanations.

Implementation: Utilized AI to assist students by generating hints for debugging, repairing buggy programs based on problem descriptions, providing automated grading feedback against a rubric, generating detailed explanations for specific parts of correct programs, and creating new programming tasks derived from student errors.
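The program-repair setting described above can be sketched as assembling the problem description and the student's buggy code into a single prompt for the model. This is a minimal illustrative sketch, not the paper's actual prompt: the function name `build_repair_prompt` and the exact wording are assumptions, and the model call itself is omitted.

```python
# Hypothetical sketch: pack a task description and a buggy student program
# into one repair prompt, as in the program-repair scenario described above.
# The prompt text and function names are illustrative, not from the paper.

def build_repair_prompt(problem_description: str, buggy_code: str) -> str:
    """Combine the task description and the student's code into one prompt."""
    return (
        "You are a programming tutor. Repair the student's program so that it\n"
        "solves the task, changing as little code as possible.\n\n"
        f"Task description:\n{problem_description}\n\n"
        f"Student's buggy program:\n```python\n{buggy_code}\n```\n\n"
        "Return only the corrected program."
    )

prompt = build_repair_prompt(
    "Return the sum of a list of integers.",
    "def total(xs):\n    s = 0\n    for x in xs:\n        s -= x\n    return s",
)
print(prompt)
```

The resulting string would then be sent to a model such as GPT-4; hint generation and explanation generation follow the same pattern with different instructions.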

Outcomes: GPT-4 achieved an overall effectiveness of 84% in providing contextualized explanations, 88% correctness in program repair, 66% in hint generation, and 64% in pair programming assistance. However, it struggled with grading feedback (16% match with human grading) and task synthesis (22%). ChatGPT scored lower across these tasks, with significant gaps compared to human tutors.

Challenges: AI faced challenges such as inaccuracies in grading feedback, variability in hint quality, excessive edits leading to loss of context in pair programming, and difficulties in capturing the essence of student errors for task synthesis.

Implementation Barriers

Technical Limitations

Current models like GPT-4 struggle with complex scenarios such as grading feedback, task synthesis, and maintaining context when editing student code.

Proposed Solutions: Future work should focus on improving model capabilities, integrating more educational context, and enhancing training data to include more examples of context-sensitive programming tasks.

Quality of Output

Hints and feedback provided by AI can vary in quality, sometimes being incorrect or unhelpful.

Proposed Solutions: Implementing stricter evaluation metrics and human oversight for output quality.

Project Team

Tung Phung, Researcher
Victor-Alexandru Pădurean, Researcher
José Cambronero, Researcher
Sumit Gulwani, Researcher
Tobias Kohn, Researcher
Rupak Majumdar, Researcher
Adish Singla, Researcher
Gustavo Soares, Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
