
Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering

Project Overview

This document explores the use of generative AI, specifically GPT-3.5, in education, focusing on its application to auto-grading written responses in a physics course. It introduces a prompting technique called 'scaffolded chain of thought (CoT),' which improves grading accuracy substantially over conventional grading approaches. The findings show that AI-driven grading can reach human-level accuracy, reducing the grading burden on instructors and improving the efficiency of educational assessment. Implementing such technology is not without challenges, however, including the potential for AI hallucinations and the need for careful prompt engineering to ensure reliable outcomes. Overall, the study highlights the promise of generative AI for transforming educational practice while acknowledging the hurdles that must be addressed for effective integration.

Key Applications

Scaffolded Chain of Thought (CoT) for grading student responses

Context: Large public research university; calculus-based introductory physics course focused on Mechanics with 99 students

Implementation: GPT-3.5 was used with scaffolded CoT prompting to assess student explanations for incorrect answers to a physics problem.
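The page does not reproduce the authors' actual prompt. As an illustration only, a scaffolded CoT grading prompt can be thought of as breaking the rubric into explicit reasoning steps the model must answer in order before it assigns partial credit. The function name, rubric items, and wording below are all hypothetical:

```python
# Hypothetical sketch of a scaffolded chain-of-thought (CoT) grading prompt.
# The rubric items and wording are illustrative, not the authors' actual prompt.

def build_scaffolded_cot_prompt(question: str, rubric_items: list[str],
                                student_response: str) -> str:
    """Build a prompt that walks the model through each rubric item
    before it assigns a partial-credit score."""
    steps = "\n".join(
        f"Step {i}: Does the response satisfy this criterion? "
        f"Quote the relevant part of the response, then answer yes/no.\n"
        f"Criterion: {item}"
        for i, item in enumerate(rubric_items, start=1)
    )
    return (
        f"You are grading a written explanation for a physics question.\n"
        f"Question: {question}\n"
        f"Student response: {student_response}\n\n"
        f"Work through the following steps in order:\n{steps}\n"
        f"Final step: Based only on your step-by-step answers above, "
        f"assign a score from 0 to {len(rubric_items)} "
        f"(one point per satisfied criterion)."
    )

prompt = build_scaffolded_cot_prompt(
    question="Why does the ball's acceleration stay constant during flight?",
    rubric_items=[
        "Identifies gravity as the only force acting after release",
        "States that constant net force implies constant acceleration",
    ],
    student_response="Gravity is the only force, so acceleration is g downward.",
)
print(prompt)
```

The scaffold constrains the model's reasoning path, which is what distinguishes this approach from simply asking for a score in one shot.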

Outcomes: AI grading accuracy improved by 20%-30% compared to conventional methods, with agreement between AI and human raters reaching 70%-80%.
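The simplest way to quantify rater agreement of the kind reported above is the fraction of responses on which the AI and a human assign identical scores. The scores below are fabricated illustrative data, not the study's:

```python
# Illustrative calculation of AI-human grading agreement.
# The score lists are made-up example data, not the study's results.

def percent_agreement(human_scores: list[int], ai_scores: list[int]) -> float:
    """Fraction of responses on which the two raters give the same score."""
    assert len(human_scores) == len(ai_scores) and human_scores
    matches = sum(h == a for h, a in zip(human_scores, ai_scores))
    return matches / len(human_scores)

human = [2, 1, 0, 2, 1, 2, 0, 1]
ai    = [2, 1, 1, 2, 1, 2, 0, 0]
print(f"Agreement: {percent_agreement(human, ai):.0%}")  # 6 of 8 match -> 75%
```

For ordinal partial-credit scores, a chance-corrected statistic such as Cohen's kappa is often reported alongside raw agreement, since raw agreement alone can be inflated by common score values.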

Challenges: LLMs may generate hallucinated outputs that lead to incorrect grading. Prompt engineering is necessary to mitigate this issue.

Implementation Barriers

Technical barrier

LLMs can generate hallucinations, leading to factually incorrect outputs and low agreement with human graders. Effective prompt engineering requires careful design and testing, which can be resource-intensive.

Proposed Solutions: Techniques such as prompt engineering, fine-tuning, retrieval-augmented generation, and few-shot learning can be employed to reduce hallucination. Iterative development and testing of prompts can optimize grading outcomes.
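Of the mitigations listed, few-shot learning is the simplest to sketch: human-graded example responses are prepended to the prompt so the model anchors its scores and rationales to graded precedents rather than inventing criteria. The example texts, scores, and function below are hypothetical, not taken from the paper:

```python
# Hypothetical few-shot grading prompt: prepending human-graded examples
# encourages the model to imitate their format and scoring, reducing
# ungrounded (hallucinated) rationales. Examples are illustrative only.

FEW_SHOT_EXAMPLES = [
    ("The ball slows because air pushes back on it.", 1,
     "Mentions a force but misidentifies which force dominates."),
    ("Gravity is the only force after release, so a = g.", 2,
     "Correctly identifies the force and links it to acceleration."),
]

def build_few_shot_prompt(student_response: str, max_score: int = 2) -> str:
    shots = "\n\n".join(
        f"Response: {text}\nScore: {score}/{max_score}\nRationale: {why}"
        for text, score, why in FEW_SHOT_EXAMPLES
    )
    return (
        "Grade each physics explanation like the examples, citing only "
        "statements that actually appear in the response.\n\n"
        f"{shots}\n\n"
        f"Response: {student_response}\nScore:"
    )

print(build_few_shot_prompt("The net force is gravity, so acceleration is constant."))
```

The instruction to cite only statements that appear in the response is one concrete prompt-level guard against hallucinated justifications.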

Project Team

Zhongzhou Chen

Researcher

Tong Wan

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Zhongzhou Chen, Tong Wan

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
