Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics
Project Overview
The document examines the use of large language models (LLMs) to automate the grading of undergraduate physics assessments. LLMs promise to streamline grading, offering quicker feedback and reducing human bias. The study finds, however, that AI-generated grades are currently less dependable than those from human evaluators, chiefly because of mathematical errors and hallucinations in AI responses. Providing a structured marking scheme significantly improves the accuracy of AI grading. The findings also suggest a positive correlation between an LLM's problem-solving ability and its grading effectiveness, implying that advances in model capability should translate into better grading. Overall, while generative AI holds substantial potential for transforming educational assessment, its current limitations underline the need for continued development and refinement.
Key Applications
AI grading of undergraduate physics problems using LLMs (e.g., GPT-4, Claude 3.5 Sonnet, Gemini 1.5 Pro)
Context: Higher education, targeting undergraduate physics students at University College London
Implementation: Empirical study comparing AI grading performance against human grading using a dataset of physics problems and a marking scheme.
Outcomes: AI grading improved substantially when provided with a marking scheme, but without one it fell well short of human grading. Results indicated that AI grading could reduce workload and speed up feedback (a minimal evaluation sketch follows this list).
Challenges: AI grading is prone to errors and hallucinations, leading to leniency and inconsistency compared to human grading.
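As a hedged illustration of the AI-versus-human comparison described above, the sketch below computes simple agreement statistics between AI-assigned and human-assigned marks. The arrays are invented placeholders, not the study's data, and the metrics shown (mean absolute error, signed bias, Pearson correlation) are common choices, not necessarily the ones the authors used.

```python
import numpy as np

# Hypothetical marks out of 10 for the same set of student solutions.
# These arrays are illustrative placeholders, not data from the study.
human_marks = np.array([7.0, 4.5, 9.0, 6.0, 3.0, 8.5])
ai_marks = np.array([7.5, 5.0, 9.0, 7.0, 4.5, 8.0])

# Mean absolute error: average size of the AI-human disagreement per script.
mae = np.mean(np.abs(ai_marks - human_marks))

# Mean signed error: positive values indicate the AI grades more leniently.
bias = np.mean(ai_marks - human_marks)

# Pearson correlation: do AI and human rankings of the students agree?
r = np.corrcoef(ai_marks, human_marks)[0, 1]

print(f"MAE: {mae:.2f}  bias: {bias:+.2f}  Pearson r: {r:.2f}")
```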
Implementation Barriers
Technical
Mathematical errors and hallucinations in AI responses reduce grading quality and consistency.
Proposed Solutions: Implement a structured mark scheme to guide AI grading and improve accuracy (a prompt-construction sketch follows).
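A minimal sketch of how a structured mark scheme might be embedded in a grading prompt is shown below. The function name, the mark-scheme format (criterion mapped to available marks), the prompt wording, and the JSON output instruction are all illustrative assumptions, not the prompt used in the study.

```python
def build_grading_prompt(question: str, mark_scheme: dict[str, int], solution: str) -> str:
    """Assemble a grading prompt that ties each awarded mark to a scheme criterion.

    `mark_scheme` maps a criterion description to the marks available for it
    (a hypothetical format chosen for this sketch).
    """
    scheme_lines = "\n".join(
        f"- {criterion} [{marks} mark(s)]" for criterion, marks in mark_scheme.items()
    )
    return (
        f"Question:\n{question}\n\n"
        f"Mark scheme:\n{scheme_lines}\n\n"
        f"Student solution:\n{solution}\n\n"
        "Grade the solution strictly against the mark scheme. "
        "For each criterion, state the marks awarded and a one-line justification, "
        'then give the total as JSON: {"total": <int>}.'
    )

# Example usage with a toy kinematics question.
prompt = build_grading_prompt(
    question="A ball is dropped from 20 m. How long does it take to hit the ground? (g = 9.8 m/s^2)",
    mark_scheme={
        "States s = (1/2) g t^2 or equivalent": 1,
        "Rearranges to t = sqrt(2s/g)": 1,
        "Correct numerical answer, t ≈ 2.0 s": 1,
    },
    solution="t = sqrt(2*20/9.8) ≈ 2.02 s",
)
```

Anchoring each mark to an explicit criterion gives the model less room for the leniency and inconsistency noted under Challenges, since every awarded point must be justified against the scheme.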
Operational
The process of preparing student work for AI grading (e.g., digitizing handwritten solutions) can be cumbersome.
Proposed Solutions: Utilize APIs for more efficient interaction with AI models to streamline the grading process (a minimal sketch follows).
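Below is a minimal sketch of the API-based workflow this solution points to, written against the OpenAI Python SDK (chosen only because the analysis metadata below names an OpenAI model). The system prompt, the default model string, and the function name are assumptions for illustration, not the study's setup.

```python
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

def grade_via_api(prompt: str, model: str = "gpt-4o-mini-2024-07-18") -> str:
    """Send one grading prompt to the model and return the raw reply text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation in awarded marks
        messages=[
            {"role": "system", "content": "You are a careful physics grader."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# Grading a batch this way avoids pasting each script into a chat window by hand:
# replies = [grade_via_api(p) for p in prompts]
```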
Project Team
Ryan Mok
Researcher
Faraaz Akhtar
Researcher
Louis Clare
Researcher
Christine Li
Researcher
Jun Ida
Researcher
Lewis Ross
Researcher
Mario Campanelli
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Ryan Mok, Faraaz Akhtar, Louis Clare, Christine Li, Jun Ida, Lewis Ross, Mario Campanelli
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI