
Using AI Large Language Models for Grading in Education: A Hands-On Test for Physics

Project Overview

The paper explores the use of large language models (LLMs) to automate the grading of undergraduate physics assessments. It highlights the promise of LLMs to streamline grading, offering faster feedback and reducing human bias. Nevertheless, the study finds that AI-generated grades are currently less reliable than those assigned by human markers, chiefly because of mathematical errors and hallucinations in AI responses. Notably, supplying a structured marking scheme significantly improves the accuracy of AI grading. The findings also suggest a positive correlation between an LLM's problem-solving ability and its grading effectiveness, implying that advances in model capability should translate into better automated assessment. Overall, while generative AI holds substantial potential for transforming educational assessment, its current limitations underline the need for continued development and refinement.

Key Applications

AI grading of undergraduate physics problems using LLMs (e.g., GPT-4, Claude 3.5 Sonnet, Gemini 1.5 Pro)

Context: Higher education, targeting undergraduate physics students at University College London

Implementation: An empirical study comparing AI grading performance against human grading on a dataset of physics problems, graded both with and without a marking scheme (a minimal prompting sketch follows this list).

Outcomes: AI grading improved substantially when provided with a marking scheme, but fell well short of human grading without one. The results indicate that AI grading could reduce marking workload and deliver faster feedback.

Challenges: AI grading is prone to errors and hallucinations, leading to leniency and inconsistency compared to human grading.
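
As a concrete illustration of the implementation described above, the sketch below shows how a single digitized solution might be graded by an LLM when the marking scheme is included in the prompt. This is a minimal sketch, not the authors' actual pipeline: the model name, prompt wording, mark scheme content, and helper names are illustrative assumptions, and it presumes the OpenAI Python client (openai >= 1.0) with an API key in the environment.

```python
# Minimal sketch of marking-scheme-guided AI grading (illustrative only;
# not the authors' actual pipeline). Assumes the OpenAI Python client
# (openai >= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

MARK_SCHEME = """\
Q1 (4 marks total):
  [1] Identifies all forces acting on the block
  [1] Applies Newton's second law along the incline
  [1] Correct algebraic solution for the acceleration
  [1] Numerical answer with units (a = 3.4 m/s^2, accept +/- 0.1)
"""

def grade_solution(solution_text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to grade one student solution against the mark scheme."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation in awarded marks
        messages=[
            {"role": "system",
             "content": "You are a physics examiner. Award marks strictly "
                        "according to the mark scheme. Report each mark as "
                        "awarded or withheld with a one-line justification, "
                        "then state the total."},
            {"role": "user",
             "content": f"Mark scheme:\n{MARK_SCHEME}\n"
                        f"Student solution:\n{solution_text}"},
        ],
    )
    return response.choices[0].message.content

# Usage: print(grade_solution("Digitized student solution text goes here."))
```

Pinning the temperature to 0 and demanding per-mark justifications are simple ways to make the model's grading more reproducible and auditable, in line with the study's finding that structure improves consistency.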

Implementation Barriers

Technical

Mathematical errors and hallucinations in AI responses reduce grading quality and consistency.

Proposed Solutions: Implementation of a structured mark scheme to guide AI grading and improve accuracy.
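
One hypothetical way to operationalize such a structured mark scheme (an assumption on our part; the paper supplies the scheme as text in the prompt) is to store it as per-criterion entries with maximum marks and clamp whatever the model awards to those maxima, which guards against the leniency and hallucinated totals noted above. All names and values below are illustrative.

```python
# Hypothetical structured mark scheme: each criterion carries a maximum mark.
# Clamping model-awarded marks to these maxima guards against over-lenient
# or hallucinated totals. Criteria and values are illustrative.
MARK_SCHEME = [
    {"id": "forces",  "description": "Identifies all forces on the block", "max": 1},
    {"id": "newton2", "description": "Applies Newton's second law",        "max": 1},
    {"id": "algebra", "description": "Correct algebraic manipulation",     "max": 1},
    {"id": "answer",  "description": "Numerical answer with units",        "max": 1},
]

def validate_awards(awards: dict[str, float]) -> float:
    """Clamp each awarded mark into [0, max], ignore criteria the model
    invented, and return the corrected total."""
    total = 0.0
    for criterion in MARK_SCHEME:
        raw = awards.get(criterion["id"], 0.0)
        total += min(max(raw, 0.0), criterion["max"])
    return total

# A model that over-awards one criterion and invents another ("style")
# is corrected down to the scheme's limits:
print(validate_awards({"forces": 1, "newton2": 2, "style": 1}))  # -> 2.0
```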

Operational

Preparing student work for AI grading (e.g., digitizing handwritten solutions) can be cumbersome.

Proposed Solutions: Utilizing APIs for more efficient interaction with AI models could streamline the grading process.
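
A minimal sketch of the API-based streamlining suggested above: grading many digitized solutions concurrently with a standard-library thread pool. This is an assumption about how one might batch the calls, not the paper's method; `grade_solution` is the hypothetical single-solution grader sketched earlier.

```python
# Hypothetical batch grading via concurrent API calls (standard-library
# threading around the earlier grade_solution helper; names are illustrative).
from concurrent.futures import ThreadPoolExecutor

def grade_batch(solutions: list[str], max_workers: int = 8) -> list[str]:
    """Grade digitized solutions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(grade_solution, solutions))

# Usage: feedback = grade_batch(digitized_solutions)
```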

Project Team

Ryan Mok, Researcher

Faraaz Akhtar, Researcher

Louis Clare, Researcher

Christine Li, Researcher

Jun Ida, Researcher

Lewis Ross, Researcher

Mario Campanelli, Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Ryan Mok, Faraaz Akhtar, Louis Clare, Christine Li, Jun Ida, Lewis Ross, Mario Campanelli

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
