
Exploring Durham University Physics exams with Large Language Models

Project Overview

The document examines the application of large language models, specifically GPT-4 and GPT-3.5, to university-level physics exams at Durham University. Analyzing exam papers from 2018 to 2022, the study finds that while these AI models can achieve satisfactory scores, their inconsistency means they do not yet significantly challenge the integrity of physics assessments. The findings indicate that generative AI can serve as a valuable educational tool, capable of providing hints and support to students, although it does not surpass the performance of the average student. The study also emphasizes that universities need to revise their assessment strategies to keep pace with advances in AI, so that educational practices remain relevant and effective. Overall, the research underscores the potential of generative AI in education while cautioning against over-reliance on its current capabilities.

Key Applications

Evaluation of GPT-4 and GPT-3.5 on university-level physics exams

Context: University-level assessments at Durham University, targeting undergraduate and postgraduate physics students

Implementation: Automated extraction and evaluation of exam questions using OpenAI's API, followed by assessment by original exam markers
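The pipeline described above can be sketched in Python. This is a minimal, hypothetical illustration of sending one exam question to a chat model via the OpenAI client library; the function names, the system prompt, and the example question are assumptions, not the authors' actual code, and the live call requires the `openai` package plus an `OPENAI_API_KEY` environment variable.

```python
# Hypothetical sketch of the study's automated pipeline: each exam question
# is sent to a chat model, and the reply is saved for human marking.

def build_messages(question_text: str) -> list[dict]:
    """Wrap one exam question in a chat-completion message list."""
    return [
        {
            "role": "system",
            "content": "You are sitting a university physics exam. "
                       "Answer fully, showing your working.",
        },
        {"role": "user", "content": question_text},
    ]


def ask_model(question_text: str, model: str = "gpt-4") -> str:
    """Query the OpenAI API. Requires the `openai` package and an API key."""
    from openai import OpenAI  # imported here so the sketch loads without it
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(question_text),
    )
    return response.choices[0].message.content
```

In the study, the model's answers were then passed back to the original exam markers rather than graded automatically, so the script's job ends at collecting the responses.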

Outcomes: GPT-4 scored an average of 49.4% and GPT-3.5 scored 38.6%, indicating potential for AI to assist weaker students but not to replace traditional assessments.
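The reported figures are averages of per-paper percentage marks. A minimal sketch of that aggregation, with invented per-paper numbers purely for illustration (only the arithmetic is the point):

```python
# Aggregate human-assigned marks into a mean percentage for one model.
# The example marks below are hypothetical, not data from the study.

def percentage(awarded: float, available: float) -> float:
    """Convert raw marks on one paper to a percentage."""
    return 100.0 * awarded / available


def average_percentage(papers: list[tuple[float, float]]) -> float:
    """Mean of per-paper percentages: [(marks awarded, marks available), ...]."""
    return sum(percentage(a, t) for a, t in papers) / len(papers)


example_papers = [(30.0, 60.0), (25.0, 50.0)]  # hypothetical
print(round(average_percentage(example_papers), 1))  # → 50.0
```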

Challenges: The AI models struggled with complex questions requiring deeper physics understanding, and their performance depended heavily on prompt quality.

Implementation Barriers

Technical Limitations

AI models struggled with university-level physics questions that require a deeper understanding of concepts rather than just factual recall. This includes challenges with multipart problems and graphical interpretation.

Proposed Solutions: Adapting assessment types to include multipart problems and graphical interpretation, which AI currently finds challenging.

Assessment Integrity

Concerns exist about the integrity of assessments as AI models improve and could achieve higher scores in the future, which necessitates regular monitoring of AI capabilities.

Proposed Solutions: Regularly monitoring AI capabilities and revising assessment methods to ensure academic integrity.

Project Team

Will Yeadon

Researcher

Douglas P. Halliday

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Will Yeadon, Douglas P. Halliday

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
