
Using Large Language Models for Automated Grading of Student Writing about Science

Project Overview

This project explores the use of generative AI, particularly GPT-4, for automated grading of student writing in Massive Open Online Courses (MOOCs) on science topics. It shows that Large Language Models (LLMs) can achieve grading reliability comparable to traditional methods, including peer review and instructor assessment. When LLMs are given suitable prompts, such as instructor model answers and grading rubrics, they produce grades that align closely with those of human graders. The document also examines the challenges of grading open-ended responses and the limited reliability of peer assessment, noting that while LLMs show significant promise in educational settings, these issues require careful attention. Overall, the research underscores the potential of generative AI to improve grading efficiency and consistency in education.

Key Applications

Automated grading of writing assignments using GPT-4

Context: Massive Open Online Courses (MOOCs) for adult learners with no prior science background, in topics such as astronomy, astrobiology, and the history and philosophy of astronomy.

Implementation: An experiment was conducted where GPT-4 was provided with instructor model answers and rubrics to evaluate student writing assignments.

Outcomes: LLM grading was more reliable than peer grading and matched instructor grading in all three courses, making it effective for automating grading in large classes.

Challenges: LLMs struggled with grading more subjective assignments, especially in the history and philosophy course.
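The implementation described above, in which GPT-4 receives an instructor model answer and a rubric alongside each student submission, can be sketched as a simple prompt-assembly step. This is an illustrative sketch only; the function name, prompt wording, and example texts are assumptions, not the authors' actual prompts.

```python
# Hypothetical sketch of the prompting setup: assemble the instructor's
# model answer and grading rubric together with a student response into
# a single grading prompt for the LLM. All names and text are illustrative.

def build_grading_prompt(model_answer: str, rubric: str, student_response: str) -> str:
    """Combine instructor materials and a student submission into one grading prompt."""
    return (
        "You are grading a student's writing assignment in an introductory science course.\n\n"
        f"Instructor model answer:\n{model_answer}\n\n"
        f"Grading rubric:\n{rubric}\n\n"
        f"Student response:\n{student_response}\n\n"
        "Assign a score according to the rubric and briefly justify it."
    )

# Invented example inputs for demonstration.
prompt = build_grading_prompt(
    model_answer="Stars form when clouds of gas collapse under gravity.",
    rubric="3 points: mentions gas clouds, gravity, and collapse.",
    student_response="Stars are born when big gas clouds fall inward.",
)
print(prompt)
```

The resulting string would then be sent to the model via the provider's chat API; keeping prompt construction separate from the API call makes the grading setup easy to inspect and reuse across courses.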

Implementation Barriers

Technical Limitations

LLMs face challenges in evaluating creative assignments and those requiring higher-order thinking. Future work aims to develop writing assignments and grading systems that play to the strengths of LLMs.

Sampling Bias

The study used a purposeful sampling method, which could introduce bias and may not represent the broader student population. Random sampling methods should be considered in future studies to improve representativeness.

Variability in Grading Interpretations

Differences in interpretations of grading rubrics between LLMs, instructors, and peer graders may affect grading consistency. Utilizing more detailed and standardized rubrics could help align grading interpretations.
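One way to check whether graders interpret a rubric consistently is to quantify agreement between their scores on the same assignments. The sketch below (not from the paper; the scores are invented) computes two simple agreement measures, mean absolute difference and Pearson correlation, using only the standard library.

```python
# Illustrative sketch: quantify agreement between two graders (e.g. an LLM
# and an instructor) over the same set of assignments. Scores are invented.
from statistics import mean, pstdev

def agreement(grades_a: list[float], grades_b: list[float]) -> tuple[float, float]:
    """Return (mean absolute difference, Pearson correlation) for paired grades."""
    mad = mean(abs(a - b) for a, b in zip(grades_a, grades_b))
    ma, mb = mean(grades_a), mean(grades_b)
    cov = mean((a - ma) * (b - mb) for a, b in zip(grades_a, grades_b))
    r = cov / (pstdev(grades_a) * pstdev(grades_b))
    return mad, r

llm_scores        = [3.0, 2.5, 4.0, 1.5, 3.5]
instructor_scores = [3.0, 2.0, 4.0, 2.0, 3.5]
mad, r = agreement(llm_scores, instructor_scores)
print(f"mean abs diff = {mad:.2f}, correlation = {r:.2f}")
```

In practice a rank-aware or chance-corrected statistic (such as quadratically weighted kappa) is often preferred for ordinal rubric scores, but the same paired-scores setup applies.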

Project Team

Chris Impey

Researcher

Matthew Wenger

Researcher

Nikhil Garuda

Researcher

Shahriar Golchin

Researcher

Sarah Stamer

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Chris Impey, Matthew Wenger, Nikhil Garuda, Shahriar Golchin, Sarah Stamer

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
