
Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters

Project Overview

This document examines the use of generative AI, particularly Large Language Models (LLMs) such as GPT-3.5 and GPT-4, to improve the identification of student errors in scientific experimentation protocols. It addresses the limitations of conventional assessment methods and argues for AI-supported personalized feedback and formative assessment in science education. A comparative analysis of AI-generated assessments against those of human raters reveals both the strengths and the limitations of AI in this setting: while the models can identify specific student misconceptions and support tailored guidance, the accuracy and reliability of their evaluations remain open challenges. Overall, the study highlights the potential of generative AI to enable more effective learning experiences and to improve student outcomes.

Key Applications

AI system based on GPT-3.5 and GPT-4 for identifying student errors in experimentation protocols

Context: Secondary education (6th to 8th grade) in Germany, focusing on science subjects like biology and chemistry

Implementation: The AI system was trained on a dataset of student protocols to identify common errors in scientific inquiry. It was compared against human raters for accuracy.

Outcomes: The AI system achieved high accuracy in identifying fundamental errors, such as hypotheses that do not address the independent variable, with detection accuracies of up to 90% for some error types. The goal is to provide personalized feedback to students.

Challenges: Accurately identifying complex errors, the need for substantial training data, and potential biases in AI assessments.
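The comparison between AI and human raters described above comes down to measuring how often the model's error labels agree with the human ones, ideally corrected for chance agreement. The sketch below is illustrative, not from the paper: the labels are hypothetical binary judgments (1 = error present) for a single error type across ten student protocols, and the helper functions compute plain accuracy and Cohen's kappa in pure Python.

```python
from typing import List

def accuracy(ai: List[int], human: List[int]) -> float:
    """Fraction of protocols where the AI label matches the human rater."""
    assert len(ai) == len(human) and len(ai) > 0
    return sum(a == h for a, h in zip(ai, human)) / len(ai)

def cohens_kappa(ai: List[int], human: List[int]) -> float:
    """Chance-corrected agreement for binary labels (1 = error present)."""
    n = len(ai)
    po = accuracy(ai, human)                    # observed agreement
    p_ai = sum(ai) / n                          # P(AI labels "error")
    p_hu = sum(human) / n                       # P(human labels "error")
    pe = p_ai * p_hu + (1 - p_ai) * (1 - p_hu)  # expected chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical ratings for one error type ("hypothesis does not
# address the independent variable") across ten student protocols.
ai_labels    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
human_labels = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(accuracy(ai_labels, human_labels))               # 0.8
print(round(cohens_kappa(ai_labels, human_labels), 2)) # 0.6
```

Reporting kappa alongside raw accuracy matters here because some error types are rare in the data (low prevalence), so a model could score high accuracy simply by never flagging them.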

Implementation Barriers

Technical

The AI system struggles with complex, incomplete, or contradictory student data, which limits its ability to identify errors accurately. In addition, certain error types appear only rarely in the test data (low prevalence), which reduces the reliability of the AI's assessments for those errors.

Proposed Solutions: Improve the algorithms and training process to strengthen the AI's analysis capabilities, and collect a larger, more diverse dataset to make error detection more reliable.

Pedagogical

Skepticism among educators regarding the reliability of AI assessments and concerns about reducing the role of teachers.

Proposed Solutions: Emphasizing the role of AI as a supportive tool for teachers rather than a replacement, promoting a hybrid intelligence approach.

Project Team

Arne Bewersdorff

Researcher

Kathrin Seßler

Researcher

Armin Baur

Researcher

Enkelejda Kasneci

Researcher

Claudia Nerdel

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci, Claudia Nerdel

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
