Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters
Project Overview
The document explores the integration of generative AI, particularly Large Language Models (LLMs) such as GPT-3.5 and GPT-4, in education, focusing on their role in identifying student errors in scientific experimentation protocols. It addresses the limitations of conventional assessment methods and advocates using AI to deliver personalized feedback and strengthen formative assessment in science education. Through a comparative analysis of AI-generated assessments and those of human raters, the study reveals both the advantages and the drawbacks of employing AI in educational settings. The findings suggest that while AI can substantially aid in identifying specific student misconceptions and providing tailored guidance, challenges remain, such as ensuring the accuracy and reliability of AI evaluations. Overall, the document underscores the transformative potential of generative AI in education, highlighting its capacity to enable more effective learning experiences and improve student outcomes.
Key Applications
AI system based on GPT-3.5 and GPT-4 for identifying student errors in experimentation protocols
Context: Secondary education (6th to 8th grade) in Germany, focusing on science subjects like biology and chemistry
Implementation: The AI system was trained on a dataset of student protocols to identify common errors in scientific inquiry. It was compared against human raters for accuracy.
Outcomes: The AI system identified fundamental errors, such as hypotheses that do not focus on the independent variable, with high accuracy, reaching up to 90% for some error types. It aims to provide personalized feedback to students.
Challenges: Challenges include accurately identifying complex errors, the need for substantial training data, and the potential for biases in AI assessments.
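The study's central methodological step is comparing AI-generated error labels against human ratings. A minimal sketch of such a comparison is shown below, using hypothetical binary labels (1 = error present, 0 = absent) and two standard agreement measures, raw accuracy and Cohen's kappa; the label values are illustrative, not data from the paper.

```python
# Hypothetical ratings for ten student protocols on one error category
# (e.g. "hypothesis does not focus on the independent variable").
human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
ai    = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

def accuracy(a, b):
    """Fraction of protocols where the AI label matches the human label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters on binary labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
    p_a = sum(a) / n                               # rater A's positive rate
    p_b = sum(b) / n                               # rater B's positive rate
    pe = p_a * p_b + (1 - p_a) * (1 - p_b)         # agreement expected by chance
    return (po - pe) / (1 - pe)

print(f"accuracy = {accuracy(human, ai):.2f}")   # plain percent agreement
print(f"kappa    = {cohens_kappa(human, ai):.2f}")  # corrected for chance
```

Cohen's kappa matters here because raw accuracy can look high on rare error types simply by always predicting "no error"; kappa discounts that chance agreement.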
Implementation Barriers
Technical
The AI system struggles with complex, incomplete, or contradictory student data, which limits its ability to identify errors accurately. Because some error types occur only rarely in the student protocols, little test data exists for them, reducing the reliability of the AI system's assessments of those errors.
Proposed Solutions: Improving the algorithms and training datasets to enhance the AI's understanding and analysis capabilities, and collecting a larger and more diverse dataset to improve the AI's performance and reliability in error detection.
Pedagogical
Skepticism among educators regarding the reliability of AI assessments and concerns about reducing the role of teachers.
Proposed Solutions: Emphasizing the role of AI as a supportive tool for teachers rather than a replacement, promoting a hybrid intelligence approach.
Project Team
Arne Bewersdorff
Researcher
Kathrin Seßler
Researcher
Armin Baur
Researcher
Enkelejda Kasneci
Researcher
Claudia Nerdel
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci, Claudia Nerdel
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI