Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments
Project Overview
The document examines the use of generative AI chatbots such as ChatGPT, Claude, and Gemini in education, focusing on their ability to answer multiple-choice questions (MCQs) and the implications for academic integrity. It applies Item Response Theory (IRT) to distinguish human from AI-generated responses, introducing a method based on Person-Fit Statistics (PFS) that detects response patterns deviating from the fitted model in ways that suggest AI involvement; the analysis reveals significant differences between human and AI response patterns. It also notes a limitation: as the prevalence of AI-generated responses in educational data rises, the effectiveness of the detection method declines. The document concludes that educational measurement theory has a critical role to play in developing robust assessment tools that maintain academic standards and integrity amid the growing influence of generative AI in education.
Key Applications
Using Item Response Theory and Person-Fit Statistics to detect AI cheating in MCQs.
Context: High-school and standardized assessments with multiple-choice questions.
Implementation: Analyzed response patterns of human test-takers and several AI chatbots, using person-fit statistics to identify patterns that deviate from the fitted IRT model.
Outcomes: Demonstrated significant differences between human and AI responses, providing a method for detecting AI cheating.
Challenges: Detection effectiveness decreases as the prevalence of AI responses increases.
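The summary above does not specify which person-fit statistic or IRT model the authors used. As an illustration only, the following is a minimal sketch of one widely used PFS, the standardized log-likelihood statistic lz, under a Rasch model; the item difficulties and ability value in the usage example are hypothetical, not taken from the paper.

```python
import math

def rasch_p(theta, b):
    """P(correct) under the Rasch model for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def lz_statistic(responses, theta, difficulties):
    """Standardized log-likelihood person-fit statistic (lz).

    Compares the observed response log-likelihood to its expectation,
    scaled by its standard deviation. Large negative values indicate a
    response pattern that is unlikely under the fitted model (aberrance).
    """
    l0 = expected = variance = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        q = 1.0 - p
        l0 += x * math.log(p) + (1 - x) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical 5-item test, difficulties from easy (-2) to hard (+2),
# examinee ability theta = 0.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Model-consistent pattern (easy items right, hard items wrong): lz is positive.
consistent = lz_statistic([1, 1, 1, 0, 0], 0.0, difficulties)

# Inverted pattern (easy items wrong, hard items right), the kind of misfit
# that can arise when answers come from a different "respondent" than the
# one the items were calibrated on: lz is strongly negative.
aberrant = lz_statistic([0, 0, 0, 1, 1], 0.0, difficulties)
```

The intuition matches the study's approach: human responses calibrated under the model tend to fit, while responses produced by a different process (such as a chatbot) can misfit in detectable ways.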
Implementation Barriers
Technical Barrier
Detection methods are sensitive to the prevalence ("pollution") of AI-generated responses in the assessment data: as the proportion of AI responses grows, detection effectiveness declines.
Proposed Solutions: Develop more robust statistical methods that account for varying levels of AI response integration in assessments.
Implementation Barrier
Methods are retrospective and less effective for real-time detection of AI cheating.
Proposed Solutions: Explore real-time monitoring systems and adaptive assessment designs.
Validity Barrier
Misfit patterns that produce extreme PFS scores may also arise from legitimate conditions such as learning disabilities, making it difficult to differentiate legitimate variation in human responding from AI-generated patterns.
Proposed Solutions: Refine detection criteria to distinguish legitimate human response variation from AI-generated patterns, treating statistical flags as triggers for review rather than conclusive evidence.
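One hedged sketch of such a criterion, assuming standardized person-fit scores (e.g. lz-type statistics, asymptotically standard normal under the model) and treating the significance level as a tunable review threshold; the function name and cutoff value are illustrative, not the authors' procedure.

```python
from statistics import NormalDist

def flag_for_review(pfs_scores, alpha=0.01):
    """Return indices of examinees whose standardized person-fit score
    falls below the lower-tail normal cutoff at level alpha.

    Because legitimate conditions (e.g. learning disabilities) can also
    produce extreme misfit, a flag here marks a case for human review,
    not a determination of cheating.
    """
    cutoff = NormalDist().inv_cdf(alpha)  # about -2.33 for alpha = 0.01
    return [i for i, z in enumerate(pfs_scores) if z < cutoff]
```

A stricter alpha lowers the false-positive rate on legitimate human misfit at the cost of missing more AI-generated patterns, which is exactly the trade-off the validity barrier describes.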
Project Team
Alona Strugatski
Researcher
Giora Alexandron
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Alona Strugatski, Giora Alexandron
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI