
Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments

Project Overview

This project examines the use of generative AI chatbots, such as ChatGPT, Claude, and Gemini, to answer multiple-choice questions (MCQs), and the implications for academic integrity. It applies Item Response Theory (IRT) to distinguish human from AI-generated responses, introducing a method based on Person-Fit Statistics (PFS) that detects response patterns deviating from what an IRT model predicts for human examinees. The analysis reveals significant differences between the two types of responses. However, it also identifies a concerning trend: as the prevalence of AI-generated responses in an assessment rises, the effectiveness of the detection method declines. The work emphasizes the role of educational measurement theory in developing assessment tools that remain robust as generative AI becomes more widespread in education.

Key Applications

Using Item Response Theory and Person-Fit Statistics to detect AI cheating in MCQs.

Context: High-school and standardized assessments with multiple-choice questions.

Implementation: Analyzed response patterns of human learners and various AI chatbots using statistical methods to identify deviations.

Outcomes: Demonstrated significant differences between human and AI responses, providing a method for detecting AI cheating.

Challenges: Detection effectiveness decreases as the prevalence of AI responses increases.
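The kind of person-fit analysis described above can be illustrated with the standardized log-likelihood statistic l_z (Drasgow et al., 1985), a common PFS. The item parameters, response patterns, and 2PL model below are illustrative assumptions for a sketch, not the paper's actual data or implementation:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def lz_statistic(u, theta, a, b):
    """Standardized log-likelihood person-fit statistic (l_z).

    Strongly negative values flag response patterns that misfit the
    IRT model, e.g. missing easy items while solving hard ones.
    """
    p = p_correct(theta, a, b)
    u = np.asarray(u, dtype=float)
    loglik = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (loglik - expected) / np.sqrt(variance)

# Illustrative 10-item test: common discrimination a, spread of difficulties b
a = np.full(10, 1.2)
b = np.linspace(-2.0, 2.0, 10)

# For an examinee of ability theta = 0, a "typical" pattern gets the
# easy items right; an "atypical" one gets only the hard items right.
typical = (b < 0).astype(int)
atypical = 1 - typical

print(lz_statistic(typical, 0.0, a, b))   # positive: pattern consistent with the model
print(lz_statistic(atypical, 0.0, a, b))  # strongly negative: misfit
```

A detector in this spirit would flag examinees whose l_z falls below a cutoff (e.g. -1.96 under the approximate standard-normal null) for further review.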

Implementation Barriers

Technical Barrier

Sensitivity of detection methods to the level of AI response prevalence (pollution).

Proposed Solutions: Develop more robust statistical methods that account for varying levels of AI response integration in assessments.
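The sensitivity to pollution can be sketched with a toy simulation. Everything below is a hypothetical setup, not the paper's design: a Rasch model for humans, an "AI" responder whose accuracy is flat across item difficulty, a crude ability estimate, and a flagging cutoff taken as the empirical 5th percentile of the pooled person-fit scores. Because the cutoff is estimated from a pool that already contains AI responses, the detection rate falls as their share grows:

```python
import numpy as np

rng = np.random.default_rng(42)
N_ITEMS = 40
B = np.linspace(-2.5, 2.5, N_ITEMS)  # Rasch item difficulties (illustrative)

def prob(theta, b):
    """Rasch model: P(correct | ability theta, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def lz(u, theta, b):
    """Standardized log-likelihood person-fit statistic."""
    p = prob(theta, b)
    ll = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (ll - e) / np.sqrt(v)

def lz_scores(patterns):
    """Crude ability estimate (logit of proportion correct), then l_z."""
    scores = []
    for u in patterns:
        prop = np.clip(u.mean(), 0.02, 0.98)
        theta_hat = np.log(prop / (1 - prop))
        scores.append(lz(u, theta_hat, B))
    return np.array(scores)

def detection_rate(pollution, n_total=1000):
    """Fraction of AI patterns flagged when the cutoff (empirical 5th
    percentile) is computed from a pool that is `pollution` AI."""
    n_ai = int(pollution * n_total)
    theta = rng.normal(0.0, 1.0, n_total - n_ai)
    humans = (rng.random((len(theta), N_ITEMS))
              < prob(theta[:, None], B[None, :])).astype(float)
    # Toy AI responder: correct with flat probability 0.75, ignoring difficulty
    ai = (rng.random((n_ai, N_ITEMS)) < 0.75).astype(float)
    scores = np.concatenate([lz_scores(humans), lz_scores(ai)])
    cutoff = np.percentile(scores, 5)
    return float((scores[len(theta):] < cutoff).mean())

for pollution in (0.05, 0.25, 0.50):
    print(pollution, detection_rate(pollution))
```

In this sketch the flag rate on AI patterns shrinks as pollution rises, since the empirical cutoff absorbs an ever larger share of the AI scores. A more robust method would hold the cutoff fixed from a clean calibration sample or a parametric null distribution.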

Implementation Barrier

Methods are retrospective and less effective for real-time detection of AI cheating.

Proposed Solutions: Explore real-time monitoring systems and adaptive assessment designs.

Validity Barrier

Misfit patterns flagged by PFS may arise from legitimate conditions such as learning disabilities, making it difficult to differentiate legitimate variation in human responses from AI-generated patterns.

Proposed Solutions: Refine detection criteria to effectively distinguish between legitimate human responses and those generated by AI.

Project Team

Alona Strugatski

Researcher

Giora Alexandron

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Alona Strugatski, Giora Alexandron

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
