Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment
Project Overview
This document summarizes research on applying generative AI, particularly large language models (LLMs) such as GPT-4o, to education, focusing on enhancing educational assessments by integrating synthetic and human-coded training data. Synthetic data generated by LLMs is used to augment human-coded datasets, improving text classification in open-response assessments. The authors propose a hybrid approach that combines the two data sources, increasing the accuracy and reliability of assessment classifiers. Experimental findings indicate that the ratio of synthetic to human-coded data and the temperature setting used during generation significantly influence performance stability. Overall, integrating generative AI into educational assessment offers promising gains in evaluating student responses and points toward more effective and efficient assessment strategies.
Key Applications
Augmentation of human-coded datasets with synthetic LLM-generated samples for text classification
Context: Open-response assessments in educational settings, targeting tutors and students
Implementation: Human-coded data was combined with LLM-generated responses to fine-tune a BERT classifier, varying the ratio of synthetic to human-coded data and the temperature settings used for data generation (a minimal sketch of this mixing step follows this list).
Outcomes: Improved classifier performance in predicting appropriate tutor responses based on a coding rubric, with optimal results at an 80% synthetic to 20% human-coded data ratio.
Challenges: Issues with the reliability of LLM outputs, potential for overfitting, and the need for effective regularization to manage the variability in generated data.
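The sketch below illustrates the mixing-and-fine-tuning step described above: combining human-coded and LLM-generated examples at the reported 80/20 optimum and fine-tuning a BERT classifier with Hugging Face Transformers. The file names, column schema (text plus an integer label column), and training hyperparameters are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: mix synthetic and human-coded data at a fixed ratio, then
# fine-tune BERT. File names and schema are hypothetical.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

SYNTHETIC_RATIO = 0.8  # 80% synthetic / 20% human-coded, the reported optimum

# Hypothetical files, each with columns: text (str), label (int class id).
human = pd.read_csv("human_coded.csv")
synthetic = pd.read_csv("llm_generated.csv")

# Size the synthetic pool so it forms SYNTHETIC_RATIO of the combined set:
# n_synth / (n_synth + n_human) = SYNTHETIC_RATIO.
n_synth = int(len(human) * SYNTHETIC_RATIO / (1 - SYNTHETIC_RATIO))
mixed = pd.concat([human, synthetic.sample(n=min(n_synth, len(synthetic)),
                                           random_state=42)])
mixed = mixed.sample(frac=1.0, random_state=42).reset_index(drop=True)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=int(mixed["label"].nunique()))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = Dataset.from_pandas(mixed).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-augmented",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```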
Implementation Barriers
Technical
Concerns about the validity and reliability of LLM outputs, including issues of hallucination and irrelevant information.
Proposed Solutions: Implementing rigorous evaluation measures and filtering inconsistent synthetic samples to improve model performance.
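One common way to filter inconsistent synthetic samples (a sketch of the general technique, not necessarily the authors' exact procedure) is to re-score each generated example with a baseline classifier trained on human-coded data only, and keep it only when the prediction reproduces the generated label at high confidence. The checkpoint path, label names, and threshold below are illustrative assumptions.

```python
from transformers import pipeline

# Baseline classifier fine-tuned on human-coded data only (hypothetical
# path); its id2label mapping is assumed to use the rubric codes.
clf = pipeline("text-classification", model="path/to/human-only-baseline",
               top_k=None)

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff

def is_consistent(text, claimed_label):
    """Keep a synthetic sample only if the baseline classifier reproduces
    its claimed label with sufficiently high confidence."""
    scores = clf([text])[0]  # per-input list of {label, score} dicts
    best = max(scores, key=lambda s: s["score"])
    return (best["label"] == claimed_label
            and best["score"] >= CONFIDENCE_THRESHOLD)

# Hypothetical LLM-generated samples with their intended rubric labels.
synthetic_samples = [
    {"text": "I can see you worked hard on that step!", "label": "praise_process"},
    {"text": "The answer is 42.", "label": "praise_process"},
]
filtered = [s for s in synthetic_samples
            if is_consistent(s["text"], s["label"])]
```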
Human Resource
The requirement for large amounts of human-coded data, which is labor-intensive to produce.
Proposed Solutions: Combining human-coded data with generative AI outputs to reduce the reliance on purely human-annotated datasets.
Methodological
Challenges in prompt engineering and ensuring diverse training data for effective LLM performance.
Proposed Solutions: Exploring advanced prompt techniques and optimizing generation protocols to balance variety and relevance in synthetic data.
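A minimal sketch of the generation protocol, assuming the OpenAI chat completions API: sweeping the temperature parameter trades off variety (higher temperatures) against relevance to the rubric (lower temperatures). The prompt wording, label, and temperature values are illustrative assumptions, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric-aligned prompt for generating tutor responses.
PROMPT = (
    "You are a tutor responding to a student who answered incorrectly. "
    "Write one short response that praises the student's effort, "
    "following the rubric: acknowledge effort, avoid outcome praise."
)

samples = []
for temperature in (0.2, 0.7, 1.0, 1.3):  # illustrative sweep
    for _ in range(5):
        completion = client.chat.completions.create(
            model="gpt-4o",  # generation model named in the summary
            messages=[{"role": "user", "content": PROMPT}],
            temperature=temperature,
        )
        samples.append({
            "text": completion.choices[0].message.content,
            "label": "praise_process",  # hypothetical rubric label
            "temperature": temperature,
        })
```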
Project Team
Conrad Borchers
Researcher
Danielle R. Thomas
Researcher
Jionghao Lin
Researcher
Ralph Abboud
Researcher
Kenneth R. Koedinger
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Conrad Borchers, Danielle R. Thomas, Jionghao Lin, Ralph Abboud, Kenneth R. Koedinger
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI