
Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

Project Overview

This project explores the application of generative AI, particularly large language models (LLMs) such as GPT-4o, in education, focusing on enhancing educational assessments through data augmentation. It examines the use of synthetic data generated by LLMs to augment human-coded datasets, improving text classification in open-response assessments. The authors propose a hybrid approach that combines the two data sources, increasing the accuracy and reliability of assessment classifiers. Experimental findings indicate that the ratio of synthetic to human-coded data and the temperature setting used during generation both significantly influence performance and its stability. Overall, integrating generative AI into educational assessment offers promising gains in evaluating student responses, pointing toward more effective and efficient assessment strategies.

Key Applications

Augmentation of human-coded datasets with synthetic LLM-generated samples for text classification

Context: Open-response assessments in educational settings, targeting tutors and students

Implementation: Human-coded data was combined with LLM-generated responses to fine-tune a BERT classifier, assessing various ratios of synthetic to human data and temperature settings for data generation.

Outcomes: Improved classifier performance in predicting appropriate tutor responses based on a coding rubric, with optimal results at an 80% synthetic to 20% human-coded data ratio.

Challenges: Issues with the reliability of LLM outputs, potential for overfitting, and the need for effective regularization to manage the variability in generated data.
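The augmentation step described above can be sketched in code. This is a minimal illustration, assuming (text, label) pairs as the data format; the `mix_datasets` helper, the toy dataset sizes, and the sampling strategy are illustrative assumptions, not the authors' released implementation.

```python
import random

def mix_datasets(human, synthetic, synthetic_ratio=0.8, seed=0):
    """Combine human-coded and LLM-generated samples at a target ratio.

    Keeps all human samples and draws enough synthetic samples so that
    synthetic data makes up `synthetic_ratio` of the final training set.
    """
    rng = random.Random(seed)
    # Solve n_synth / (n_synth + n_human) = synthetic_ratio for n_synth.
    n_synth = int(len(human) * synthetic_ratio / (1 - synthetic_ratio))
    drawn = rng.sample(synthetic, min(n_synth, len(synthetic)))
    mixed = human + drawn
    rng.shuffle(mixed)
    return mixed

# Toy example: 20 human-coded and 200 synthetic (text, label) pairs.
human = [(f"human response {i}", i % 2) for i in range(20)]
synthetic = [(f"synthetic response {i}", i % 2) for i in range(200)]

# At the paper's reported optimum (80% synthetic / 20% human),
# 20 human samples are paired with 80 synthetic samples.
train = mix_datasets(human, synthetic, synthetic_ratio=0.8)
print(len(train))  # 100 samples total
```

The mixed set would then be used to fine-tune a BERT classifier; that step is omitted here since it depends on the authors' specific model and rubric.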

Implementation Barriers

Technical

Concerns about the validity and reliability of LLM outputs, including issues of hallucination and irrelevant information.

Proposed Solutions: Implementing rigorous evaluation measures and filtering inconsistent synthetic samples to improve model performance.
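One way to filter inconsistent synthetic samples is a consistency check across repeated generations: keep a sample only when the LLM assigns it the same label reliably. This is a hedged sketch of that idea; the `filter_inconsistent` helper, the majority-vote criterion, and the example texts are assumptions for illustration, not the paper's exact filtering procedure.

```python
from collections import Counter

def filter_inconsistent(samples, min_agreement=2):
    """Keep a synthetic sample only if repeated LLM runs agree on its label.

    `samples` maps each generated text to the list of labels assigned
    across repeated runs; texts without a sufficiently dominant label
    are dropped as unreliable.
    """
    kept = []
    for text, labels in samples.items():
        label, count = Counter(labels).most_common(1)[0]
        if count >= min_agreement:
            kept.append((text, label))
    return kept

# Hypothetical candidates with labels from three (or two) repeated runs.
candidates = {
    "Praise the tutor's specific feedback": [1, 1, 1],  # fully consistent
    "Off-topic remark about the weather": [0, 1, 0],    # majority label 0
    "Ambiguous, contradictory response": [0, 1],        # no agreement: dropped
}

kept = filter_inconsistent(candidates)
print(len(kept))  # 2 samples survive the filter
```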

Human Resource

The requirement for significant amounts of human-coded data, which is labor-intensive to procure.

Proposed Solutions: Combining human-coded data with generative AI outputs to reduce the reliance on purely human-annotated datasets.

Methodological

Challenges in prompt engineering and ensuring diverse training data for effective LLM performance.

Proposed Solutions: Exploring advanced prompt techniques and optimizing generation protocols to balance variety and relevance in synthetic data.
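A generation protocol of the kind proposed above can be organized as a sweep over prompt templates and temperature settings, trading variety (higher temperature, varied prompts) against relevance (lower temperature, rubric-grounded prompts). The template names, prompt wordings, and temperature values below are illustrative assumptions, not the configurations actually explored in the study.

```python
from itertools import product

# Illustrative prompt templates; the study's actual prompts may differ.
templates = {
    "direct": "Label this tutor response as desired (1) or not (0): {response}",
    "rubric": "Using the coding rubric, classify the tutor response: {response}",
    "few_shot": "Given these labeled examples ... classify: {response}",
}
# Illustrative temperature settings to sweep.
temperatures = [0.0, 0.7, 1.0]

def generation_protocol(templates, temperatures):
    """Enumerate (template, temperature) configurations for synthetic
    data generation, so each can be evaluated for downstream classifier
    performance before committing to a single setting."""
    return [
        {"template": name, "prompt": text, "temperature": t}
        for (name, text), t in product(templates.items(), temperatures)
    ]

configs = generation_protocol(templates, temperatures)
print(len(configs))  # 3 templates x 3 temperatures = 9 configurations
```

Each configuration would then drive calls to the LLM, with the resulting samples scored by the same filtering and ratio experiments described earlier.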

Project Team

Conrad Borchers

Researcher

Danielle R. Thomas

Researcher

Jionghao Lin

Researcher

Ralph Abboud

Researcher

Kenneth R. Koedinger

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Conrad Borchers, Danielle R. Thomas, Jionghao Lin, Ralph Abboud, Kenneth R. Koedinger

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
