
Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis

Project Overview

The document examines the role of generative AI, particularly large language models (LLMs), in improving formative assessment in STEM education. It highlights the limitations of conventional supervised classifiers for evaluating student responses relative to the capabilities of LLMs. The study shows that concept-based rubrics can substantially improve LLM performance, allowing the models to produce high-quality synthetic data that can in turn train more effective supervised models. Experiments on diverse STEM datasets illustrate the potential of LLMs to automate feedback and assessment, thereby improving educational outcomes. The research also acknowledges ongoing challenges with prompt engineering and the need for greater data diversity to make these AI tools effective in educational settings. Overall, the findings suggest that generative AI holds significant promise for improving assessment methodology in STEM education by providing students with timely and accurate feedback.

Key Applications

LLM-assisted assessment and data synthesis for training models

Context: Large introductory STEM classes for undergraduate students, with a focus on formative assessment and on generating synthetic training data for educational applications.

Implementation: LLMs assessed student responses using concept-based rubrics and generated synthetic labeled student responses. The combined approach served both formative assessment and the creation of training data for lightweight classifiers (a minimal sketch of this training step follows the challenges below).

Outcomes:
- Improved accuracy of LLM assessments, narrowing the performance gap with traditional supervised models.
- Enhanced quality of synthetic training data, allowing models trained on it to perform comparably to models trained on large human-labeled datasets.

Challenges:
- Careful prompt engineering is required to achieve high performance with LLMs.
- Performance varies significantly with the dataset and the quality of the rubrics.
- The quality of generated data and its alignment with the intended labels must be verified, and response diversity needs to be increased.
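
A minimal sketch of the lightweight-classifier step described above (not the authors' code): it assumes the LLM has already produced labeled synthetic responses, and it uses scikit-learn, which is an assumed toolkit choice rather than one named in the source.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data: LLM-generated student responses with rubric labels.
synthetic_responses = [
    "Larger samples make the sample mean less variable.",
    "The sample mean is always exactly equal to the population mean.",
]
synthetic_labels = ["Correct", "Incorrect"]

# TF-IDF features plus logistic regression stand in for any inexpensive
# supervised model that can grade responses without calling the LLM.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(synthetic_responses, synthetic_labels)

print(classifier.predict(["Bigger samples give a more stable average."]))

In practice the synthetic training set would be far larger; the two examples here only keep the sketch self-contained and runnable.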

Implementation Barriers

Technical Barrier

LLM performance depends heavily on trial-and-error prompt engineering, which is complex and can yield inconsistent results.

Proposed Solutions: Developing better prompting strategies and using concept-based rubrics to guide LLM responses and improve accuracy.
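
To make this proposal concrete, the following is a minimal sketch of rubric-guided prompting. It assumes the OpenAI Python client and the model version listed in the metadata below; the question, rubric text, label set, and prompt wording are illustrative placeholders, not taken from the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical concept-based rubric: the discrete concepts a complete answer covers.
RUBRIC = """A complete answer addresses:
1. The sample mean estimates the population mean.
2. Larger samples reduce sampling variability.
"""

def assess(student_response: str) -> str:
    """Grade one response against the rubric and return the LLM's label and feedback."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system",
             "content": "Grade the student response using only the rubric concepts. "
                        "Reply with Correct, Partial, or Incorrect, then one sentence of feedback."},
            {"role": "user",
             "content": f"Rubric:\n{RUBRIC}\nStudent response:\n{student_response}"},
        ],
    )
    return completion.choices[0].message.content

print(assess("Bigger samples give averages closer to the true mean."))

The explicit list of rubric concepts, rather than a request for a holistic judgment, is the element the study associates with better LLM performance.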

Data Quality Barrier

The quality of synthetic data generated by LLMs can vary, affecting the training of supervised models.

Proposed Solutions: Implementing methods to enhance the diversity and quality of generated responses, such as re-annotating synthetic samples.
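
One way the re-annotation step could look is sketched below; this is an assumption about a possible realization, not the paper's pipeline. Each synthetic response is re-graded against the rubric, and only samples whose new label matches the intended one are kept.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini-2024-07-18"  # placeholder model version

def relabel(response_text: str, rubric: str) -> str:
    """Re-annotate one synthetic response against the concept-based rubric."""
    completion = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (f"Rubric:\n{rubric}\n\nStudent response:\n{response_text}\n\n"
                        "Reply with exactly one word: Correct, Partial, or Incorrect."),
        }],
    )
    return completion.choices[0].message.content.strip()

def filter_synthetic(samples: list[tuple[str, str]], rubric: str) -> list[tuple[str, str]]:
    """Keep only (response, intended_label) pairs whose re-annotated label agrees."""
    return [(text, label) for text, label in samples
            if relabel(text, rubric).lower() == label.lower()]

Discarding disagreements trades synthetic-set size for label quality; response diversity would need to be addressed separately at generation time.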

Project Team

Yuchen Wei

Researcher

Dennis Pearl

Researcher

Matthew Beckman

Researcher

Rebecca J. Passonneau

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Yuchen Wei, Dennis Pearl, Matthew Beckman, Rebecca J. Passonneau

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
