Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis
Project Overview
The paper examines the role of generative AI, particularly large language models (LLMs), in formative assessment for STEM education, comparing LLM-based grading with conventional supervised classifiers trained on human-labeled student responses. The study finds that concept-based rubrics substantially improve LLM performance, and that rubric-guided LLMs can produce high-quality synthetic data for training more effective supervised models. Experiments on several STEM datasets illustrate the potential of LLMs to automate feedback and assessment, while the authors note ongoing challenges in prompt engineering and in achieving sufficient diversity in generated responses. Overall, the findings suggest that generative AI can provide timely and accurate formative feedback to students in STEM courses.
Key Applications
LLM-assisted assessment and data synthesis for training models
Context: Large introductory undergraduate STEM courses, focusing on formative assessment and on generating synthetic training data for downstream models.
Implementation: LLMs were used to assess student responses against concept-based rubrics and to generate synthetic labeled student responses, supporting both formative assessment and the creation of training data for lightweight supervised classifiers (a sketch of the assessment step appears below).
Outcomes: Concept-based rubrics improved the accuracy of LLM assessments and narrowed the performance gap with supervised models; the synthetic data was of high enough quality that models trained on it performed comparably to those trained on large human-labeled datasets.
Challenges: Careful prompt engineering is required to achieve high performance with LLMs; performance varies considerably across datasets and with rubric quality; and generated data must be checked for quality and alignment with intended labels, with further effort needed to increase response diversity.
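The following is a minimal sketch of the rubric-guided assessment step, assuming the OpenAI Python client and a chat model such as gpt-4o-mini; the question, rubric text, prompt wording, JSON output format, and helper names are illustrative assumptions, not the authors' actual rubrics or prompts.

```python
# Minimal sketch: score one student response against a concept-based rubric.
# The rubric, prompt wording, and output format are illustrative assumptions,
# not the prompts used in the paper.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """\
Question: Explain why increasing the sample size decreases the standard error of the mean.
Concepts a complete response should address:
1. definition: the standard error measures the variability of the sample mean.
2. inverse_relationship: the standard error shrinks in proportion to 1/sqrt(n).
3. interpretation: larger samples therefore give more precise estimates of the population mean.
"""

def assess_response(student_response: str) -> dict:
    """Ask the model which rubric concepts the response demonstrates."""
    prompt = (
        f"{RUBRIC}\n"
        f"Student response: {student_response}\n\n"
        "For each numbered concept, decide whether the response demonstrates it. "
        'Reply with JSON only, e.g. {"definition": true, "inverse_relationship": false, "interpretation": true}.'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic grading
    )
    reply = completion.choices[0].message.content.strip()
    # In practice the reply may arrive wrapped in a code fence; strip it first.
    reply = reply.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    return json.loads(reply)

if __name__ == "__main__":
    verdict = assess_response(
        "With more data points the sample mean varies less, so the standard "
        "error goes down like one over the square root of n."
    )
    print(verdict)
```

The per-concept booleans could then be aggregated into a score or mapped to feedback messages for the student.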
Implementation Barriers
Technical Barrier
The performance of LLMs depends significantly on trial-and-error prompt engineering, which can be complex and inconsistent.
Proposed Solutions: Developing better prompting strategies and using concept-based rubrics to guide LLM responses and improve accuracy.
Data Quality Barrier
The quality of synthetic data generated by LLMs can vary, affecting the training of supervised models.
Proposed Solutions: Implementing methods to enhance the diversity and quality of generated responses, such as re-annotating synthetic samples.
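A minimal sketch of the re-annotation idea follows, again assuming the OpenAI Python client; the prompts, label set, and agreement rule are illustrative assumptions rather than the authors' procedure. Synthetic responses are generated for a target label and kept only if a second, independent grading pass assigns the same label.

```python
# Minimal sketch: generate synthetic labeled responses, then re-annotate them
# with a second LLM pass and keep only the samples whose labels agree.
# Prompt wording, labels, and the agreement rule are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, temperature: float) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return completion.choices[0].message.content.strip()

def generate_synthetic(question: str, label: str, n: int = 5) -> list[str]:
    """Generate n synthetic student responses intended to deserve `label`."""
    return [
        ask(
            f"Question: {question}\n"
            f"Write one short student response that should be graded '{label}'. "
            "Vary wording and level of detail. Reply with the response text only.",
            temperature=1.0,  # higher temperature encourages more diverse responses
        )
        for _ in range(n)
    ]

def reannotate(question: str, response: str) -> str:
    """Second pass: grade the synthetic response from scratch."""
    return ask(
        f"Question: {question}\nStudent response: {response}\n"
        "Grade this response as 'correct' or 'incorrect'. Reply with one word.",
        temperature=0.0,
    ).lower().strip(".")

if __name__ == "__main__":
    question = "Explain why increasing the sample size decreases the standard error of the mean."
    dataset = []
    for label in ("correct", "incorrect"):
        for text in generate_synthetic(question, label):
            # Keep a sample only if the re-annotation matches the intended label,
            # filtering out off-label generations before training a classifier.
            if reannotate(question, text) == label:
                dataset.append((text, label))
    print(f"Kept {len(dataset)} synthetic training samples.")
```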
Project Team
Yuchen Wei
Researcher
Dennis Pearl
Researcher
Matthew Beckman
Researcher
Rebecca J. Passonneau
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Yuchen Wei, Dennis Pearl, Matthew Beckman, Rebecca J. Passonneau
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI