
Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Project Overview

The document explores the integration of generative AI, particularly Multimodal Large Language Models (MLLMs), into STEM education through Project-Based Learning (PBL). It introduces PBLBench, a benchmark for evaluating how well MLLMs assess PBL outcomes that span multiple modalities, including text, images, code, and video. To support this evaluation, the authors developed the PBL-STEM dataset. Initial findings suggest that MLLMs hold promise for enhancing educational assessment, but significant challenges remain, notably low ranking accuracy and considerable instability in the evaluation process. These issues highlight the need for further advances before MLLMs can be reliably integrated into educational frameworks and improve learning experiences and outcomes in STEM disciplines.

Key Applications

PBLBench and PBL-STEM dataset

Context: Educational settings focused on STEM disciplines, targeting students engaged in Project-Based Learning activities.

Implementation: The PBL-STEM dataset was created to cover diverse project outcomes spanning multiple modalities. PBLBench was developed to evaluate MLLMs on these outcomes against structured evaluation criteria.

Outcomes: The benchmark aims to provide a reliable assessment framework, help manage teacher workload, and improve feedback to students.

Challenges: Current models exhibit low ranking accuracy and hallucinations, making them unreliable for comprehensive PBL evaluations.
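
To make "ranking accuracy" concrete, the following sketch (not taken from the paper; the scores are invented) measures how closely a model's ranking of project submissions matches an expert's ranking, using Kendall's tau as one standard rank-correlation metric.

```python
# Illustrative only: comparing a model's project ranking against an
# expert ranking with Kendall's tau. Scores are hypothetical.
from scipy.stats import kendalltau

# Hypothetical scores for five project submissions.
expert_scores = [92, 78, 85, 60, 88]
model_scores = [88, 80, 70, 65, 90]

tau, p_value = kendalltau(expert_scores, model_scores)
print(f"Kendall's tau: {tau:.2f} (p={p_value:.3f})")
# tau near 1.0 means the model ranks projects like the expert;
# values near 0 reflect the kind of low ranking accuracy reported here.
```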

Implementation Barriers

Technical Limitations

Models often produce hallucinations and unstable outputs, leading to unreliable assessments.

Proposed Solutions: Developing self-verification mechanisms for models to enhance scoring stability.
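
A minimal sketch of what such a self-verification mechanism could look like (the paper's exact mechanism is not described here, and `ask_model` is a hypothetical scoring callable): query the model several times and accept a score only when repeated judgments agree closely.

```python
# Hedged sketch of a self-verification loop for scoring stability.
import statistics

def score_with_self_verification(ask_model, submission, n_samples=5, max_spread=5):
    """ask_model is a hypothetical callable returning a 0-100 score."""
    scores = [ask_model(submission) for _ in range(n_samples)]
    spread = max(scores) - min(scores)
    if spread > max_spread:
        return None, scores  # unstable: flag for human review
    return statistics.median(scores), scores  # stable: accept the median

# Usage with a toy stand-in model:
import random
toy_model = lambda submission: 80 + random.randint(-2, 2)
score, samples = score_with_self_verification(toy_model, "report.pdf")
```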

Evaluation Challenges

Existing benchmarks neither support free-form outputs nor apply rigorous validation processes.

Proposed Solutions: Implementing expert-driven evaluation methods like the Analytic Hierarchy Process (AHP) to derive structured criteria.
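
For illustration, the sketch below derives criterion weights with the standard AHP eigenvector method; the criteria and pairwise judgments are invented for the example, not taken from the paper.

```python
# Minimal AHP sketch: derive criterion weights from an expert
# pairwise-comparison matrix via the principal eigenvector.
import numpy as np

criteria = ["correctness", "creativity", "presentation"]
# A[i, j] = how much more important criterion i is than criterion j.
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)              # principal eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                 # normalize to sum to 1

# Consistency ratio: CR < 0.1 is the usual threshold for acceptable judgments.
n = len(criteria)
ci = (eigvals.real[k] - n) / (n - 1)
cr = ci / 0.58                           # 0.58 is Saaty's random index for n=3
for name, w in zip(criteria, weights):
    print(f"{name}: {w:.3f}")
print(f"consistency ratio: {cr:.3f}")
```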

Project Team

Yanhao Jia, Researcher

Xinyi Wu, Researcher

Qinglin Zhang, Researcher

Yiran Qin, Researcher

Luwei Xiao, Researcher

Shuai Zhao, Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Yanhao Jia, Xinyi Wu, Qinglin Zhang, Yiran Qin, Luwei Xiao, Shuai Zhao

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
