Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
Project Overview
This document explores the application of Large Language Models (LLMs) to Automated Educational Question Generation (AEQG), focusing on their ability to create diverse, high-quality educational questions across the cognitive levels of Bloom's taxonomy. It assesses the performance of five LLMs, analyzing how factors such as prompt complexity and model size influence the quality of the generated questions. The findings indicate that LLMs can produce relevant questions, though performance varies significantly with the model and the prompting technique employed, with larger models such as GPT-4 delivering the best results. The document also highlights the difficulty of assessing the quality of machine-generated questions against human-written ones, emphasizing the need for careful evaluation methods. Overall, the study underscores the potential of generative AI to enhance educational practice while acknowledging the complexities involved in its implementation and assessment.
Key Applications
Automated Educational Question Generation (AEQG) using Large Language Models
Context: Graduate-level data science course
Implementation: Used five LLMs (Mistral, Llama 2, PaLM 2, GPT-3.5, GPT-4) with various prompting strategies to generate questions aligned with Bloom's taxonomy; experts then evaluated the questions for quality.
Outcomes: 78% of the generated questions were rated high quality, demonstrating that LLMs can produce diverse educational questions. The best results came from prompts that included Chain-of-Thought (CoT) instructions; a sketch of such a prompt follows this list.
Challenges: Performance varied across LLMs, models tended to generate lower-order questions, and automated evaluations proved less effective than human evaluations.
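To make the prompting strategy concrete, the following is a minimal sketch of generating one question per Bloom's level with a CoT prompt. It assumes the OpenAI Python SDK (v1+); the prompt wording, few-shot example, model name, and topic are illustrative placeholders, not the authors' exact prompts.

```python
# Minimal sketch: one question per Bloom's level via a Chain-of-Thought
# prompt. The prompt text and example are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BLOOM_LEVELS = ["Remember", "Understand", "Apply",
                "Analyze", "Evaluate", "Create"]

def generate_question(topic: str, level: str) -> str:
    """Request a single question at the given Bloom's taxonomy level."""
    prompt = (
        "You are an instructor for a graduate-level data science course.\n"
        f"Topic: {topic}\n"
        f"Target Bloom's taxonomy level: {level}\n"
        "Think step by step: first state the cognitive skill this level "
        "demands, then pick a concept from the topic that exercises it, "
        "and finally write one assessment question on its own line, "
        "prefixed with 'QUESTION:'.\n"
        "Example (level Apply): QUESTION: Given a dataset with missing "
        "values, apply k-nearest-neighbour imputation and justify your "
        "choice of k."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # any of the five studied models could be swapped in
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for level in BLOOM_LEVELS:
        print(level, "->",
              generate_question("regularization in linear models", level))
```

Placing the step-by-step reasoning instruction before the final 'QUESTION:' line mirrors the CoT strategy that the study found produced the best results.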
Implementation Barriers
Implementation Challenge
The performance of LLMs varies significantly based on their size and the complexity of the prompts used, leading to inconsistent quality in generated questions.
Proposed Solutions: Use optimal prompting strategies that balance complexity with clarity (e.g., including CoT instructions and examples).
Evaluation Challenge
Automated evaluation methods are not on par with human evaluations, leading to discrepancies in quality assessments.
Proposed Solutions: Future work should focus on training LLMs on evaluation datasets to improve robustness and alignment with expert evaluations; a sketch of rubric-based LLM scoring follows.
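As one illustration of such automated evaluation, the following minimal sketch scores a generated question against a rubric using an LLM. The rubric dimensions, JSON schema, and model name are assumptions for illustration and may differ from the paper's expert rubric.

```python
# Minimal sketch: rubric-based scoring of a generated question by an
# LLM judge. Rubric dimensions and JSON schema are assumed, not the
# paper's expert rubric.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the question on each dimension from 1 (poor) to 5 (excellent): "
    "relevance to the topic, grammatical quality, answerability, and fit "
    "to the stated Bloom's level. Respond with only a JSON object, e.g. "
    '{"relevance": 5, "grammar": 4, "answerability": 5, "bloom_fit": 3}.'
)

def evaluate_question(question: str, topic: str, level: str) -> dict:
    """Score one generated question against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Topic: {topic}\nBloom's level: {level}\n"
                        f"Question: {question}\n\n{RUBRIC}"),
        }],
        temperature=0.0,  # deterministic scoring
    )
    # Assumes the model returns bare JSON; production code should
    # validate the output and retry on malformed responses.
    return json.loads(response.choices[0].message.content)
```

Scores produced this way would still need calibration against expert ratings, which is precisely the alignment gap the study identifies.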
Project Team
Nicy Scaria
Researcher
Suma Dharani Chenna
Researcher
Deepak Subramani
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Nicy Scaria, Suma Dharani Chenna, Deepak Subramani
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI