Evaluating Multimodal Generative AI with Korean Educational Standards
Project Overview
This document presents an evaluation of multimodal generative AI, specifically Multimodal Large Language Models (MLLMs), in an educational context using KoNET, a benchmark built from Korean national educational tests. The study assesses the performance of open-source and closed-API models across different educational levels, analyzing their accuracy and error patterns against human performance. The key application examined is the models' ability to handle the kinds of tasks assessed in national tests, such as AI-powered assessment and tutoring. The findings reveal a performance disparity between open-source and closed models and highlight the influence of linguistic and cultural specificity on AI capabilities. The research aims to foster advances in multilingual AI, promote inclusivity, and encourage diverse applications of AI in education.
Key Applications
AI-Powered Assessment & Tutoring (KoNET Benchmark)
Context: Evaluating Multimodal Large Language Models (MLLMs) on their ability to solve problems from Korean educational tests (KoEGED, KoMGED, KoHGED, KoCSAT) spanning different educational levels, framed as visual question answering (VQA) over question images (an illustrative item record is sketched below). This benchmark supports the evaluation of AI-driven educational technologies such as AI tutoring.
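To make the item format concrete, one KoNET-style question could be represented as a record like the following. The field names and the sample values are hypothetical, chosen for illustration rather than taken from the released dataset.

```python
from dataclasses import dataclass

# Hypothetical schema for one KoNET-style item (field names are illustrative,
# not the official dataset format). The question text and answer choices are
# rendered inside the image, so the model must read them visually.
@dataclass
class KoNETItem:
    exam: str                # one of "KoEGED", "KoMGED", "KoHGED", "KoCSAT"
    subject: str             # e.g. "Korean", "Mathematics", "English"
    question_id: str         # identifier within the exam paper
    image_path: str          # rendered question image (text and choices embedded)
    answer: int              # gold answer index for a multiple-choice item

sample = KoNETItem(
    exam="KoCSAT",
    subject="Mathematics",
    question_id="math-07",
    image_path="images/kocsat_math_07.png",
    answer=3,
)
```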
Implementation: The KoNET benchmark is built from publicly available PDFs from the Korea Institute of Curriculum and Evaluation. It converts questions from Korean national educational tests into a multimodal VQA format in which the question and answer choices are embedded within images. The evaluation uses Chain-of-Thought (CoT) prompting, an off-the-shelf OCR API to extract the text content of question images, and an LLM-as-a-Judge approach to score responses; a sketch of such an evaluation loop is given below. The benchmark results are used to identify core competencies crucial for AI-driven educational technologies.
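A minimal sketch of how such an evaluation loop could be wired together is shown below. The prompt wording and the placeholder helpers (`call_model`, `run_ocr`, `call_judge`) are assumptions for illustration, not the authors' released code or any specific vendor API.

```python
# Minimal sketch of a KoNET-style evaluation loop. Plug in whatever MLLM,
# OCR service, and judge model you actually use; the helpers below are stubs.

COT_PROMPT = (
    "문제를 읽고 단계별로 풀이한 뒤, 마지막 줄에 정답 번호만 쓰세요."
    # "Read the problem, reason step by step, then write only the answer number on the last line."
)

def call_model(prompt: str, image_path: str | None = None) -> str:
    """Placeholder: query the model under test (image omitted for text-only LLMs)."""
    raise NotImplementedError

def run_ocr(image_path: str) -> str:
    """Placeholder: extract the question text from the image for text-only LLMs."""
    raise NotImplementedError

def call_judge(model_response: str, gold_answer: int) -> bool:
    """Placeholder: an LLM-as-a-Judge decides whether the free-form CoT response
    arrives at the gold answer, which is more robust than naive string matching."""
    raise NotImplementedError

def evaluate_item(item, text_only: bool = False) -> bool:
    if text_only:
        # Text-only LLMs cannot see the image, so OCR output is substituted.
        question_text = run_ocr(item.image_path)
        response = call_model(prompt=f"{COT_PROMPT}\n\n{question_text}")
    else:
        # MLLMs receive the rendered question image directly.
        response = call_model(prompt=COT_PROMPT, image_path=item.image_path)
    return call_judge(model_response=response, gold_answer=item.answer)

def accuracy(items) -> float:
    return sum(evaluate_item(it) for it in items) / len(items)
```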
Outcomes: Provides a comprehensive evaluation of MLLMs in Korean, enabling a more realistic assessment of multimodal comprehension and reasoning. Allows direct comparison with human performance and offers insight into model behavior in non-English contexts, revealing the impact of linguistic and cultural specificity on AI performance. Has potential real-world applicability in the AI tutoring market.
Challenges: Model accuracy drops as the educational level advances through the Korean curriculum. Open-source models may lack tuning for Korean domains, and open-source MLLMs in particular may struggle with Korean text recognition; MLLMs sometimes lag behind their text-only LLM counterparts. The primary format is multiple-choice QA, which may not fully capture a model's capacity to articulate its problem-solving process.
Implementation Barriers
Data scarcity/Bias
The primary format is multiple-choice QA, which may not fully capture a model’s capacity to articulate problem-solving processes. Periodic updates to the test set are necessary to mitigate potential biases and data contamination upon public release.
Proposed Solutions: Future work could evaluate models' reasoning by scoring the rationales behind their answers, which would require developing comprehensive reference answers. The dataset construction methodology, together with the open-source dataset builder, lets the research community continuously update KoNET, keeping it relevant and useful; a sketch of what one refresh step might look like is given below.
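The released dataset builder is not reproduced here; the sketch below only illustrates the general shape of one refresh step, assuming a newly published exam PDF has already been downloaded from the public source. The `pdf2image` dependency and all file paths are assumptions for illustration.

```python
# Illustrative outline of a KoNET-style dataset refresh step: render each page
# of a downloaded exam PDF to an image for later per-question cropping.
# Uses pdf2image (a wrapper around poppler); paths and filenames are hypothetical.
from pathlib import Path
from pdf2image import convert_from_path

def render_exam_pages(pdf_path: str, out_dir: str, dpi: int = 200) -> list[Path]:
    """Render every page of an exam PDF to a PNG and return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)
    saved = []
    for i, page in enumerate(pages, start=1):
        path = out / f"{Path(pdf_path).stem}_p{i:02d}.png"
        page.save(path)
        saved.append(path)
    return saved

# Example: render_exam_pages("downloads/kocsat_latest.pdf", "images/kocsat_latest")
```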
Model Performance
Open-source models may lack tuning for Korean domains
Proposed Solutions: Greater focus on non-English languages in future research and open-source AI development.
Project Team
Sanghee Park
Researcher
Geewook Kim
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Sanghee Park, Geewook Kim
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gemini-2.0-flash-lite