Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation
Project Overview
The document explores the use of generative AI, particularly Large Language Models (LLMs) and Generative Adversarial Networks (GANs), in the context of education, focusing on the generation of synthetic student data for learning analytics. It addresses the challenges posed by privacy concerns in collecting genuine student data, proposing synthetic data generation as a practical solution. The study assesses how effectively synthetic data can replicate real student data and its applicability in educational data science. Findings indicate that synthetic data closely resembles real data and performs similarly in predictive modeling tasks, thereby mitigating issues related to data scarcity and quality in learning analytics. Overall, the document underscores the promise of generative AI in enhancing educational research and analytics by providing a reliable alternative to traditional data collection methods while ensuring student privacy.
Key Applications
Synthetic Data Generation using GANs and LLMs
Context: Educational data science, focusing on creating synthetic student data for learning analytics
Implementation: Utilized CTGAN and LLMs (GPT2, DistilGPT2, DialoGPT) to generate synthetic tabular data that mimics real student data.
Outcomes: Generated synthetic datasets that closely resemble real data and perform similarly in predictive modeling tasks, helping to address data scarcity.
Challenges: Data quality and ethical concerns related to using real student data; difficulty of generating high-quality synthetic data.
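This summary does not describe how tabular records are presented to the language models. One common approach, assumed here purely for illustration, is to serialize each row as short "column is value" clauses, fine-tune the model on that text, and parse generated text back into rows. The sketch below shows only the encode and decode steps; the column names are hypothetical and the code is not taken from the paper.

```python
import random


def encode_row(row, rng=None):
    """Serialize one tabular record as 'column is value' clauses.

    If a random.Random instance is passed, the column order is shuffled,
    which some text-based tabular generators do so that the model does not
    depend on a fixed column order. (Assumption, not the paper's method.)
    """
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)


def decode_row(text, columns):
    """Parse generated text back into a record.

    Returns None when the text is malformed or does not contain exactly
    the expected columns, so invalid samples can be discarded.
    Limitation: values that themselves contain ', ' or ' is ' are not
    supported by this simple format. All values come back as strings.
    """
    row = {}
    for clause in text.split(", "):
        col, sep, val = clause.partition(" is ")
        if not sep or col not in columns:
            return None
        row[col] = val
    return row if set(row) == set(columns) else None


# Hypothetical student record; these column names are illustrative only.
student = {"gender": "F", "credits": "60", "final_result": "Pass"}
sentence = encode_row(student)
restored = decode_row(sentence, list(student))
shuffled = encode_row(student, rng=random.Random(0))
```

Rejecting malformed generations in the decode step matters in practice, because a language model can emit text that names unknown columns or omits some of them.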
Implementation Barriers
Ethical
Concerns over privacy and data protection regulations limit the availability of real student data for learning analytics.
Proposed Solutions: Utilizing synthetic data generation as an alternative to real data to mitigate privacy concerns.
Data Quality and Scarcity
Issues such as incomplete datasets, inaccuracies, biases in data collection, and insufficient training data for machine learning models can undermine the validity of learning analytics, leading to flawed predictions and poorly informed decision-making.
Proposed Solutions: Investing in data quality measures and data sharing mechanisms to enhance the reliability of learning analytics, along with utilizing synthetic data generation to provide additional data for training models effectively.
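Before synthetic records are added to a training set, it is reasonable to check that they follow the same distributions as the real data. A minimal sketch of one such check follows, assuming a categorical column and using total variation distance as the measure. The metric choice and the example values are assumptions made for illustration; the summary does not state which resemblance measures the paper used.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two categorical samples.

    0.0 means the category frequencies are identical; 1.0 means the two
    samples share no categories at all.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Hypothetical outcome column from a real and a synthetic dataset.
real = ["Pass", "Pass", "Fail", "Withdrawn"]
synthetic = ["Pass", "Fail", "Fail", "Withdrawn"]
distance = total_variation_distance(real, synthetic)
```

A per-column check like this only compares marginal distributions. It says nothing about relationships between columns, which is why comparing predictive models trained on real and on synthetic data is a useful complement.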
Project Team
Mohammad Khalil
Researcher
Farhad Vadiee
Researcher
Ronas Shakya
Researcher
Qinyi Liu
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Mohammad Khalil, Farhad Vadiee, Ronas Shakya, Qinyi Liu
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI