
Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

Project Overview

The document explores the use of generative AI, particularly Large Language Models (LLMs) and Generative Adversarial Networks (GANs), in the context of education, focusing on the generation of synthetic student data for learning analytics. It addresses the challenges posed by privacy concerns in collecting genuine student data, proposing synthetic data generation as a practical solution. The study assesses how effectively synthetic data can replicate real student data and its applicability in educational data science. Findings indicate that synthetic data closely resembles real data and performs similarly in predictive modeling tasks, thereby mitigating issues related to data scarcity and quality in learning analytics. Overall, the document underscores the promise of generative AI in enhancing educational research and analytics by providing a reliable alternative to traditional data collection methods while ensuring student privacy.
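One simple way to check how closely a synthetic column resembles its real counterpart is to compare category frequencies. The sketch below computes the total variation distance between two categorical samples. The metric choice, the function name, and the example values are illustrative assumptions; the summary does not say which resemblance measures the study used.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two categorical samples.

    0.0 means identical category frequencies; 1.0 means no shared categories.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Hypothetical example column (not data from the study): pass/fail outcomes.
real_grades = ["pass", "pass", "fail", "pass"]
synthetic_grades = ["pass", "fail", "fail", "pass"]
tvd = total_variation_distance(real_grades, synthetic_grades)
print(tvd)
```

A value near 0 suggests the synthetic column reproduces the real column's marginal distribution. It says nothing about relationships between columns, which is why a downstream predictive modeling comparison is a useful complement.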

Key Applications

Synthetic Data Generation using GANs and LLMs

Context: Educational data science, focusing on creating synthetic student data for learning analytics

Implementation: Used CTGAN and LLMs (GPT-2, DistilGPT2, DialoGPT) to generate synthetic tabular data that mimics real student data.

Outcomes: Generated synthetic datasets that closely resemble real data and perform similarly in predictive modeling tasks, helping to address data scarcity.

Challenges: Data quality and ethical concerns related to using real student data; difficulty of generating high-quality synthetic data.
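Language models such as GPT-2 operate on text, so tabular rows must be turned into text before fine-tuning and parsed back after generation. The sketch below shows one common serialization scheme ("column is value" phrases). The scheme, the function names, and the example record are assumptions for illustration; the summary does not describe the encoding the authors used.

```python
def row_to_text(row):
    """Serialize one record (a dict) into a sentence a language model can train on."""
    return ", ".join(f"{column} is {value}" for column, value in row.items())


def text_to_row(text, columns):
    """Parse a generated sentence back into a record.

    Returns None when any expected column is missing, so malformed
    generations can be discarded. Values are returned as strings.
    """
    row = {}
    for part in text.split(", "):
        column, separator, value = part.partition(" is ")
        if separator and column in columns:
            row[column] = value
    return row if set(row) == set(columns) else None


# Hypothetical student record (not data from the study).
student = {"age": 19, "study_hours": 12, "result": "pass"}
encoded = row_to_text(student)
decoded = text_to_row(encoded, columns=list(student))
rejected = text_to_row("age is 19, result pass", columns=list(student))
print(encoded)
print(decoded)
```

Rejecting incomplete rows matters because a generative model can emit sentences that skip or garble a column. This simple parser also assumes that values never contain ", " or " is ", which a real pipeline would need to handle.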

Implementation Barriers

Ethical

Concerns over privacy and data protection regulations limit the availability of real student data for learning analytics.

Proposed Solutions: Utilizing synthetic data generation as an alternative to real data to mitigate privacy concerns.

Data Quality and Scarcity

Issues such as incomplete datasets, inaccuracies, biases in data collection, and insufficient training data for machine learning models can undermine the validity of learning analytics, leading to flawed predictions and poor decision-making.

Proposed Solutions: Investing in data quality measures and data sharing mechanisms to enhance the reliability of learning analytics, along with utilizing synthetic data generation to provide additional data for training models effectively.

Project Team

Mohammad Khalil

Researcher

Farhad Vadiee

Researcher

Ronas Shakya

Researcher

Qinyi Liu

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Mohammad Khalil, Farhad Vadiee, Ronas Shakya, Qinyi Liu

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
