TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Project Overview
This paper presents TF1-EN-3M, a dataset of three million synthetic moral fables generated with instruction-tuned language models. The dataset addresses the scarcity of moral storytelling resources by providing structured narratives that teach moral lessons to young audiences. The paper describes how the dataset was generated, including prompt design and model evaluation, and emphasizes that smaller, more accessible models can produce high-quality educational content. The findings suggest that the dataset can support educational AI applications and that moral reasoning plays a central role in narrative generation, making such resources useful for teaching ethical values in an engaging way.
Key Applications
TF1-EN-3M Dataset
Context: Moral education through storytelling for young readers (ages 4-7).
Implementation: Generated with instruction-tuned models using structured prompt design, with a hybrid evaluation pipeline to assess the output.
Outcomes: Produced three million diverse moral fables that are coherent and age-appropriate, facilitating moral reasoning in educational settings.
Challenges: Keeping stories diverse while maintaining coherence, and avoiding over-reliance on templates, which can lead to repetitive narratives.
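To make the idea of structured prompt design concrete, the following is a minimal sketch of how slot values could be combined into generation prompts. It is an illustration under stated assumptions, not the authors' pipeline: the slot names, the slot values, the template wording, and the helper build_prompts are all hypothetical. Only the target age range (4-7) comes from the summary above.

```python
import itertools
import random

# Hypothetical slots and values, chosen only for illustration.
# The actual slots and vocabularies used for TF1-EN-3M are not listed in this summary.
SLOTS = {
    "character": ["a fox", "a young rabbit", "an old owl"],
    "setting": ["a quiet forest", "a busy farm"],
    "moral": ["honesty builds trust", "patience is rewarded"],
}

# Hypothetical prompt template; the age range matches the stated target audience.
TEMPLATE = (
    "Write a short fable for children aged 4 to 7. "
    "The main character is {character}, the story takes place in {setting}, "
    "and it ends with the moral: {moral}."
)


def build_prompts(slots, template, limit=None, seed=0):
    """Fill the template with every combination of slot values.

    The combinations are shuffled with a fixed seed so that a truncated
    sample (limit) does not favor the first values of each slot.
    """
    keys = list(slots)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(slots[k] for k in keys))
    ]
    random.Random(seed).shuffle(combos)
    if limit is not None:
        combos = combos[:limit]
    return [template.format(**combo) for combo in combos]


prompts = build_prompts(SLOTS, TEMPLATE)
print(len(prompts))  # 3 * 2 * 2 = 12 distinct prompts
print(prompts[0])
```

Because every prompt shares one template, the resulting stories can still feel repetitive even when each slot combination is unique, which is the template over-reliance risk noted under Challenges. Adding more slots or more values per slot increases the number of distinct prompts multiplicatively.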
Implementation Barriers
Technical Barrier
Running large models to generate quality narratives requires substantial computational resources.
Proposed Solutions: Using smaller, instruction-tuned models to generate high-quality content on consumer-grade hardware.
Cultural Barrier
The dataset primarily reflects Western moral traditions, which may not generalize across cultural contexts.
Proposed Solutions: Incorporating moral principles from diverse philosophical traditions into the dataset.
Project Team
Mihai Nadas
Researcher
Laura Diosan
Researcher
Andrei Piscoran
Researcher
Andreea Tomescu
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI