Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models
Project Overview
The document explores the application of generative AI in education through the SPHERE framework, which is designed to enhance mathematical reasoning in small language models (SLMs). SPHERE comprises three stages: Self-Generation, Self-Correction, and Diversity Generation, which together strengthen a model's ability to tackle complex multi-step reasoning tasks. By using a pruned Monte Carlo Tree Search (MCTS) to explore reasoning paths efficiently, SPHERE enables models to learn from both correct and incorrect reasoning trajectories and thereby self-evolve without human supervision. Evaluations show substantial performance improvements over baseline models, underscoring the potential of generative AI to refine multi-step reasoning capabilities, and in particular students' mathematical reasoning skills, in educational settings.
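As a rough illustration of this three-stage design, the sketch below wires the stages into a single loop that collects (chosen, rejected) preference pairs. Every name in it (Trajectory, mcts_generate, self_correct, diversify) is a hypothetical placeholder with stubbed-out model calls, not the authors' implementation.

```python
# Minimal sketch of the three-stage SPHERE pipeline. All names here
# (Trajectory, mcts_generate, self_correct, diversify) are hypothetical
# placeholders, and the model calls are stubbed out for illustration.
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list          # chain-of-thought reasoning steps
    is_correct: bool     # does the final answer match the reference?

def mcts_generate(model, problem):
    """Stage 1: Self-Generation -- explore reasoning paths with pruned
    MCTS, keeping both correct and incorrect trajectories (stubbed)."""
    return [Trajectory(["step 1", "step 2"], True),
            Trajectory(["step 1", "bad step"], False)]

def self_correct(model, problem, bad):
    """Stage 2: Self-Correction -- the model revises its own failed
    trajectory, turning an error into a corrected solution (stubbed)."""
    return Trajectory(bad.steps[:-1] + ["fixed step"], True)

def diversify(model, problem, good):
    """Stage 3: Diversity Generation -- produce an alternative correct
    solution so the data covers multiple reasoning styles (stubbed)."""
    return Trajectory(["alt step 1", "alt step 2"], True)

def sphere_pipeline(model, problems):
    """Collect (chosen, rejected) preference pairs across all stages."""
    pairs = []
    for problem in problems:
        trajectories = mcts_generate(model, problem)
        correct = [t for t in trajectories if t.is_correct]
        wrong = [t for t in trajectories if not t.is_correct]
        # Stage 1 pairs: correct vs. incorrect search trajectories.
        pairs += [(c, w) for c in correct for w in wrong]
        # Stage 2 pairs: a self-corrected solution vs. the original failure.
        for w in wrong:
            fixed = self_correct(model, problem, w)
            if fixed.is_correct:
                pairs.append((fixed, w))
        # Stage 3 pairs: diverse correct alternatives vs. known failures.
        for c in correct:
            alt = diversify(model, problem, c)
            pairs += [(alt, w) for w in wrong]
    return pairs

print(len(sphere_pipeline(model=None, problems=["toy problem"])))  # -> 3
```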
Key Applications
SPHERE - Self-Evolved Preference Optimization
Context: Enhancing mathematical reasoning in small language models (SLMs); target audience: students and educators using AI tools for math problem-solving.
Implementation: A three-stage process (Self-Generation, Self-Correction, Diversity Generation) uses pruned MCTS to generate high-quality preference data without human supervision (a sketch of the resulting preference records follows this list).
Outcomes: Models trained with SPHERE showed significant performance improvements on benchmarks such as MATH 500 and GSM8K, surpassing baseline models including GPT-4o.
Challenges: The computational intensity of MCTS rollouts and the difficulty of ensuring comprehensive coverage of failure cases.
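For concreteness, a single record of the preference data mentioned above might look like the following. The (prompt, chosen, rejected) schema is an assumption borrowed from common DPO-style trainers, not the paper's exact data format; the problem is a standard GSM8K-style example.

```python
# Hypothetical preference record in the (prompt, chosen, rejected) format
# used by common DPO-style trainers; the schema is an assumption, not the
# paper's exact data format.
import json

record = {
    "prompt": "Natalia sold clips to 48 friends in April, and then half as "
              "many in May. How many clips did she sell altogether?",
    "chosen": "April: 48 clips. May: 48 / 2 = 24 clips. "
              "Total: 48 + 24 = 72. The answer is 72.",
    "rejected": "April: 48 clips. May: 48 / 2 = 24 clips. "
                "Total: 48 - 24 = 24. The answer is 24.",
}

print(json.dumps(record, indent=2))
```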
Implementation Barriers
Technical Barrier
MCTS rollouts are computationally intensive, making large-scale training challenging.
Proposed Solutions: Future work could explore more efficient search strategies or adaptive pruning techniques to reduce overhead.
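One concrete form such adaptive pruning could take is sketched below: during selection, children that have been visited enough times but whose mean value lags far behind the best sibling are dropped, so rollouts concentrate on promising branches. The Node layout, visit threshold, and pruning ratio are assumptions for illustration, not the paper's actual pruning rule.

```python
# Illustrative adaptive pruning inside MCTS selection. The Node layout,
# visit threshold, and pruning ratio are assumptions for this sketch,
# not the paper's actual pruning rule. Values are assumed to lie in [0, 1].
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    @property
    def value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node, c_uct=1.4, min_visits=4, prune_ratio=0.5) -> Node:
    """Pick a child by UCT after dropping clearly weak branches."""
    best_value = max(ch.value for ch in node.children)
    # Adaptive pruning: discard well-explored children whose mean value
    # lags far behind the best sibling, so later rollouts skip them.
    node.children = [
        ch for ch in node.children
        if ch.visits < min_visits or ch.value >= prune_ratio * best_value
    ]

    def uct(ch: Node) -> float:
        if ch.visits == 0:
            return float("inf")  # always try unvisited children first
        explore = c_uct * math.sqrt(math.log(max(node.visits, 1)) / ch.visits)
        return ch.value + explore

    return max(node.children, key=uct)
```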
Coverage Barrier
Self-correction does not guarantee exhaustive coverage of all failure cases.
Proposed Solutions: Incorporating external verifiers or broader failure taxonomies could enhance robustness.
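As an example of what an external verifier might look like, the sketch below checks answer equivalence symbolically with SymPy instead of trusting the model's own judgment. This is one plausible verifier, assumed for illustration rather than taken from the paper.

```python
# A minimal external answer verifier using SymPy's symbolic equivalence
# check. This is one possible verifier, assumed for illustration; the
# paper's verification setup may differ.
import sympy as sp

def answers_match(predicted: str, reference: str) -> bool:
    """Return True if two answer strings are mathematically equivalent."""
    try:
        p = sp.sympify(predicted)
        r = sp.sympify(reference)
        return bool(p.equals(r))  # e.g. "1/2" matches "0.5"
    except (sp.SympifyError, TypeError):
        # Fall back to normalized string comparison for non-symbolic answers.
        return predicted.strip().lower() == reference.strip().lower()

assert answers_match("1/2", "0.5")
assert not answers_match("72", "24")
```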
Domain Limitation
The primary focus on mathematical reasoning may limit generalization to other structured domains.
Proposed Solutions: Future work could apply the framework to other structured domains, such as program synthesis or theorem proving.
Project Team
Joykirat Singh
Researcher
Tanmoy Chakraborty
Researcher
Akshay Nambi
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Joykirat Singh, Tanmoy Chakraborty, Akshay Nambi
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI