
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

Project Overview

This project presents ARES, a hybrid algorithm that alternates Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to strengthen multi-modal chain-of-thought reasoning in educational AI applications. By leveraging sentence-level feedback from advanced AI models such as GPT-4 and Claude 3 Opus, ARES improves the quality of rationale reasoning on educational datasets like ScienceQA and A-OKVQA. The approach addresses the instability of existing reinforcement learning methods and yields measurable gains in reasoning quality and inference accuracy. Overall, the findings indicate that generative AI techniques like ARES can support educational tools with more accurate and contextually relevant responses, enriching the learning experience and helping educators deliver personalized instruction.

Key Applications

ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning

Context: Educational settings utilizing multi-modal datasets such as ScienceQA and A-OKVQA, targeting students from elementary to high school levels.

Implementation: The ARES algorithm alternates between RL using sentence-level feedback from an AI model and SFT for correcting errors, stabilizing the model's outputs.

Outcomes: Achieved around 70% win rate in rationale reasoning quality compared to baseline models and a 2.5% increase in inference answer accuracy.

Challenges: Hyperparameter tuning during RL can produce repetitive or truncated sentences, and correction feedback is needed to stabilize the model.
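The alternating scheme described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function names (`ares_training_loop`, `score_fn`, `correct_fn`) and the dictionary-based "model" are hypothetical stand-ins, assuming an RL phase driven by per-sentence AI feedback followed by an SFT phase on AI-corrected rationales.

```python
from typing import Callable, List


def ares_training_loop(
    model: dict,
    rationales: List[str],
    score_fn: Callable[[str], List[float]],  # sentence-level AI feedback (e.g., GPT-4 / Claude 3 Opus)
    correct_fn: Callable[[str], str],        # AI-provided correction of a flawed rationale
    rounds: int = 3,
) -> dict:
    """Toy sketch of ARES: alternate an RL phase rewarded at the
    sentence level with an SFT phase on corrected rationales."""
    for _ in range(rounds):
        # --- RL phase: reinforce sentences the feedback model scores highly ---
        for rationale in rationales:
            rewards = score_fn(rationale)    # one score per sentence
            model["rl_updates"] += len(rewards)
            model["avg_reward"] = sum(rewards) / max(len(rewards), 1)
        # --- SFT phase: fine-tune on corrected rationales to stabilize output ---
        corrected = [correct_fn(r) for r in rationales]
        model["sft_updates"] += len(corrected)
        rationales = corrected               # continue training from stabilized text
    return model


# Usage with toy stand-ins for the AI feedback calls:
model = {"rl_updates": 0, "sft_updates": 0, "avg_reward": 0.0}
trained = ares_training_loop(
    model,
    ["The moon orbits Earth. Therefore tides occur."],
    score_fn=lambda r: [0.8 for _ in r.split(". ")],  # constant dummy sentence scores
    correct_fn=lambda r: r,                           # identity "correction" for the demo
    rounds=2,
)
```

The key design point the sketch captures is the ordering: each RL phase is followed by an SFT phase so that errors introduced by reward optimization (e.g., repetitive or truncated sentences) are corrected before the next RL round.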

Implementation Barriers

Technical barrier

Instability in Reinforcement Learning that requires extensive hyperparameter tuning, leading to issues like repetitive and truncated sentences.

Proposed Solutions: Using Supervised Fine-Tuning after RL to correct errors and stabilize the model.

Resource barrier

Cost and usage limits associated with API access to advanced AI models for feedback.

Proposed Solutions: Developing public or lower-cost alternatives for accessing necessary AI feedback.

Knowledge barrier

Difficulty in addressing complex tasks requiring external knowledge beyond the capabilities of the model.

Proposed Solutions: Future research to integrate external knowledge sources into the model.

Project Team

Ju-Seung Byun

Researcher

Jiyun Chun

Researcher

Jihyung Kil

Researcher

Andrew Perrault

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Ju-Seung Byun, Jiyun Chun, Jihyung Kil, Andrew Perrault

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
