Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
Project Overview
This document examines generative AI in education, focusing on evaluating the mathematical reasoning capabilities of Large Language Models (LLMs). It compares leading models, including OpenAI's GPT-4o and DeepSeek-R1, across multiple benchmarks, identifying their strengths on complex mathematical problems as well as their weaknesses. Key educational applications include personalized learning, automated tutoring, and interactive problem-solving that increases student engagement. The findings point to two recurring obstacles: the high computational cost of running advanced models and their limited generalization across mathematical domains. Generative AI thus holds significant potential to transform learning experiences, but realizing that potential requires continued work on the models' reasoning skills, efficiency, and accessibility.
Key Applications
Evaluation of mathematical reasoning in LLMs
Context: Advanced mathematical problem-solving in educational settings, targeting researchers and developers in AI and education.
Implementation: Systematic evaluation using benchmark datasets (MATH, GSM8K, MMLU) with a zero-shot prompting strategy to assess the models' reasoning capabilities (a minimal evaluation harness is sketched after this list).
Outcomes: Comparative analysis revealing LLMs' strengths and weaknesses in mathematical reasoning, with implications for future AI model development.
Challenges: High computational demands, performance inconsistencies in specialized tasks, and limitations in generalization across diverse mathematical domains.
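To make the evaluation setup concrete, here is a minimal zero-shot harness sketch in Python. It assumes the OpenAI chat completions API and the Hugging Face `datasets` copy of GSM8K; the prompt wording, sample size, and answer-extraction regex are illustrative choices, not the paper's exact protocol.

```python
import re
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_number(text: str) -> str | None:
    """Return the last number in a string, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def evaluate_zero_shot(model: str = "gpt-4o-mini", n_samples: int = 100) -> float:
    """Score a model on GSM8K test questions with a plain zero-shot prompt."""
    data = load_dataset("gsm8k", "main", split=f"test[:{n_samples}]")
    correct = 0
    for item in data:
        # GSM8K gold answers end with '#### <number>'.
        gold = item["answer"].split("####")[-1].strip()
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": item["question"]
                + "\nSolve step by step, then state the final numeric answer.",
            }],
            temperature=0.0,
        )
        predicted = extract_number(response.choices[0].message.content)
        correct += predicted == extract_number(gold)
    return correct / n_samples

if __name__ == "__main__":
    print(f"GSM8K zero-shot accuracy: {evaluate_zero_shot():.2%}")
```

Exact-match scoring on the extracted final number is a common simplification; published evaluations often use stricter normalization of the model's answer format.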
Implementation Barriers
Technical Barrier
Running advanced LLMs imposes high computational demands, making large-scale deployment costly.
Proposed Solutions: Exploration of more efficient model architectures, such as Mixture-of-Experts (MoE) routing, which reduces computational overhead by activating only a subset of parameters per token (a toy MoE layer is sketched below).
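As a rough illustration of the MoE idea, the following sketch implements a tiny top-k gated expert layer in PyTorch. Only k of the expert MLPs run per token, so compute scales with k rather than with the total expert count; all dimensions and the gating scheme are illustrative assumptions, not drawn from any specific model discussed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts."""

    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Pick top-k experts per token and mix their outputs.
        scores = self.gate(x)                            # (tokens, num_experts)
        weights, indices = scores.topk(self.k, dim=-1)   # (tokens, k)
        weights = F.softmax(weights, dim=-1)             # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e             # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(16, 64)      # 16 tokens, 64-dim embeddings
layer = TopKMoE(dim=64)
print(layer(x).shape)        # torch.Size([16, 64])
```

Production MoE implementations add load-balancing losses and batched expert dispatch; this loop-based version only shows the routing logic.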
Performance Barrier
LLMs struggle with specialized tasks, showing inconsistencies in performance and reasoning capabilities.
Proposed Solutions: Development of hybrid reasoning frameworks that combine reinforcement learning with structured step-by-step inference (one inference-side component is sketched below).
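The reinforcement learning component of such a framework is a training-time concern, but the inference side can be sketched simply. Below is a hedged Python example of one widely used structured-inference technique, self-consistency: sample several step-by-step solutions at nonzero temperature and majority-vote on the final answer. The `generate` and `extract_answer` callables are hypothetical stand-ins for any sampling-capable LLM call and answer parser; neither name comes from the paper.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    question: str,
    generate: Callable[[str, float], str],  # hypothetical LLM call: (prompt, temperature) -> text
    extract_answer: Callable[[str], str],   # pulls the final answer out of a reasoning chain
    n_chains: int = 8,
) -> str:
    """Sample several step-by-step solutions and return the majority answer."""
    prompt = question + "\nLet's think step by step."
    answers = []
    for _ in range(n_chains):
        chain = generate(prompt, 0.7)  # temperature > 0 diversifies reasoning paths
        answers.append(extract_answer(chain))
    # Majority vote: agreement across independent chains correlates with correctness.
    return Counter(answers).most_common(1)[0][0]
```

The design intuition is that independent reasoning paths rarely converge on the same wrong answer, so voting filters out many single-chain errors at the cost of extra inference calls.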
Project Team
Researchers: Afrar Jahin, Arif Hassan Zidan, Wei Zhang, Yu Bao, Tianming Liu
Contact Information
For information about the paper, please contact the authors.
Authors: Afrar Jahin, Arif Hassan Zidan, Wei Zhang, Yu Bao, Tianming Liu
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI