Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
Project Overview
This project investigates the application of Large Language Models (LLMs) in education, particularly their handling of culturally adapted mathematical problems. It finds that while LLMs have made significant strides in mathematical reasoning, they often falter on problems that incorporate culturally specific contexts. By creating culturally modified datasets derived from the GSM8K benchmark, the study assesses LLM performance across varied cultural references. The findings show that LLMs perform better on the original datasets than on their culturally adapted counterparts, a discrepancy attributed to biases in the training data and to tokenization challenges. This underscores the need for diverse and inclusive training data to improve the robustness and applicability of LLMs in real-world educational settings. Overall, the work highlights the potential of generative AI in education while pointing out limitations that must be addressed to ensure equitable learning experiences across different cultural backgrounds.
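As a rough illustration of the evaluation setup, the sketch below compares accuracy on paired original and culturally adapted problems. The `ask_model` helper and the sample items are hypothetical placeholders, not the study's actual harness.

```python
from typing import Callable

def accuracy(problems: list[tuple[str, int]],
             ask_model: Callable[[str], int]) -> float:
    """Fraction of problems whose final numeric answer the model gets right."""
    correct = sum(ask_model(q) == expected for q, expected in problems)
    return correct / len(problems)

# Paired items: identical arithmetic, different cultural surface form.
original = [("Sara buys 3 pencils at 2 dollars each. What is the total cost?", 6)]
adapted = [("Amina buys 3 samosas at 2 rupees each. What is the total cost?", 6)]

# With a real model client plugged in as `ask_model`, the accuracy gap
# reproduces the drop observed on the culturally adapted datasets:
# gap = accuracy(original, ask_model) - accuracy(adapted, ask_model)
```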
Key Applications
LLMs for mathematical reasoning
Context: Evaluating the mathematical reasoning abilities of LLMs in culturally adapted contexts for educational purposes.
Implementation: Created culturally adapted datasets based on the GSM8K benchmark, modifying cultural entities while preserving the underlying mathematical logic (see the sketch after this list).
Outcomes: Identified performance drops in LLMs when faced with culturally adapted math problems, highlighting the need for diverse training data.
Challenges: LLMs struggle with cultural references, which leads to inaccurate reasoning, particularly in unfamiliar contexts.
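A minimal sketch of how such an adaptation might be implemented, assuming a simple dictionary-based entity substitution; the mapping and the helper below are illustrative, not the paper's actual pipeline.

```python
import re

# Hypothetical mapping from Western entities to culturally specific
# counterparts; the pairs are for illustration only.
ENTITY_MAP = {
    "Natalia": "Amina",
    "clips": "bangles",
    "dollars": "rupees",
}

def adapt_problem(problem: str, entity_map: dict[str, str]) -> str:
    """Swap cultural entities while leaving every number, and hence the
    underlying mathematical logic, untouched."""
    for source, target in entity_map.items():
        # Whole-word replacement so "clips" does not also match "eclipse".
        problem = re.sub(rf"\b{re.escape(source)}\b", target, problem)
    return problem

original = ("Natalia sold clips to 48 of her friends in April, and then "
            "she sold half as many clips in May. How many clips did "
            "Natalia sell altogether in April and May?")
print(adapt_problem(original, ENTITY_MAP))
# The arithmetic (48 + 24 = 72) is unchanged; only the cultural surface differs.
```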
Implementation Barriers
Technical Barrier
LLMs exhibit biases from training on limited cultural contexts, leading to significant accuracy drops when tested on culturally adapted datasets compared to the original datasets. Differences in how culturally specific terms are tokenized can also impair model understanding and reasoning (see the tokenizer sketch after this block).
Proposed Solutions: Incorporate more diverse and representative training data to improve performance across varied cultural contexts, improve tokenizer coverage of different languages and cultural vocabularies, and refine evaluation approaches to account for cultural variation.
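One way to observe the tokenization effect is with the open-source tiktoken library, as in the minimal sketch below; the encoding name and the term pairs are assumptions for illustration, not values taken from the paper.

```python
import tiktoken

# "o200k_base" is the encoding used by GPT-4o-family models.
enc = tiktoken.get_encoding("o200k_base")

# Each pair plays the same role in a word problem; only the cultural
# surface form differs. Pairs are illustrative, not from the paper.
term_pairs = [
    ("John", "Njabulo"),
    ("pizza", "injera"),
    ("dollars", "rupees"),
]

for western, adapted in term_pairs:
    print(f"{western!r:12} -> {len(enc.encode(western))} token(s)   "
          f"{adapted!r:12} -> {len(enc.encode(adapted))} token(s)")

# Culturally specific terms frequently split into more sub-word pieces,
# giving the model a weaker, fragmented representation of the entity.
```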
Project Team
Aabid Karim
Researcher
Abdul Karim
Researcher
Bhoomika Lohana
Researcher
Matt Keon
Researcher
Jaswinder Singh
Researcher
Abdul Sattar
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18