
Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Project Overview

This project investigates the use of Large Language Models (LLMs) in education, focusing on how they handle culturally adapted mathematical problems. Although LLMs have made significant strides in mathematical reasoning, they often falter when problems incorporate culturally specific contexts. The study builds culturally modified datasets derived from the GSM8K benchmark and assesses LLM performance across the resulting cultural variants. The findings show that LLMs perform better on the original dataset than on its culturally adapted counterparts, a gap attributed to biases in the training data and to tokenization challenges. This underscores the need for diverse, inclusive training data to improve the robustness and applicability of LLMs in real-world educational settings. Overall, the work highlights the potential of generative AI in education while identifying limitations that must be addressed to ensure equitable learning experiences across cultural backgrounds.

Key Applications

LLMs for mathematical reasoning

Context: Evaluating mathematical reasoning abilities of LLMs in culturally adapted contexts for educational purposes.

Implementation: Created culturally adapted datasets based on the GSM8K benchmark, modifying cultural entities (e.g., names, currencies, foods) while preserving the mathematical logic; a minimal sketch of this idea appears after this list.

Outcomes: Identified performance drops in LLMs when faced with culturally adapted math problems, highlighting the need for diverse training data.

Challenges: LLMs struggle with cultural references, leading to inaccurate reasoning, particularly in unfamiliar contexts.
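
To make the adaptation concrete, here is a minimal sketch of the entity-substitution idea, assuming a hypothetical entity map and a toy problem. It illustrates the general approach of preserving every number while swapping cultural surface forms; it is not the paper's actual pipeline.

```python
# Swap culturally specific entities in a GSM8K-style word problem while
# leaving every number and the question itself untouched. The entity
# map below is hypothetical, not the paper's substitution table.
import re

ENTITY_MAP = {
    "Emily": "Fatima",      # person name
    "dollars": "rupees",    # currency
    "cupcakes": "samosas",  # food item
}

def adapt_problem(text: str, entity_map: dict[str, str]) -> str:
    """Replace whole-word cultural entities; digits are never touched."""
    for original, adapted in entity_map.items():
        text = re.sub(rf"\b{re.escape(original)}\b", adapted, text)
    return text

problem = ("Emily bakes 12 cupcakes and sells each for 3 dollars. "
           "How many dollars does she earn?")
print(adapt_problem(problem, ENTITY_MAP))
# -> "Fatima bakes 12 samosas and sells each for 3 rupees. ..."
# The arithmetic (12 * 3 = 36) is unchanged, so any accuracy drop must
# come from the cultural surface form, not the math.
```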

Implementation Barriers

Technical Barrier

LLMs exhibit biases due to training on limited cultural contexts, leading to significant accuracy drops when tested on culturally adapted datasets compared to original datasets. Differences in tokenization for culturally specific terms can also impact model understanding and reasoning.
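The tokenization point can be inspected directly. The sketch below uses the open tiktoken library (assumed installed via pip install tiktoken) to count how many subword tokens a few illustrative terms receive; the terms are examples chosen here, not ones reported in the paper.

```python
import tiktoken

# Tokenizer family used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

# Culturally specific terms often fragment into more subword tokens
# than Western-default terms, which can affect model reasoning.
for term in ["Emily", "Fatima", "dollars", "rupees", "cupcakes", "samosas"]:
    tokens = enc.encode(term)
    print(f"{term!r}: {len(tokens)} token(s) -> {tokens}")
```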

Proposed Solutions: Train on more diverse and representative datasets to improve performance across cultural contexts; improve tokenizer handling of different languages and culturally specific vocabulary; and refine evaluation approaches to account for cultural variation, as sketched below.
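
One hedged sketch of a culture-aware evaluation: score the same model on matched original/adapted problem pairs and report accuracy per variant, so that gaps attributable to cultural surface form become visible. The model_answer callable and the records below are hypothetical stand-ins for a real evaluation harness.

```python
from collections import defaultdict

def accuracy_by_variant(records, model_answer):
    """records: iterable of (variant, question, gold_answer) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for variant, question, gold in records:
        total[variant] += 1
        if model_answer(question).strip() == str(gold):
            correct[variant] += 1
    return {v: correct[v] / total[v] for v in total}

# Dummy model that always answers "36", for demonstration only.
records = [
    ("original", "Emily bakes 12 cupcakes ...", 36),
    ("adapted",  "Fatima bakes 12 samosas ...", 36),
]
print(accuracy_by_variant(records, lambda q: "36"))
# A real harness would flag variants whose accuracy lags the original.
```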

Project Team

Aabid Karim

Researcher

Abdul Karim

Researcher

Bhoomika Lohana

Researcher

Matt Keon

Researcher

Jaswinder Singh

Researcher

Abdul Sattar

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18