Playing with words: Comparing the vocabulary and lexical diversity of ChatGPT and humans

Project Overview

This summary covers a study of how generative AI tools such as ChatGPT bear on education, in particular language learning and vocabulary development. The study compares the vocabulary and lexical diversity of AI-generated text with those of human text: earlier versions of ChatGPT produced text with lower lexical variety, while newer versions produce outputs that can match human levels of diversity. This shift raises questions about how generative AI might influence language evolution and educational practice. The findings suggest that generative AI could enhance language education, but they also call for further research into its effects on language acquisition, linguistic diversity, and learning outcomes.

Key Applications

ChatGPT

Context: Language learning and generation tasks; target audience includes students, educators, and researchers.

Implementation: Comparison of vocabulary and lexical diversity between human responses and ChatGPT-generated responses across various tasks.

Outcomes: Initial findings indicate that ChatGPT uses fewer distinct words and has lower lexical diversity compared to humans, although newer versions show improvement.

Challenges: Vocabulary and lexical diversity vary with the task, the individual writer, the AI model version, and the model configuration.
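The comparison described above rests on two simple quantities: the number of distinct words in a text and a measure of lexical diversity. The following is a minimal sketch of how they could be computed, not the authors' actual method. The function names, the regex tokenizer, and the two sample sentences are illustrative assumptions; the type-token ratio (distinct words divided by total words) is used here as one common diversity measure.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters as word tokens.

    Assumption: a deliberately simple tokenizer. It splits contractions
    such as "don't" into two tokens and ignores digits and punctuation.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def distinct_words(text: str) -> int:
    """Number of distinct word types in the text."""
    return len(set(tokenize(text)))


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for an empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Invented placeholder sentences, not data from the study.
    sample_a = "The quick brown fox jumps over the lazy dog"
    sample_b = "The dog is a good dog and the dog is happy"
    for name, text in [("sample_a", sample_a), ("sample_b", sample_b)]:
        print(name, distinct_words(text), round(type_token_ratio(text), 3))
```

In a real comparison the same functions would be applied to a human answer and a ChatGPT answer to the same task, with the task, model version, and configuration held fixed.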

Implementation Barriers

Technical Barrier

LLMs are limited in their ability to generate diverse, human-like text by their training data and model architecture. Evaluating the vocabulary and diversity of AI-generated text is also challenging.

Proposed Solutions: Developing improved evaluation benchmarks and automated tools to analyze vocabulary and diversity in AI-generated text.
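One practical difficulty for automated tools is that a plain type-token ratio falls as texts get longer, so texts of different lengths are hard to compare. The sketch below shows one length-robust alternative, the moving-average type-token ratio (MATTR), which averages the ratio over fixed-size sliding windows. It is offered only as an illustration of what such a tool might compute. The source does not say which metrics were used, and the function name and default window size are assumptions.

```python
def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average type-token ratio over sliding windows of `window` tokens.

    Assumption: if the text is shorter than the window, fall back to the
    plain type-token ratio of the whole text.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    # One ratio per window position, then the mean of those ratios.
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Because every window has the same size, the score can be compared across a short human answer and a longer generated one, provided both are tokenized in the same way.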

Research Barrier

Comprehensive datasets comparing human and AI-generated text across diverse tasks are lacking, in particular datasets built to evaluate the vocabulary and lexical features of AI tools.

Proposed Solutions: Creating new datasets specifically designed to evaluate vocabulary and lexical features of AI tools.

Cultural Barrier

Potential bias towards dominant languages in AI training datasets, which affects the representation of minority languages.

Proposed Solutions: Implementing strategies to include diverse languages and dialects in training datasets.

Project Team

Pedro Reviriego

Researcher

Javier Conde

Researcher

Elena Merino-Gómez

Researcher

Gonzalo Martínez

Researcher

José Alberto Hernández

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Pedro Reviriego, Javier Conde, Elena Merino-Gómez, Gonzalo Martínez, José Alberto Hernández

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
