
Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development

Project Overview

This paper introduces the Generative AI-enabled Theme Organization and Structuring (GATOS) workflow, which uses open-source generative text models and machine learning techniques to support thematic analysis in social science research, including education research. The workflow aims to automate much of the traditionally labor-intensive qualitative coding process so that researchers can analyze large volumes of text more efficiently. The authors evaluate GATOS in three case studies, in which it produces qualitative codebooks that align closely with those developed through conventional thematic analysis. The findings suggest that generative AI can make inductive codebook development more efficient and scalable and may improve the consistency of thematic interpretations, giving educators and social science researchers a practical tool for analyzing complex qualitative data.

Key Applications

Generative AI-enabled Theme Organization and Structuring (GATOS) workflow

Context: Social science research analyzing large volumes of qualitative text data to identify themes and patterns.

Implementation: The GATOS workflow uses open-source generative text models to summarize text data, cluster similar ideas, and generate a codebook inductively.

Outcomes: Successfully identifies themes and sub-themes that align with traditional qualitative analysis, improving efficiency and scalability.

Challenges: Computational intensity and potential for redundancy in generated codes; reliance on synthetic data for validation.
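
The clustering step named under Implementation can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words vectors stand in for whatever embedding model the workflow actually uses, and the function names, the greedy single-pass strategy, and the 0.5 similarity threshold are all assumptions made for illustration. In a full workflow, each resulting cluster would then be passed to a generative model to propose a code.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding model (assumption): lowercase token counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(summaries, threshold=0.5):
    """Greedy single-pass clustering (illustrative choice, not from the paper).

    Each summary joins the first cluster whose seed (first member) is at
    least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for s in summaries:
        vec = bag_of_words(s)
        for c in clusters:
            if cosine(vec, bag_of_words(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Hypothetical summaries, invented for this example.
summaries = [
    "students feel isolated in large classes",
    "students feel isolated in online classes",
    "advisors provide helpful career guidance",
]
clusters = cluster(summaries)
```

Here the first two summaries share most of their tokens and land in one cluster, while the third starts its own. Lowering the threshold merges more aggressively, which is one lever for reducing redundant codes at the cost of coarser themes.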

Implementation Barriers

Technical barrier

The GATOS workflow is computationally intensive, especially when dealing with large datasets.

Proposed Solutions: Implementing iterative clustering and summarization techniques to reduce the volume of data to be processed.
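
One way to picture the iterative reduction described above is as repeated rounds of batching and condensing until the data is small enough to handle. The sketch below is an assumption about how that loop could look, not the workflow's actual code: `summarize_batch` is a hypothetical placeholder for a generative-model call, and `batch_size` and `target` are illustrative parameters.

```python
def summarize_batch(batch):
    """Hypothetical placeholder for a generative-model call.

    Here it simply keeps the shortest item in the batch.
    """
    return min(batch, key=len)


def reduce_iteratively(items, batch_size=3, target=2):
    """Condense `items` in rounds until at most `target` remain.

    Returns the reduced list and the number of rounds performed.
    """
    if batch_size < 2 or target < 1:
        raise ValueError("batch_size must be >= 2 and target >= 1")
    rounds = 0
    while len(items) > target:
        # Each round replaces every batch with a single condensed item.
        items = [
            summarize_batch(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)
        ]
        rounds += 1
    return items, rounds


# Hypothetical inputs, invented for this example.
notes = [f"note {i} " + "x" * i for i in range(10)]
reduced, rounds = reduce_iteratively(notes)
```

Because each round shrinks the list by roughly a factor of `batch_size`, the number of model calls grows about linearly with the input size instead of requiring every item to be compared with every other.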

Data limitation

The validation of the workflow relies on synthetic datasets, which may not capture the full complexity of human-generated qualitative data.

Proposed Solutions: Testing the workflow on real qualitative datasets collected from human subjects to ensure generalizability.

Bias in AI models

Generative text models may exhibit biases that affect the quality of coding and thematic analysis.

Proposed Solutions: Regularly evaluating and updating the models used in the workflow to mitigate bias.

Project Team

Andrew Katz

Researcher

Gabriella Coloyan Fleming

Researcher

Joyce Main

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Andrew Katz, Gabriella Coloyan Fleming, Joyce Main

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI