A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case
Project Overview
This document explores the integration of generative AI, particularly large language models (LLMs), in education, drawing parallels with their applications in the travel industry for personalized experiences. It highlights advances in fine-tuning techniques such as Quantized Low-Rank Adaptation (QLoRA), which cuts the memory cost of fine-tuning through quantization, and Retrieval-Augmented Fine-Tuning (RAFT), which trains models to ground answers in retrieved documents. Particular attention is given to the LLaMA 2 and Mistral models, showing how they can be adapted for educational purposes to provide tailored insights and support for learners. The findings indicate that while the Mistral RAFT model outperforms the other fine-tuning approaches, traditional evaluation metrics often fail to capture model quality as perceived by humans, underscoring the importance of human evaluation in assessing AI capabilities. The discussion also addresses challenges, including data quality and the need for diverse training datasets, while emphasizing the potential of generative AI to transform educational contexts by simulating real-world interactions and enhancing learning experiences. Despite these hurdles, the document concludes that generative AI holds promise for personalized, context-aware educational applications.
Key Applications
Travel Data Processing and Insights Generation
Context: An AI research application in an educational setting that uses travel-industry data to provide personalized recommendations and insights for travelers, leveraging user-generated content from platforms like Reddit to enhance understanding and engagement in travel-related topics.
Implementation: Fine-tuning LLaMA 2 and Mistral models using the QLoRA and RAFT methodologies. Datasets are sourced from travel discussions and augmented to improve data quality and training efficiency, and the fine-tuned models are further optimized with Reinforcement Learning from Human Feedback (RLHF).
Outcomes: Significantly improved user engagement and recommendation accuracy, enhanced model performance in generating relevant travel insights, faster processing times, and reduced inference costs. The Mistral RAFT RLHF model was recognized as the best performing in its category.
Challenges: Dependence on high-quality training data, computational resource demands, ensuring accurate and contextually relevant responses, and the need for diverse training datasets.
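The RAFT step described above pairs each question with the document that answers it plus sampled "distractor" documents, so the model learns to rely on relevant context rather than memorization. The sketch below is a minimal, hypothetical illustration of that data-construction idea; the function name and example data are not from the paper's actual pipeline.

```python
import random

def build_raft_example(question, oracle_doc, doc_pool, num_distractors=2, seed=0):
    """Pair a question with its oracle document and sampled distractors,
    shuffled so the model cannot learn a positional shortcut."""
    rng = random.Random(seed)
    distractors = rng.sample([d for d in doc_pool if d != oracle_doc], num_distractors)
    context = [oracle_doc] + distractors
    rng.shuffle(context)
    return {"question": question, "context": context, "answer_source": oracle_doc}

docs = [
    "Lisbon: best visited in spring; famous for tram 28.",
    "Tokyo: cherry blossoms peak late March to early April.",
    "Reykjavik: northern lights visible September to April.",
]
example = build_raft_example("When should I visit Tokyo for cherry blossoms?", docs[1], docs)
print(len(example["context"]))  # oracle document plus 2 distractors
```

Each such record would then be serialized into a fine-tuning prompt; shuffling the context prevents the model from learning that the correct document is always listed first.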
Implementation Barriers
Technical & Resource Limitations
Traditional NLP metrics do not capture the complexities of LLM outputs and often misalign with human evaluations. High computational costs and hardware requirements for fine-tuning large models pose further challenges.
Proposed Solutions: Supplementing automated scores with human evaluation and LLM-based metrics (such as using OpenAI's GPT-4 as a judge), and adopting efficient fine-tuning methodologies like QLoRA to reduce the computational burden.
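The misalignment between traditional metrics and human judgment is easy to demonstrate with a toy unigram-F1 scorer (a crude stand-in for BLEU/ROUGE-style overlap, not the paper's actual metric): a paraphrased but correct answer scores lower than a near-verbatim but wrong one.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Token-overlap F1: a simplified proxy for n-gram metrics like BLEU/ROUGE."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "Visit Kyoto in early April to catch the cherry blossoms."
paraphrase = "Cherry blossom season peaks around the first week of April, so plan your Kyoto trip then."
wrong_but_verbatim = "Visit Kyoto in early April to catch the trains."

# The factually wrong near-copy outscores the correct paraphrase.
print(unigram_f1(paraphrase, reference) < unigram_f1(wrong_but_verbatim, reference))  # True
```

A human rater would prefer the paraphrase; the overlap metric prefers the wrong answer, which is exactly the gap that human and LLM-based evaluation are meant to close.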
Data Quality
Effective fine-tuning depends on high-quality, diverse training datasets, which are difficult to source and curate at scale.
Proposed Solutions: Implementing strict quality standards in data collection and possibly utilizing web scraping for real-time data.
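The strict quality standards proposed above could be enforced with a simple filtering pass over collected posts. This is a hypothetical sketch, not the paper's actual cleaning pipeline: it normalizes whitespace, deduplicates case-insensitively, and drops entries that are too short to be useful.

```python
def filter_posts(posts, min_words=5):
    """Deduplicate (case/whitespace-insensitive) and drop too-short entries."""
    seen = set()
    kept = []
    for post in posts:
        text = " ".join(post.split())  # collapse runs of whitespace
        key = text.lower()
        if len(text.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept

raw = [
    "Best time to visit Lisbon is spring, fewer crowds and mild weather.",
    "best time to visit lisbon is spring,  fewer crowds and mild weather.",  # near-duplicate
    "Nice!",                                                                 # too short
    "Tram 28 gets packed by 9am; ride it early or late in the day.",
]
print(len(filter_posts(raw)))  # 2: the duplicate and the short post are removed
```

In a real pipeline this stage would sit between scraping and dataset assembly, and could be extended with language detection or toxicity filters.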
Model Complexity
Challenges in tuning hyperparameters and ensuring efficient processing without noise in outputs.
Proposed Solutions: Experimenting with embedding models, pre-writing prompt templates, and other strategies to improve model predictions.
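Pre-writing prompt templates, as suggested above, can be as simple as a fixed scaffold that injects retrieved context and the user question, constraining the model to grounded answers. The template text and function below are illustrative assumptions, not the paper's actual prompts.

```python
TRAVEL_PROMPT = """You are a helpful travel assistant.
Use ONLY the context below to answer; say "I don't know" if it is insufficient.

Context:
{context}

Question: {question}
Answer:"""

def render_prompt(question, retrieved_docs):
    """Fill the template with numbered retrieved snippets."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return TRAVEL_PROMPT.format(context=context, question=question)

prompt = render_prompt(
    "When do cherry blossoms peak in Tokyo?",
    ["Tokyo: cherry blossoms peak late March to early April."],
)
print("Question:" in prompt and "[1]" in prompt)  # True
```

Numbering the snippets also makes it easy to ask the model to cite which passage supported its answer, one way to reduce noisy or unsupported output.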
Project Team
Sonia Meyer
Researcher
Shreya Singh
Researcher
Bertha Tam
Researcher
Christopher Ton
Researcher
Angel Ren
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Sonia Meyer, Shreya Singh, Bertha Tam, Christopher Ton, Angel Ren
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI