3. Understanding LLMs and Evolution of AI Models
Understanding How LLMs Work
To further illustrate how Large Language Models like GPT-3.5 answer questions, consider the analogy of System 1 and System 2 thinking from psychology. LLMs operate almost entirely in System 1 mode: they generate text reflexively from the input they receive. They process that input by converting each token (which can be loosely thought of as a sub-word) into a vector, then compute through their neural network weights to predict the next token in the sequence.
After predicting one token, the model repeats this entire process to predict the next token, using the previous output as part of the new input. This cycle of generating, processing, and predicting continues with each successive token, ensuring that each word in a sentence is interconnected and draws from the context established by all previous tokens. The model’s attention mechanism plays a crucial role here, allowing it to dynamically focus on different parts of the input text to maintain coherence and relevance across longer sequences.
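The attention mechanism mentioned above can be sketched for a single query vector. This is an illustrative helper, not any library's API: real models use learned projection matrices, many attention heads, and large batched tensors, but the core computation, comparing a query against every key and blending the values by the resulting weights, looks like this:

```python
import math

def attention(query, keys, values):
    """Minimal single-query scaled dot-product attention (after
    Vaswani et al., 2017). A sketch only: real models add learned
    projections, multiple heads, and masking."""
    d = len(query)
    # Score each position by similarity between query and key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns scores into weights: how much focus each position gets.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the value vectors according to those weights.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

Because the weights are recomputed for every new token, the model can shift its focus dynamically as the sequence grows, which is what keeps long outputs coherent.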
This repetitive process, driven by the model's learned patterns from the training data, enables the LLM to generate coherent and contextually appropriate text across a range of topics. However, this same mechanism also explains why LLMs can struggle with tasks requiring precise reasoning, such as complex mathematical calculations. Since the model approaches every problem through probabilistic text prediction rather than deterministic computation, it often leads to errors and inconsistencies, particularly in domains where accuracy is critical.
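The generate-process-predict cycle described above can be sketched as a toy loop. Everything here is illustrative: the six-word vocabulary, the `toy_logits` scoring function, and the sampler are hypothetical stand-ins for what a real model computes with billions of weights, but the loop structure, predict a distribution, sample one token, feed it back in, is the same:

```python
import math
import random

# Hypothetical miniature vocabulary standing in for a real tokeniser.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_logits(context):
    """Stand-in scoring function: slightly favours unseen tokens so the
    example produces varied output. A real LLM computes these scores
    with a neural network."""
    return [1.0 if tok not in context else 0.1 for tok in VOCAB]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, n_tokens, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(n_tokens):
        probs = softmax(toy_logits(context))             # predict a distribution
        next_tok = rng.choices(VOCAB, weights=probs)[0]  # sample one token
        context.append(next_tok)                         # feed it back as input
    return context

print(" ".join(generate(["the"], 5)))
```

Note that nothing in the loop verifies the output: every token is a probabilistic guess conditioned on the tokens before it, which is why errors made early in a response tend to propagate.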
You can think of an LLM's intelligence as operating in a vacuum that the human must guide. Each prompt samples from a probability distribution encoded in the model's weights, and it is up to the user to explore the outputs and judge whether they match their needs and use case. It is understandable that students struggled with these nuances: GPT-3.5 was the first capable, widely available general-purpose chatbot that was not rule-based, so there was no precedent for how to handle it in academic settings, either practically or ethically.
For example, it is not uncommon for LLMs to be unaware of their own capabilities and limitations. Due to reinforcement learning from human feedback, they often exhibit a strong tendency to prioritise pleasing the user over providing accurate responses. Unlike a human expert, who might push back against flawed ideas or suggestions, LLMs may struggle to do so, making it crucial for the user to critically evaluate the model's outputs. This is particularly important because the models themselves do not inherently understand their limitations or the context of their responses beyond what is provided by the user and OpenAI in the system prompt.
Evolution of AI Models
AI models have evolved through two primary paradigms: scaling up the model's size and enhancing algorithmic efficiencies.
1. Scaling Up
The approach of scaling up focuses on increasing the number of parameters within a model, enhancing its capacity to capture and represent complex patterns in the data. The larger the model, the more detailed and nuanced its understanding of the data can be. For instance, GPT-4, released in March 2023, is speculated to have around 1.76 trillion parameters (8 × 220 billion), though not all are used simultaneously: its rumoured Mixture of Experts (MoE) architecture dynamically selects which "experts" are active based on the task at hand. This architecture allows GPT-4 to handle complex reasoning tasks and process large amounts of contextual information more effectively than smaller models.

In contrast, most commercial AI models operate with far fewer parameters, with 70 billion being a common upper limit. Open-weight models such as LLaMA 3.1, whose largest variant has 405 billion parameters, have demonstrated competitive GPT-4-level performance, but running them independently requires substantial hardware. A 70-billion-parameter model could feasibly be run offline by a student with as little as 64 GB of RAM, but only after quantisation, which degrades output quality, and few students have the GPU VRAM needed for acceptable inference speeds; running from system RAM alone is markedly slower.
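A back-of-the-envelope estimate makes the hardware claims above concrete. The helper function below is hypothetical, and the figures are a lower bound covering only the weights themselves (activations and context add overhead on top), but it shows why quantisation is what makes 64 GB of RAM plausible for a 70-billion-parameter model:

```python
def model_memory_gb(n_params_billion, bits_per_weight):
    """Rough lower bound on memory needed just to hold the weights.
    Activations, KV cache, and runtime overhead come on top of this."""
    bytes_per_weight = bits_per_weight / 8
    return n_params_billion * 1e9 * bytes_per_weight / 1024**3

# A 70-billion-parameter model at full 16-bit precision:
print(round(model_memory_gb(70, 16)))  # -> 130  (GB: beyond consumer hardware)
# The same model quantised to 4 bits per weight:
print(round(model_memory_gb(70, 4)))   # -> 33   (GB: fits in 64 GB of RAM)
```

Quantisation trades memory for fidelity: compressing each weight from 16 bits to 4 bits cuts memory by roughly 4x, at the cost of the accuracy loss noted above.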
2. Enhancing Algorithmic Efficiencies
Enhancing algorithmic efficiencies involves optimising the way models process and generate information, allowing them to achieve high performance without necessarily increasing their size. In many cases, these techniques are used to train smaller models that can deliver similar performance to larger ones. Models such as GPT-4o and GPT-4-Turbo have incorporated these efficiencies by leveraging advancements like synthetic data, optimised training processes, and improved computational and fine-tuning methods. These improvements have led to faster processing times and more accurate results, all while keeping the model size manageable.
Due to the scale of these models, GPT-4-level capability was typically offered as a paid service at $20 per month. With the application of these algorithmic efficiencies, however, GPT-4o was made available to free users with usage limits, democratising access to advanced AI capabilities. Similarly, GPT-4o mini, which embodies these algorithmic advancements, delivers impressive performance despite its presumably much smaller parameter count.
It is important to note that GPT-4o was the most advanced model available during our study and was the one used to assess AI performance on assignments.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. https://arxiv.org/abs/1706.03762
- OpenAI. (2023). GPT-4: OpenAI's Most Advanced System. https://openai.com/index/gpt-4/
- OpenAI. (2024). New Models and Developer Products Announced at DevDay. https://openai.com/index/new-models-and-developer-products-announced-at-devday/
- OpenAI. (2024). Hello GPT-4o: Our New Flagship Model. https://openai.com/index/hello-gpt-4o/
- Meta. (2024). Introducing Llama 3.1: Our Most Capable Models to Date. https://ai.meta.com/blog/meta-llama-3-1/