2. The Emergence of ChatGPT and Limitations of GPT-3.5
The Emergence of ChatGPT and Its Impact
ChatGPT, released on 30 November 2022 (OpenAI, 2022), fundamentally altered the educational landscape overnight. Students could suddenly, instantly, and at no cost, obtain answers that far exceeded what the general public then expected of AI assistants, which were widely regarded as basic task managers or search tools such as Siri or Google Assistant. This shift raised immediate concerns about the integrity of academic assessments, particularly in essay-based subjects, where students could generate large portions of assignments, or even entire ones, with minimal effort.
At the time of its release, ChatGPT was powered by a single Large Language Model (LLM): GPT-3.5. This model quickly became synonymous with the ChatGPT brand and remains, according to our study, the most popular version among students nearly two years later, despite having since been replaced as the free default by GPT-4o mini. GPT-3.5, like GPT-4o mini, was always offered free of charge, subject to usage limits. It is an advanced large language model capable of generating and processing natural language. Trained on vast amounts of text data, primarily sourced from the internet, GPT-3.5 can address a wide range of topics with a depth and fluency far beyond the rule-based systems most users had come to expect from virtual assistants. It operates by predicting the next word in a sequence based on statistical patterns learned during training, which allows it to generate human-like text. However, it is important to note that while highly proficient at producing coherent responses, GPT-3.5 functioned, at least initially, as a sophisticated prediction engine for conversation rather than a tool optimised for accurate reasoning or mathematical computation. Its abilities in mathematics and coding were largely emergent, arising from patterns in its training data and reinforced through fine-tuning with human feedback. A toy sketch of this next-word mechanism follows below.
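The prediction loop can be pictured at a miniature scale. The sketch below is our own minimal illustration of greedy next-token generation; the probability table is invented for the example and bears no relation to GPT-3.5's actual vocabulary, weights, or sampling strategy.

```python
# Toy illustration of next-token prediction: the model repeatedly picks the
# most probable continuation from learned statistics; it has no notion of truth.
# The probability table below is invented purely for illustration.
next_token_probs = {
    ("the", "capital"): {"of": 0.9, "city": 0.1},
    ("capital", "of"): {"France": 0.6, "Spain": 0.4},
    ("of", "France"): {"is": 0.95, "was": 0.05},
    ("France", "is"): {"Paris": 0.8, "large": 0.2},
}

def generate(prompt_tokens, steps=4):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        context = tuple(tokens[-2:])        # last two tokens as context
        candidates = next_token_probs.get(context)
        if not candidates:
            break
        # Greedy decoding: always take the highest-probability token.
        tokens.append(max(candidates, key=candidates.get))
    return " ".join(tokens)

print(generate(["the", "capital"]))
# -> "the capital of France is Paris"
```

The point of the toy is that nothing in the loop checks whether "Paris" is true; the model simply emits whichever continuation its training statistics favour.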
Limitations of GPT-3.5 in Mathematical Contexts
Despite the impressive scale of GPT-3.5 (175 billion parameters) and the diversity of its training data, its capabilities, like those of all LLMs, are bounded by the quality and scope of that data, which often includes both accurate and inaccurate information. This limitation particularly affected its performance in mathematical contexts, where rigorous logic and structured reasoning across multiple steps are required. Essentially, LLMs are not calculators; they are neural network prediction engines. This distinction is crucial to understanding why, despite some competence, GPT-3.5 was not well suited to university-level mathematics, even with later improvements (GPT-3.5-turbo). Unlike a calculator, which carries out mathematical operations through precise algorithms, GPT-3.5, like all LLMs, approaches mathematical questions the same way it handles any other query: by predicting the most likely next word or token based on its training data. Because answers to mathematical problems are generated through the same mechanism as all other text, the model often produces errors and inconsistencies on problems that may, counterintuitively, seem trivial, such as counting or basic arithmetic. The sketch below makes this contrast concrete.
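To make the calculator-versus-predictor distinction concrete, the following sketch (our own, deliberately simplistic) contrasts a lookup-based "predictor", a crude stand-in for statistical pattern matching, with an exact computation. The memorised examples are invented for illustration.

```python
# A pattern-based "predictor" answers arithmetic by looking up similar
# examples it has memorised, while Python itself computes exactly.
# The memorised examples are invented for illustration.
memorised = {
    "2 + 2": "4",
    "12 * 12": "144",
    "123 * 456": "56088",
}

def pattern_predict(question):
    # Return a memorised answer if the exact question was "seen in training";
    # otherwise fall back to the nearest-looking example, a crude stand-in
    # for the statistical generalisation an LLM performs.
    if question in memorised:
        return memorised[question]
    nearest = min(memorised, key=lambda q: abs(len(q) - len(question)))
    return memorised[nearest]  # plausible-looking, but often wrong

question = "124 * 456"
print("pattern-based guess:", pattern_predict(question))  # 56088 (wrong)
print("exact computation:  ", eval(question))             # 56544
```

The guess is plausible precisely because it comes from a nearby "training" example, which is also why such errors can be hard to spot.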
As a result, many students who initially experimented with GPT-3.5 developed a negative perception of the capabilities of LLMs generally, and in mathematics in particular. This sentiment arose from the model's frequent inaccuracies and "hallucinations", which eroded trust in its outputs and, in turn, confidence in the broader potential of AI tools in academic settings, especially for tasks requiring precise and reliable answers, reinforcing the perception that AI might be fundamentally flawed for such applications. OpenAI did little to alleviate these concerns, offering only a generic warning at the bottom of every chat that "ChatGPT can make mistakes", which hardly provided the clarity or reassurance students needed to use the tool confidently in their academic work.
Despite these limitations and worries, the majority of students in the Maths and Stats department admitted to using AI models for help with their assignments, with most relying on GPT-3.5 at the time. GPT-3.5 and its subsequent refinements have proven competent in many instances. Anecdotally, mathematics students have noted its apparent ability not only to write effective code but also to execute it, although, in reality, the model is merely predicting the output based on learned patterns. Similarly, it could often "guess" calculations correctly when prompted, without performing actual internal calculations as a calculator would. This phenomenon, where the model produces responses that appear accurate but are not grounded in genuine computation or fact, is known as a "hallucination" (Huang et al., 2023). As a result, students frequently expressed discomfort and found it difficult to trust the model's outputs, particularly in mathematical and technical contexts where precision is crucial. The difference between predicting and executing code is illustrated below.
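The gap between a predicted output and an executed one can be shown directly. In this sketch (our own illustration; the "predicted" value stands in for a hallucinated model answer), the Python interpreter actually runs the snippet, exposing the discrepancy.

```python
# An LLM "running" code is really guessing its output from patterns, whereas
# an interpreter executes it. The predicted output here is an invented,
# pattern-plausible guess standing in for a model's hallucinated answer.
import io
import contextlib

snippet = "print(sum(1 for ch in 'strawberry' if ch == 'r'))"

predicted_output = "2"  # plausible but wrong guess

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(snippet)  # actually execute the code
actual_output = buffer.getvalue().strip()

print("predicted:", predicted_output)  # 2  (hallucinated)
print("actual:   ", actual_output)     # 3  (executed)
```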
References
- OpenAI. (2022). Introducing ChatGPT. Retrieved from https://openai.com/index/chatgpt/
- Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv. https://doi.org/10.48550/arXiv.2311.05232