How ChatGPT Performed on University-Level Work
- Generative AI has an expected score of 67% on mathematics and statistics assignment questions ranging from first-year to fourth-year level.
- 72% of students agree or strongly agree that "AI Usually Gives the Wrong Answer to Mathematics/Statistics Questions", whereas AI was able to solve 67% of questions to a first-class standard.
- 55% of students are using AI tools in assignments.
- In January 2023, there were 764,461 visits to ChatGPT on Warwick University's Wi-Fi, during a period that included 42 online exams.
Results
The graphs below break down how questions were scored; the grading system used is described at the bottom of the page.
Note that the number of questions used in each year is as follows: Year 1 - 57, Year 2 - 14, Year 3/4 - 16.
The expected scores for each year using generative AI are as follows:
Year 1 | Year 2 | Year 3/4
---|---|---
73.82% | 53.50% | 55.38%
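As a quick consistency check, the headline 67% figure matches the mean of these per-year scores weighted by the question counts given above. The short sketch below reproduces that calculation (the variable names are illustrative, not taken from the study's materials):

```python
# Question counts and expected scores per year group, as reported above.
counts = {"Year 1": 57, "Year 2": 14, "Year 3/4": 16}
scores = {"Year 1": 73.82, "Year 2": 53.50, "Year 3/4": 55.38}

# Question-weighted mean across all 87 questions.
total_questions = sum(counts.values())
weighted_mean = sum(counts[y] * scores[y] for y in counts) / total_questions

print(f"{total_questions} questions, expected score {weighted_mean:.1f}%")
# -> 87 questions, expected score 67.2%  (consistent with the headline ~67%)
```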
Based on summary statistics for Statistics department students, the LLM was within a few per cent of the upper quartile for Year 1 questions, and within a few per cent of the lower quartile for Year 2 and Year 3/4 questions.
Anthropic, the company behind the Claude range of LLMs, published test results for multiple LLMs on a variety of question sets.
These indicate that the best LLMs currently on the market can answer 80%+ of undergraduate mathematics questions correctly. The model used in this study, GPT-4o, is reported there to score 88.7% on undergraduate mathematics. This is higher than what was found in this study, but not implausibly so.
A traffic light system (green, yellow, red) was employed to categorise the quality of LLM solutions:
- Green (mark in the range 70%-100%):
  - Indicates a perfect or near-perfect solution.
  - If hypothetically produced by a student, it would demonstrate a complete understanding of the topic, with at most minor errors.
- Yellow (mark in the range 35%-69%):
  - Represents an adequate solution.
  - If hypothetically produced by a student, it would show a good or satisfactory understanding of the topic, with some small errors.
- Red (mark in the range 0%-34%):
  - Signifies a poor solution.
  - If hypothetically produced by a student, it would indicate a lack of understanding of the topic, with large errors.
This grading system provides a clear categorisation of the LLM's performance, and using mark ranges allows for the discrepancies in marking leniency that occur in practice; a minimal sketch of the categorisation is given below.
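For illustration only, the band boundaries above can be expressed as a simple mapping from a percentage mark to a traffic-light category. The `traffic_light` helper is hypothetical and not part of the study's tooling; only the boundaries come from the report:

```python
def traffic_light(mark: float) -> str:
    """Map a percentage mark (0-100) to the traffic-light category used above.

    Boundaries follow the report: green 70-100, yellow 35-69, red 0-34.
    """
    if not 0 <= mark <= 100:
        raise ValueError("mark must be between 0 and 100")
    if mark >= 70:
        return "green"   # perfect or near-perfect solution
    if mark >= 35:
        return "yellow"  # adequate solution with small errors
    return "red"         # poor solution with large errors


# Example marks in each band.
assert traffic_light(88.7) == "green"
assert traffic_light(53.5) == "yellow"
assert traffic_light(20.0) == "red"
```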