
How ChatGPT Performed on University-Level Work


  • Generative AI has an expected score of 67% on mathematics and statistics assignments ranging from first-year to fourth-year level questions.
  • 72% of students agree or strongly agree that "AI usually gives the wrong answer to mathematics/statistics questions", yet the LLM solved 67% of questions to a first-class standard.
  • 55% of students are using AI tools in assignments.
  • In January 2023, there were 764,461 visits to ChatGPT on Warwick University's Wi-Fi, during a period in which 42 online exams took place.
Results

The graphs below break down how questions were scored; the grading system used is described at the bottom of the page.

Percentage of Questions in Each Scoring Category
Percentage of Questions In Each Scoring Category By Year

Note that the number of questions used in each year is as follows: Year 1 - 57, Year 2 - 14, Year 3/4 - 16.

The expected scores for each year using generative AI are as follows:

Year 1    73.82%
Year 2    53.50%
Year 3/4  55.38%

Based on summary statistics for Statistics department students, the LLM scored within a few per cent of the upper quartile on Year 1 questions, and within a few per cent of the lower quartile on Year 2 and Year 3/4 questions.
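As a sanity check, the overall 67% expected score quoted in the summary is consistent with these per-year figures: weighting each year's expected score by its question count (57, 14 and 16 questions respectively) recovers it almost exactly.

```python
# Per-year expected scores (%) and question counts taken from the tables above.
scores = {"Year 1": 73.82, "Year 2": 53.50, "Year 3/4": 55.38}
counts = {"Year 1": 57, "Year 2": 14, "Year 3/4": 16}

# Question-count-weighted average across all 87 questions.
overall = sum(scores[y] * counts[y] for y in scores) / sum(counts.values())
print(round(overall, 2))  # ≈ 67.16, matching the 67% headline figure
```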

Anthropic LLM testing results

Anthropic, the company behind the Claude family of LLMs, published test results for multiple LLMs on a range of benchmark questions.

These results indicate that the best LLMs currently on the market can answer 80%+ of undergraduate mathematics questions correctly. The model used in this study, GPT-4o, is reported to score 88.7% on undergraduate mathematics. This is higher than what was found in this study, but not implausible.

A traffic light system (green, yellow, red) was employed to categorise the quality of LLM solutions:

  1. Green (Mark given would be in the range of 70%-100%):
    • Indicates a perfect or near-perfect solution.
    • If hypothetically produced by a student, it would demonstrate a complete understanding of the topic, with at most minor errors.
  2. Yellow (Mark would be in the range of 35%-69%):
    • Represents an adequate solution.
    • If hypothetically produced by a student, it would show a good or satisfactory understanding of the topic with small errors.
  3. Red (Mark would be in the range 0%-34%):
    • Signifies a poor solution.
    • If hypothetically produced by a student, it would indicate a lack of understanding of the topic with large errors.

This grading system provides a clear categorisation of the LLM's performance, and using mark ranges rather than exact marks allows for the variation in marking leniency that occurs in practice.
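For illustration, the traffic-light banding above can be expressed as a small function. This is only a sketch: the band boundaries (70% and 35%) come directly from the list, while the function name and structure are my own.

```python
def traffic_light(mark: float) -> str:
    """Classify a percentage mark into the green/yellow/red bands above."""
    if not 0 <= mark <= 100:
        raise ValueError("mark must be a percentage between 0 and 100")
    if mark >= 70:
        return "green"   # perfect or near-perfect solution
    if mark >= 35:
        return "yellow"  # adequate solution with small errors
    return "red"         # poor solution with large errors
```

For example, a mark of 72% falls in the green band, while 34% falls in red.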