
How ChatGPT Performed on University-Level Work


  • Generative AI has an expected score of 67% on mathematics and statistics assignments ranging from first-year to fourth-year level questions.
  • 72% of students agree or strongly agree that "AI usually gives the wrong answer to mathematics/statistics questions", yet the LLM solved 67% of questions to a first-class standard.
  • 55% of students are using AI tools in assignments.
  • In January 2023, there were 764,461 visits to ChatGPT on Warwick University's Wi-Fi, during a period in which 42 online exams took place.
Results

The graphs below break down how questions were scored; the grading system used is described at the bottom of the page.

Percentage of Questions in Each Scoring Category
Percentage of Questions In Each Scoring Category By Year

Note that the number of questions used in each year is as follows: Year 1 - 57, Year 2 - 14, Year 3/4 - 16.

The expected scores for each year using generative AI are as follows:

Year 1    73.82%
Year 2    53.50%
Year 3/4  55.38%

Based on summary statistics for Statistics department students, the LLM scored within a few per cent of the upper quartile on Year 1 questions, and within a few per cent of the lower quartile on Year 2 and Year 3/4 questions.
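As a sanity check, the overall 67% expected score quoted in the summary is consistent with these per-year figures: weighting each year's expected score by its question count (57, 14 and 16 questions respectively) recovers it almost exactly.

```python
# Per-year expected scores (%) and question counts taken from the tables above.
scores = {"Year 1": 73.82, "Year 2": 53.50, "Year 3/4": 55.38}
counts = {"Year 1": 57, "Year 2": 14, "Year 3/4": 16}

# Question-count-weighted average across all 87 questions.
overall = sum(scores[y] * counts[y] for y in scores) / sum(counts.values())
print(round(overall, 2))  # ≈ 67.16, matching the 67% headline figure
```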

Anthropic LLM testing results

Anthropic, the company behind the Claude family of LLMs, published test results for multiple LLMs on a range of benchmark questions.

These results indicate that the best LLMs currently on the market can answer 80%+ of undergraduate mathematics questions correctly. The model used in this study, GPT-4o, is reported to score 88.7% on undergraduate mathematics. This is higher than what was found in this study, but not implausible.

A traffic light system (green, yellow, red) was employed to categorise the quality of LLM solutions:

  1. Green (Mark given would be in the range of 70%-100%):
    • Indicates a perfect or near-perfect solution.
    • If hypothetically produced by a student, it would demonstrate a complete understanding of the topic, with at most minor errors.
  2. Yellow (Mark would be in the range of 35%-69%):
    • Represents an adequate solution.
    • If hypothetically produced by a student, it would show a good or satisfactory understanding of the topic with small errors.
  3. Red (Mark would be in the range 0%-34%):
    • Signifies a poor solution.
    • If hypothetically produced by a student, it would indicate a lack of understanding of the topic with large errors.

This grading system provides a clear categorisation of the LLM's performance, and using mark ranges rather than exact marks allows for the variation in marking leniency that occurs in practice.
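For illustration, the traffic-light banding above can be expressed as a small function. This is only a sketch: the band boundaries (70% and 35%) come directly from the list, while the function name and structure are my own.

```python
def traffic_light(mark: float) -> str:
    """Classify a percentage mark into the green/yellow/red bands above."""
    if not 0 <= mark <= 100:
        raise ValueError("mark must be a percentage between 0 and 100")
    if mark >= 70:
        return "green"   # perfect or near-perfect solution
    if mark >= 35:
        return "yellow"  # adequate solution with small errors
    return "red"         # poor solution with large errors
```

For example, a mark of 72% falls in the green band, while 34% falls in red.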