
Assessment of Generative AI Answers From Module Leaders

Methodology

To extend the research, a separate set of answers was generated using the same methodology as before and the same LLM, ChatGPT-4o. Module leaders were asked to assess the outputs against four criteria:

  1. Correctness of Solutions (0-100%): Assess the accuracy of the AI's solutions as if they were human-written and submitted by a student.
  2. Similarity to Student Work (0-100%): Evaluate how closely the AI-generated solutions resemble typical student submissions.
  3. Detectability as AI-Generated (0-100%): Rate how easily the solution can be identified as AI-generated.
  4. Adaptability into Student Work (0-100%): Assess how easily the AI-generated output could be modified into what appears as independent student work.

Lecturers were also free to add comments to help understand their perspectives.
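
To make the rubric concrete, here is a minimal sketch of how a single module leader's response could be recorded; the class, field names and example values below are illustrative only and are not part of the actual survey.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    """One module leader's evaluation of a single AI-generated answer."""
    year: str                  # "1", "2" or "3/4"
    question_type: str         # "Proof" or "Applied"
    correctness: float         # 0-100: marked as if submitted by a student
    similarity: float          # 0-100: resemblance to typical student work
    detectability: float       # 0-100: how easily identified as AI-generated
    adaptability: float        # 0-100: how easily adapted into student work
    comment: Optional[str] = None  # optional free-text remarks


# Illustrative record only -- not a real response from the study.
example = Assessment(
    year="1", question_type="Applied",
    correctness=85, similarity=70, detectability=40, adaptability=75,
    comment="Minor sign error, otherwise well structured.",
)
```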

Data

The data breaks down as follows:

          Year 1   Year 2   Year 3/4   Total
All       5        9        21         35
Proof     2        6        12         20
Applied   3        3        9          15

The questions, as before, were split into two groups: 'Proof', questions that ask to prove or show something, and 'Applied', questions that ask to apply a concept in a more specific situation.

It is important to note the small sample size overall, and that the data is skewed towards years 3/4.

Results

Below are graphs showing the results. The percentages shown are the mean of the corresponding values.
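
As an illustration of how the plotted means can be produced, the sketch below groups per-question correctness scores by year and question type and averages them. The numbers are placeholders, not the study's data.

```python
from collections import defaultdict

# Placeholder (year, question_type, correctness) records -- not the real assessments.
records = [
    ("1", "Proof", 90), ("1", "Applied", 80),
    ("2", "Proof", 55), ("2", "Applied", 60),
    ("3/4", "Proof", 50), ("3/4", "Applied", 65),
]

grouped = defaultdict(list)
for year, qtype, score in records:
    grouped[(year, qtype)].append(score)

# Mean score per (year, question type), as shown in the graphs.
for (year, qtype), scores in sorted(grouped.items()):
    mean = sum(scores) / len(scores)
    print(f"Year {year:>3} | {qtype:<7} | mean: {mean:.1f}%")
```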

Correctness of the LLM answers.

The LLM performed exceptionally well in year 1 compared to years 2 and 3/4. Year 1 is the only year where the LLM answered proof questions better than applied questions.

For years 2 and 3/4, the LLM's answers were approximately the same in terms of average mark achieved.

The year 1 questions were simple in comparison to those of later years, which is why the average mark achieved was high relative to the other years. In the year 1 applied question, the LLM made a trivial sign error but noticed the mistake later in its answer.

In years 2 and 3/4, the LLM's answers for proofs were noted in comments to often be vague and missing details.

It is important to note the small sample size, especially for years 1 and 2.

For years 2 and 3/4, the similarity scores for the LLM's answers to proof and applied questions are close. This suggests that for these years the type of question is unlikely to have a big effect on how similar the LLM's answer is to a student's work.

For year 1, the difference is large, with the LLM's answers to applied questions being noticeably more similar to students' work. The similarity for year 1 proof questions is quite low, at only 30%. For year 1 proofs, comments noted that explanations of rudimentary steps were excessive, while details for more complex steps were lacking.

Similarity of the LLM's output to students' work.
Detectability of work being produced by an LLM.

Comments from lecturers highlighted that the LLM's answers were often verbose, with unusual grammar and lexicon.

For year 2, the LLM's answers contained the 'giveaways' mentioned previously, but the mistakes made tended to be similar to those a confused student would make.

LLM answers that were given high values for detectability tended to be very poor in logic and in some cases did not answer the question.

LLM answers which were given lower values for detectability were often described as good student answers. Short sentences and poor grammar were commented on as reducing detectability.

Module leaders noted that it would often be trivial for students to adapt answers so that they were undetectable. The 'giveaways' tended to be very obvious, and students are competent enough to identify and resolve them by cutting unnecessary phrases and using a more standard style of writing.

In some cases, the mistakes made were similar to those a student would make, meaning that if a student were to use and adapt the output, the mistake would not raise any flags. However, many of the mistakes were very poor and often showed no logic, which would raise flags.

LLM answers with lower adaptability values often had missing or incorrect logic. The LLM's answers to the harder parts of a question could also be overly complicated, making them harder for a student to decipher and adapt.

Adaptability of an LLM's answer into student work.

Combining With Previous Data

Percentage of questions in each scoring category.

The graph on the right comes from combining the data from evaluating the proficiency of general artificial intelligence with the new data above. The data breaks down as follows:

                     Year 1   Year 2   Year 3/4
Number of Questions  62       23       37

This is an improvement on the previous data set from evaluating the proficiency of general artificial intelligence, where there were 57 year 1 questions, 14 year 2 questions and 16 year 3/4 questions.
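
As a quick consistency check, the combined counts are exactly the previous counts plus the newly assessed questions from the Data section above (5, 9 and 21 questions for years 1, 2 and 3/4 respectively):

```python
# Question counts per year: previous data set, new assessments above, combined total.
previous = {"Year 1": 57, "Year 2": 14, "Year 3/4": 16}
new      = {"Year 1": 5,  "Year 2": 9,  "Year 3/4": 21}
combined = {"Year 1": 62, "Year 2": 23, "Year 3/4": 37}

for year in previous:
    assert previous[year] + new[year] == combined[year]
    print(f"{year}: {previous[year]} + {new[year]} = {combined[year]}")
```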

The data indicates that generative AI excels at year 1 questions, with 75.81% of answers in the green category, a higher proportion than in the other years. Year 2 was the only year with a higher percentage of answers in the red category than in the green.

When comparing to the previous data, year 1 had approximately the same percentage of questions in the green, yellow and red categories.

Year 2 saw approximately a 7% decrease in questions in the green category and approximately a 3% decrease in questions in the yellow category, with approximately a 10% increase in questions in the red category.

Year 3/4 saw approximately a 7% decrease in questions in the green category with approximately a 14% increase in questions in the yellow category and a 7% decrease in questions in the red category.

The changes are for the most part not unexpected given the increase in sample size. However, the number of questions in the yellow and red categories in year 2 is interesting. It is important to note that year 2 has the smallest sample size, which is a likely reason why the data appears as it does.

An interesting point of comparison is the expected scores:

Expected Score       1st Year   2nd Year   3rd/4th Year
From Past Data       73.82%     53.5%      55.38%
From Combined Data   74.19%     48.09%     55.86%

The expected scores are relatively similar for years 1 and 3/4, with year 2 showing the biggest decrease. This is likely due to the increased sample size, with the new expected scores giving a more accurate reflection of generative AI's ability on these questions.
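
For reference, here is a minimal sketch of how an expected score of this kind could be computed, under the assumption that it is simply the mean of the per-question marks for a given year; the marks below are placeholders, and the exact weighting used for the real figures may differ.

```python
# Placeholder per-question marks (0-100%) for one year -- not the study's data.
marks = [70, 55, 40, 30, 65, 50, 45, 60, 25]

# Expected score as the sample mean of the per-question marks (assumed definition).
expected_score = sum(marks) / len(marks)
print(f"Expected score: {expected_score:.2f}%")
```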

The expected scores reflect the need for students to be able not only to verify outputs but to critically analyse them, to assess whether an answer is as good as it can be. It is likely that, with improvements in abstraction and reasoning, these results will change and become more student-like, but, much like a student's work, they will still not be perfect.