Assessment of Generative AI Answers From Module Leaders
Methodology
As a further research point, separate answers were generated using the same methodology as before and the same LLM, ChatGPT-4o. Module leaders were asked to assess the outputs and evaluate them against four criteria:
- Correctness of Solutions (0-100%): Assess the accuracy of the AI's solutions as if they were human-written and submitted by a student.
- Similarity to Student Work (0-100%): Evaluate how closely the AI-generated solutions resemble typical student submissions.
- Detectability as AI-Generated (0-100%): Rate how easily the solution can be identified as AI-generated.
- Adaptability into Student Work (0-100%): Assess how easily the AI-generated output could be modified into what appears as independent student work.
Lecturers were also free to add comments to give further insight into their perspectives.
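As an illustration only, each completed assessment could be recorded as a simple structure like the Python sketch below; the field names are hypothetical and not taken from the actual data collection.

```python
# Hypothetical sketch of one module leader's assessment of a single AI answer.
# Field names are illustrative; ratings follow the 0-100% scales listed above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Assessment:
    year: str                  # "1", "2" or "3/4"
    question_type: str         # "Proof" or "Applied"
    correctness: float         # accuracy as if submitted by a student
    similarity: float          # resemblance to typical student work
    detectability: float       # ease of identifying the answer as AI-generated
    adaptability: float        # ease of adapting the output into student work
    comment: Optional[str] = None  # optional free-text remarks from the lecturer
```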
Data
The data breaks down as follows:
| | Year 1 | Year 2 | Year 3/4 | Total |
| --- | --- | --- | --- | --- |
| All | 5 | 9 | 21 | 35 |
| Proof | 2 | 6 | 12 | 20 |
| Applied | 3 | 3 | 9 | 15 |
The questions were, as before, split into two groups: 'Proof', questions that ask to prove or show something, and 'Applied', questions that ask to apply a concept in a more specific situation.
It is important to note the small overall sample size and that the data is skewed towards year 3/4.
Results
Below are graphs showing the results. The percentages shown are the mean of the corresponding values.
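As a minimal sketch only (assuming assessment records like the hypothetical `Assessment` structure above), the plotted group means could be computed along these lines:

```python
# Sketch: mean of a chosen criterion, grouped by year and question type.
# `assessments` is assumed to be a list of Assessment records as sketched earlier.
from collections import defaultdict
from statistics import mean

def mean_by_group(assessments, criterion):
    groups = defaultdict(list)
    for a in assessments:
        groups[(a.year, a.question_type)].append(getattr(a, criterion))
    return {group: mean(values) for group, values in groups.items()}

# e.g. mean_by_group(assessments, "correctness") maps ("1", "Proof") to its mean mark
```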
The LLM performed exceptionally well in year 1 compared to years 2 and 3/4. Year 1 is the only year where the LLM answered proof questions better than applied questions.
For years 2 and 3/4, the LLM's answers achieved approximately the same average mark.
The year 1 questions were simple in comparison to those from later years, which is why the average mark achieved was high relative to the other years. In one year 1 applied question, the LLM made a trivial sign error but noticed the mistake later in its answer.
In years 2 and 3/4, lecturers' comments noted that the LLM's answers to proof questions were often vague and missing details.
It is important to note the small sample size, especially for years 1 and 2.
For years 2 and 3/4, the similarity scores for the LLM's answers to proof and applied questions are close, meaning the type of question is unlikely to have a large effect on how closely the LLM's answer resembles a student's work.
For year 1, the difference is large, with the LLM's answers to applied questions being more similar to students' work; its answers to proof questions scored only 30% for similarity. For year 1 proofs, comments noted that explanations of rudimentary steps were excessive while details of the more complex steps were lacking.
Comments from lecturers highlighted the LLM's answers as often being verbose and having unusual grammar and lexicon.
The LLM's year 2 answers contained the 'giveaways' mentioned previously, but the mistakes made tended to be similar to those a confused student would make.
LLM answers that were given high values for detectability tended to be very poor in logic and in some cases did not answer the question.
LLM answers which were given lower values for detectability were often described as good student answers. Short sentences and poor grammar were commented on as reducing detectability.
Module leaders noted that it would often be trivial for students to adapt answers so that they would be undetectable. The 'giveaways' tended to be very obvious, and students would be competent enough to identify and resolve them by cutting unnecessary phrases and using a more standard style of writing.
In some cases the mistakes made resembled student errors, meaning that if a student were to use and adapt the output the mistake would not raise any flags. However, many mistakes were very poor and showed little logic, which would raise flags.
LLM answers with lower adaptability values often had missing or incorrect logic. The LLM's answers to the harder parts of a question could also be overly complicated, making them harder for a student to decipher and adapt.
Combining With Previous Data
The graph on the right combines the data from the earlier evaluation of generative AI's proficiency with the new data above. The data breaks down as follows:
| | Year 1 | Year 2 | Year 3/4 |
| --- | --- | --- | --- |
| Number of Questions | 62 | 23 | 37 |
The data indicates that generative AI excels at year 1 questions, with 75.81% of answers in the green category, a higher proportion than in the other years. Year 2 was the only year with a higher percentage of answers in the red category than in the green.
Compared with the previous data, year 1 had approximately the same percentage of questions in the green, yellow and red categories.
Year 2 saw approximately a 7% decrease in questions in the green category and approximately a 3% decrease in questions in the yellow category, with approximately a 10% increase in questions in the red category.
Year 3/4 saw approximately a 7% decrease in questions in the green category with approximately a 14% increase in questions in the yellow category and a 7% decrease in questions in the red category.
The changes are for the most part not unexpected given the increase in sample size. However, the number of year 2 questions in the yellow and red categories is interesting. It is important to note that year 2 has the smallest sample size, which is a likely reason why the data appears as it does.
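As a hedged sketch only (assuming each answer has already been assigned a green/yellow/red label; the thresholds behind those labels are not reproduced here), the per-year category percentages could be tallied like this:

```python
# Sketch: percentage of answers in each category (green/yellow/red) per year.
# `combined` is assumed to be a list of (year, category) pairs drawn from both datasets.
from collections import Counter, defaultdict

def category_percentages(combined):
    by_year = defaultdict(Counter)
    for year, category in combined:
        by_year[year][category] += 1
    return {
        year: {cat: 100 * n / sum(counts.values()) for cat, n in counts.items()}
        for year, counts in by_year.items()
    }
```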
An interesting point of comparison is the expected scores:
| Expected Score | 1st Year | 2nd Year | 3rd/4th Year |
| --- | --- | --- | --- |
| From Past Data | 73.82% | 53.5% | 55.38% |
| From Combined Data | 74.19% | 48.09% | 55.86% |
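For illustration only, if the expected score for a year is taken to be the mean mark across that year's questions (the exact definition used in the report may differ), then the combined figure behaves like a sample-size-weighted average of the past and new means, as in this sketch:

```python
# Sketch: combined expected score as a sample-size-weighted average of two means.
# This assumes the expected score is a simple mean mark; the report's exact
# calculation may differ, and the values in the example call are made up.
def combined_expected(past_mean, past_n, new_mean, new_n):
    return (past_mean * past_n + new_mean * new_n) / (past_n + new_n)

# Illustrative call with made-up values:
# combined_expected(past_mean=55.0, past_n=16, new_mean=58.0, new_n=21)
```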
The expected scores are relatively similar for years 1 and 3/4, with year 2 showing the biggest decrease. This is likely due to the increased sample size, with the new expected scores giving a more accurate reflection of generative AI's ability on these questions.
The expected scores reflect the need for students to not only verify outputs but also critically analyse them, assessing whether an answer is as good as it can be. With improvements in abstraction and reasoning, these results are likely to change and become more student-like, but, much like a student, they will still not be perfect.