Critically Analysing Outputs
Case Study: Critically Analysing Outputs
Below is an example of an LLM output taken from the analysis of LLMs' ability to answer mathematics and statistics questions. The question comes from first-year statistics.
Question
The LLM was asked the question above and gave the output seen on the right.
A student could be given the question and the LLM output and then asked to critically evaluate the output. Critically evaluating an LLM's output means not just checking whether the final answer is right but also evaluating the logic, that is, the approaches taken and the steps shown.
LLM Output
Below is an example of a critical evaluation of the LLM output above.
For question 1, the first thing that should stand out is that the probability is given as 25/12, which cannot be correct because it is greater than 1. The output then approximates 25/12 as 0.6944, which is less than 1 and does not match the fraction (25/12 is roughly 2.08, whereas 0.6944 corresponds to 25/36). The approach to calculating the probability is also incorrect: it appears to take the probability of one die showing any value and the other two showing any value other than the first, but this does not account for the requirement that one die must show the unique highest value.
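The checks described for question 1 can be reproduced with a short brute-force enumeration. This is only a sketch: the question itself is not reproduced in the text, so the code assumes three fair six-sided dice (suggested by the reference to "the other two"), and the names `has_unique_highest` and `differs_from_first` are illustrative, not taken from the module.

```python
from fractions import Fraction
from itertools import product

# Assumption: three fair six-sided dice; the original question is not shown.
NUM_DICE = 3
FACES = range(1, 7)


def has_unique_highest(roll):
    """True when exactly one die shows the maximum value of the roll."""
    return roll.count(max(roll)) == 1


def differs_from_first(roll):
    """One reading of the LLM's approach: the other dice differ from the first."""
    return all(value != roll[0] for value in roll[1:])


outcomes = list(product(FACES, repeat=NUM_DICE))
total = len(outcomes)

# Exact probability of a unique highest value under the stated assumption.
p_unique_highest = Fraction(sum(has_unique_highest(r) for r in outcomes), total)

# Probability of the event the LLM appears to have counted instead.
p_llm_event = Fraction(sum(differs_from_first(r) for r in outcomes), total)

# Sanity checks on the values quoted in the LLM output.
claimed = Fraction(25, 12)
print("25/12 exceeds 1:", claimed > 1)
print("25/36 as a decimal:", float(Fraction(25, 36)))
print("Event counted by the LLM (assumed reading):", p_llm_event)
print("Unique highest value:", p_unique_highest, float(p_unique_highest))
```

Under these assumptions the enumeration gives 165/216 = 55/72 for a unique highest value, while the event the LLM appears to have counted gives 25/36. This is consistent with the evaluation's point that the approach ignores the "highest" requirement.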
For question 2, the error in the probability of a unique highest number is carried over. The output correctly identifies that the complement of the event should be used to calculate the probability of no unique highest number occurring, but 1 - 25/12 = -13/12, not the 11/36 it states, and -13/12 is not a feasible probability. Whilst it correctly states that the probability of no unique highest number in n trials is the probability of no unique highest number in one trial raised to the power n, it does not identify that this relies on the independence of the events, as the question asked it to. The steps taken were large and key information is missing.
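The missing independence step for question 2 could be set out along the following lines. This is a sketch in notation introduced here rather than taken from the module: A_i denotes the event that trial i produces a unique highest number, and p = P(A_i) is assumed to be the same for every trial.

```latex
\begin{align*}
P(\text{no unique highest number in } n \text{ trials})
  &= P(A_1^c \cap A_2^c \cap \dots \cap A_n^c) \\
  &= \prod_{i=1}^{n} P(A_i^c)
     && \text{(independence of the trials)} \\
  &= (1 - p)^n
     && \text{(complement rule, same } p \text{ for every trial)}
\end{align*}
```

Writing the second line explicitly is what makes clear where independence is used.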
For question 3, the output now appears to take the probability of a unique highest number occurring as 25/36, obtained as the complement of the event that no unique highest number occurs, which was given as 11/36. The geometric distribution is not covered in the module, so it is not an appropriate method to use. It would have been more appropriate to use the independence of the events to show that the joint probability is equal to the product of the probabilities. Because every trial in which no unique highest number is obtained has the same probability, the product of those terms can be rewritten as the probability of no unique highest number occurring raised to the power n-1.
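The independence argument suggested for question 3 might look like the sketch below. It assumes the question asks for the probability that the first unique highest number occurs on trial n, as the reference to the geometric distribution suggests. The notation is introduced here for illustration: A_i is the event that trial i produces a unique highest number, p = P(A_i) for every trial, and N is the trial on which a unique highest number first occurs.

```latex
\begin{align*}
P(N = n)
  &= P(A_1^c \cap \dots \cap A_{n-1}^c \cap A_n) \\
  &= \left( \prod_{i=1}^{n-1} P(A_i^c) \right) P(A_n)
     && \text{(independence of the trials)} \\
  &= (1 - p)^{n-1} \, p
     && \text{(same } p \text{ for every trial)}
\end{align*}
```

This reaches the same expression as the geometric distribution without relying on a result from outside the module.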
For question 4, the probability that the values are not all the same is correct, but this event includes the possibility of a unique highest value occurring, hence the final answer is wrong. The logic is correct but very poorly explained: large steps are missing, and definitions and theorems are not stated where they are applied, as would be expected.
Across the answers as a whole, many of the jumps from one step to the next were too large, and more steps and detail are required. Furthermore, the question asked for it to be made clear where independence was used, and the output did not do this. Definitions and theorems were also not identified where they were used. The notation, logic and layout of the answer are not of the standard expected of the module or of a university student.
In addition, the student should be able to answer the question correctly and at a level appropriate to the module and to undergraduate study. This could be set as an additional question.