4. Extending LLM Capabilities and Introduction of ChatGPT o1

Extending LLM Capabilities through Code Execution

While LLMs are not calculators and do not perform arithmetic directly, they can generate and execute code that performs calculations and then use the results as part of their responses. For example, when presented with a mathematical problem, an LLM might generate Python code that solves the problem, execute the code in an external environment, and then include the calculated result in its final output. This process allows the model to handle tasks that require precise computation, bridging the gap between its natural language processing capabilities and more technical, computation-driven requirements. Through this method, LLMs extend their utility into areas where exact numerical accuracy is essential, even though they do not inherently perform these calculations themselves.
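To make this concrete, the sketch below shows the pattern in miniature: a Python snippet that an LLM might plausibly generate for a question such as "What is the 20th Fibonacci number?", executed by the host application in a separate interpreter. Both the orchestration code and the "generated" snippet are illustrative assumptions, not output from any specific model or product.

```python
import subprocess
import sys
import tempfile

# Code an LLM might generate when asked for the 20th Fibonacci number,
# instead of attempting the arithmetic token by token (illustrative only).
generated_code = """
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(20))
"""

# The host application writes the snippet to a file and runs it in a
# separate interpreter (a production system would use a locked-down
# sandbox rather than the local Python installation).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(generated_code)
    script_path = f.name

result = subprocess.run([sys.executable, script_path],
                        capture_output=True, text=True)
print("Computed result:", result.stdout.strip())  # -> 6765
```

The exact value is then handed back to the model, which incorporates it into its natural-language answer.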

Despite these advancements, such techniques were not utilised in assessing AI capabilities in our study, which employed a zero-shot approach as an initial baseline: the models were tested without any additional guidance, tools, or advanced prompting techniques. It is worth noting, however, that tools and more sophisticated methods, such as chain-of-thought (CoT) prompting, often improve LLM performance on complex reasoning tasks. Chain-of-thought prompting involves instructing the model to work through a problem step by step, which can enhance its ability to tackle tasks requiring more nuanced reasoning. Although this approach was not part of our initial assessment, it is known to significantly boost the accuracy and reliability of AI outputs in scenarios requiring detailed, logical progression.
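For illustration, here is a minimal chain-of-thought prompt in the style of Wei et al. (2022), where the prompt includes one worked example whose visible reasoning the model is encouraged to imitate. The wording is a representative sketch, not a prompt used in our study.

```python
# Few-shot chain-of-thought prompt: the worked example demonstrates
# the step-by-step reasoning style the model should follow.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls does
he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought
6 more. How many apples do they have now?
A:"""

# Sent to the model as-is; a CoT-capable model should reply with its
# working followed by "The answer is 9."
print(cot_prompt)
```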

Advancements in System 2 Thinking and the Introduction of ChatGPT o1

On 12 September 2024, OpenAI announced:

"We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and maths.

Today, we are releasing the first of this series in ChatGPT and our API. This is a preview, and we expect regular updates and improvements."

These new models, GPT-o1, are trained to spend more time thinking through problems before they respond, much like a person would. Through ongoing training, they learn to refine their thinking process, try different strategies, and recognise their mistakes. This marks a significant advance over previous models like GPT-4o, which, while capable, could still suffer from common LLM challenges such as losing awareness of broader context, catastrophic forgetting in longer conversations, and hallucinations. These issues remain, to some extent, inherent to all LLMs, including o1.

While GPT-4 was enhanced with tools like code interpreters and Retrieval Augmented Generation (RAG) to work with user files, the introduction of ChatGPT o1 represents a significant shift towards more advanced "System 2" thinking, crucial for complex tasks that require sustained and focused reasoning. Although o1 is still based on a specialised version of GPT-4, it introduces a new layer of functionality by integrating chain-of-thought (CoT) reasoning directly into its operations, rather than relying solely on prompting techniques.
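For context, the RAG pattern mentioned above can be sketched in a few lines: embed the user's documents, retrieve the chunk most similar to the question, and ground the prompt in it. The toy `embed` function below is a stand-in for a real embedding model, so treat the whole listing as an illustration of the pattern rather than how any OpenAI product implements it.

```python
from math import sqrt

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding model: a toy bag-of-letters
    # vector, used here only to make the sketch runnable.
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

documents = ["The invoice total for March was £4,210.",
             "Our refund policy allows returns within 30 days.",
             "Server maintenance is scheduled for Friday night."]

question = "What was the March invoice total?"
q_vec = embed(question)

# Retrieve the most relevant chunk and ground the prompt in it, so the
# model answers from the user's files rather than from memory alone.
best = max(documents, key=lambda d: cosine(embed(d), q_vec))
prompt = f"Answer using this context:\n{best}\n\nQuestion: {question}"
print(prompt)
```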

In previous models, chain-of-thought reasoning—where the AI is prompted to think through a problem step by step—was typically initiated by the user through specific prompts. However, with ChatGPT o1, this process has been embedded into the model's architecture and training. The model was trained to autonomously generate and follow step-by-step reasoning processes as part of its natural problem-solving workflow. This internal CoT mechanism allows the model to break down complex tasks into smaller, manageable steps, systematically working through each part to arrive at a more accurate solution.
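In practical terms, the difference shows up in how the two generations of models are called. The hedged sketch below uses the OpenAI Python SDK; the model identifiers ("gpt-4o", "o1-preview") match the preview release but may change, and running it requires an API key.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
problem = "How many prime numbers lie between 100 and 150?"

# GPT-4-class models: step-by-step reasoning must be requested
# explicitly in the prompt.
prompted = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": problem + " Think step by step before answering."}],
)

# o1: the model generates its own internal chain of reasoning before
# answering, so no step-by-step instruction is needed. (The preview
# also restricted options such as system prompts and temperature.)
reasoned = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": problem}],
)

print(prompted.choices[0].message.content)
print(reasoned.choices[0].message.content)
```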

Additionally, ChatGPT o1 incorporates secondary models specifically trained to review and check for errors in reasoning. These models act as internal validators, ensuring that the reasoning steps generated by the primary model are coherent and logical. This multi-step process allows o1 to maintain a higher level of accuracy and reliability in its outputs, reducing the likelihood of errors or “hallucinations” that might arise from flawed reasoning. By embedding these capabilities into the model itself, ChatGPT o1 offers a more robust approach to handling complex tasks, moving beyond the limitations of external tools and prompting methods.
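OpenAI has not published o1's internals, so the following is only a toy illustration of the generate-and-verify pattern the paragraph describes: a stand-in "reasoner" proposes candidate chains of steps and a stand-in "validator" scores them, with the best-scoring chain kept. Every function here is a hypothetical placeholder for what would, in o1, be a trained model.

```python
import random

def reasoner(problem: str, seed: int) -> list[str]:
    # Stand-in for the primary model: emits one candidate chain of
    # reasoning steps. The fake coin flip simulates sampling variation.
    rng = random.Random(seed)
    final = 42 if rng.random() < 0.5 else 41
    return [f"Restate the problem: {problem}",
            "Decompose it into sub-calculations.",
            f"Combine the partial results: answer = {final}."]

def validator(steps: list[str]) -> float:
    # Stand-in for the checker model: scores a chain's coherence.
    # Here it simply rewards the chain containing the correct toy answer.
    return 1.0 if any("42" in step for step in steps) else 0.0

def solve(problem: str, n_candidates: int = 4) -> list[str]:
    # Sample several candidate chains, keep the best-verified one.
    candidates = [reasoner(problem, seed) for seed in range(n_candidates)]
    return max(candidates, key=validator)

print("\n".join(solve("What is 6 x 7?")))
```

The point of the sketch is only the division of labour between proposing and checking reasoning; in o1 itself, both roles are learned behaviours rather than hand-written functions.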

Mathematician Terence Tao shared his experiences with the o1 model:

"I have played a little bit with OpenAI's new iteration, GPT-o1, which performs an initial reasoning step before running the LLM. It is certainly a more capable tool than previous iterations, though still struggling with the most advanced research mathematical tasks.

... the new model could work its way to a correct (and well-written) solution if provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes. The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student."

This suggests that while o1 is not yet at the level of a human expert, it represents a significant improvement in reasoning capability compared to previous models. Unlike earlier models, which might frequently escape their "vacuum of coherence" or struggle to correct their errors, o1 demonstrates an enhanced ability to "reason" through complex tasks, thanks to its built-in reasoning steps.

However, the key aspect of professional LLM usage remains the need for a competent human to steer the process and validate the output, one who is informed about the nuances and limitations of the selected model for their use case.

References

  1. OpenAI. (2024). Code Interpreter Beta. Retrieved from https://platform.openai.com/docs/assistants/tools/code-interpreter
  2. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv. https://arxiv.org/abs/2312.10997
  3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv. https://arxiv.org/abs/2201.11903
  4. OpenAI. (2024). Introducing OpenAI o1-preview: A New Series of Reasoning Models for Solving Hard Problems. Retrieved from https://openai.com/index/introducing-openai-o1-preview/
  5. Tao, T. (2024). Experiments with GPT-o1. Retrieved from https://mathstodon.xyz/@tao/113132502735585408