Self-improvement of Large Language Models by Aligning Generation with Planning - Haijiang Yan
One of the most striking capabilities of Large Language Models (LLMs) is their apparent ability to refine outputs through a process of self-improvement. However, the mechanisms by which an autoregressive model acquires this capacity remain poorly understood, limiting the development of principled approaches to self-improvement. Drawing on a substantial body of prior work, we propose a mechanistic model of LLM self-improvement, grounded in a Bayesian perspective on token generation. In this view, LLMs maintain latent plans for future token generations, yet their realizations are initially biased by token-level priors encoded from training data. As more self-generated tokens are autoregressively incorporated back into the context, the generations gradually stabilize and become progressively debiased. Thus, incorporating self-generated sequences into context leads to more rational and coherent output. Building on these insights, we introduce self-play Markov Chain Monte Carlo (spMCMC), an extension of MCMC-with-People designed to elicit reliable and fine-grained reward signals for self-improvement in open-ended text generation.
The method proceeds in two stages. In the random generation stage, the LLM is prompted to produce as many random answers to the same query as possible, forming a sample space. In the sampling stage, the LLM's choice serves as the acceptance function in the Metropolis–Hastings algorithm: in each run, the LLM is prompted to make a binary choice between two candidate passages of text. After convergence, the sequence of selected passages can be interpreted as samples from the LLM's internal representation of a rational plan.
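The sampling stage can be sketched as a Metropolis-style chain in which the LLM's pairwise choice plays the role of the acceptance step. This is an illustrative sketch, not the project's implementation: the names (`spmcmc`, `llm_prefers`) and the uniform proposal over the sample space are assumptions.

```python
import random

def spmcmc(sample_space, llm_prefers, n_iters=5000, seed=0):
    """Sketch of the spMCMC sampling stage (names are illustrative).

    sample_space -- candidate passages from the random generation stage
    llm_prefers(current, proposal) -- True if the LLM picks the proposal
    Returns visit counts over the sample space (the chain's samples).
    """
    rng = random.Random(seed)
    current = rng.choice(sample_space)
    counts = {passage: 0 for passage in sample_space}
    for _ in range(n_iters):
        # Propose a candidate uniformly at random from the sample space.
        proposal = rng.choice(sample_space)
        # The LLM's binary choice acts as the acceptance step:
        # if the LLM prefers the proposal, the chain moves there.
        if llm_prefers(current, proposal):
            current = proposal
        counts[current] += 1
    return counts
```

After many iterations, the normalized visit counts give the frequency distribution over passages that the subsequent analysis compares against human ratings.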
As the intended outcome, we hypothesized that spMCMC would uncover higher-quality passages more reliably than other self-evaluation approaches, aligning more closely with aggregated human ratings.
Project Outcomes:
To verify whether spMCMC can effectively identify signals for self-improvement in text generation, we applied it to a joke-creation task, testing whether spMCMC aligns more closely with people's subjective judgments of joke funniness than other self-evaluation methods. In total, 237 unique jokes were generated by the Llama-3.1-8B-Instruct model during the random generation stage, forming the sample space for all LLM-based evaluation methods. Running spMCMC with the same model for 5000 iterations yielded an updated frequency distribution over this sample space, with some jokes selected far more often than others. For comparison, we evaluated two additional self-evaluation methods using the same LLM: (i) direct rating, in which each joke was assigned an integer score between 1 and 7, and (ii) importance sampling, in which the LLM was instead prompted to make a binary judgment of whether each joke was funny or not.
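The two baselines can be sketched as follows. Here `llm_rate` and `llm_is_funny` are hypothetical stand-ins for the prompted model calls, and using the empirical rate of "funny" judgments over repeated queries as an importance weight is one plausible reading of the binary-judgment setup, not a confirmed detail of the experiment:

```python
def direct_rating_scores(jokes, llm_rate):
    """Direct rating baseline: llm_rate(joke) -> integer score in 1..7."""
    return {joke: llm_rate(joke) for joke in jokes}

def importance_weights(jokes, llm_is_funny, n_queries=20):
    """Importance-sampling baseline (assumed weighting scheme): repeat
    the binary funny/not-funny judgment and use the empirical "funny"
    rate as each joke's weight. llm_is_funny(joke) -> bool.
    """
    return {
        joke: sum(llm_is_funny(joke) for _ in range(n_queries)) / n_queries
        for joke in jokes
    }
```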
To collect human ratings of these jokes, we conducted an online experiment with 100 participants recruited from the Prolific platform (50 male, 50 female; all fluent English speakers). Each participant rated the jokes on a 7-point funniness scale, and the mean score of each joke was taken as its human-assessed funniness. Participants were compensated at a flat rate of £9 per hour.
The results show a significant positive correlation (Pearson's r=0.23, p<.001) between spMCMC-derived joke frequency and human-assessed funniness of the jokes, indicating that jokes preferred by spMCMC are also more likely to be rated as funnier by humans. Notably, both spMCMC and human evaluation converge on the same joke as the funniest, while also agreeing that the joke produced via greedy decoding with maximum token-level likelihood is not the funniest at the sequence level. Moreover, spMCMC outperforms importance sampling and direct rating in identifying the jokes rated most highly by humans. The results highlight the potential of using such a Markovian process as an efficient self-improvement paradigm.
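For readers reproducing the correlation analysis, Pearson's r between spMCMC visit frequencies and mean human ratings can be computed directly. This is a generic formula sketch, not the project's analysis code:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations (unnormalized; the n factors cancel).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)
```

In practice `scipy.stats.pearsonr` also returns the associated p-value.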
This project has laid the groundwork for further interdisciplinary work: developing a general reasoning framework that implements the spMCMC process more efficiently across diverse domains.