APTS CIS Lab 2 Solutions
Markov Chains and Monte Carlo
Paul Jenkins (p.jenkins@warwick.ac.uk)
Richard Everitt (richard.everitt@warwick.ac.uk)
May 2022

A warmup. In a simplified model of the game of Monopoly, we consider the motion of the piece around a loop of 40 spaces. We can model this as a Markov chain with a state space consisting of the integers \(0,\dots,39\) in which the transition kernel adds the result of two sixsided dice to the current state modulo 40 to obtain the new state.

Implement a piece of R code which simulates this Markov chain.
#Increment probabilities ivals < c(2,3,4,5,6,7,8,9,10,11,12) iprobs < c(1,2,3,4,5,6,5,4,3,2,1) / 36 rmonopoly < function(n=1,x0=0) { #Sample n increments is < sample(ivals, size=n, replace=TRUE, prob=iprobs) x < (x0 + cumsum(is)) %% 40 }

Run the code for a large number of iterations, say \(100,000\), and plot a histogram of the states visited.
Taking a little care over the location of the bins, noting that \(X\) lives in discrete space:
X < rmonopoly(1000000) hist(X,breaks=seq(0.5,39.5,1), freq=FALSE)

Based on the output of the chain, would you conjecture that there is an invariant distribution for this Markov chain? If so, what?
This histogram is quite strongly suggestive that the invariant distribution might be uniform. (Indeed this seems logical: the sum of many independent variables of finite variance will ultimately look very like a normal random variable and will have variance linearly increasing with the number of terms. Taking this value, modulo 40, it will wrap around many times and the resulting density will become increasingly flat.)

Write the transition kernel down mathematically.
There are lots of ways of doing this, letting \(K_{ij} := {\mathbb{P}}\left(X_{t+1} = j  X_{t} = i\right)\), we have: \[\begin{aligned} K_{ij} &= \frac{1}{36} \times\left\{ \begin{array}{ll} 1 & (j  i) \bmod 40 = 2 \text{ or } (j  i) \bmod 40 = 12 \\ 2 & (j  i) \bmod 40 = 3 \text{ or } (j  i) \bmod 40 = 11 \\ 3 & (j  i) \bmod 40 = 4 \text{ or } (j  i) \bmod 40 = 10 \\ 4 & (j  i) \bmod 40 = 5 \text{ or } (j  i) \bmod 40 = 9 \\ 5 & (j  i) \bmod 40 = 6 \text{ or } (j  i) \bmod 40 = 8 \\ 6 & (j  i) \bmod 40 = 7 \end{array} \right.\end{aligned}\]

Check whether the Markov kernel you have written down is invariant with respect to any distribution conjectured in part (c).
Let \(f_i = \frac{1}{40}\mathbb{I}_{\{0,\dots,39\}}(i)\) be the putative invariant distribution: \[\begin{aligned} \sum_i f_i K_{ij} &= \frac{1}{40} \sum_i K_{ij}\\ &= \frac{1}{40} \left( 1+2+3+4+5+6+5+4+3+2+1 \right) / 36 = 1/40 = f_j.\end{aligned}\] where the second equality follows by noticing that \((j  i) \bmod 40 = k\) for exactly one \(i\) for every pair \((j, k) \in \{0,\dots,39\}\otimes\{2,\dots,12\}\}\).


Recall the Poisson changepoint model discussed in lectures, and on p2122 of the supporting notes, and think about the following closely related model: Observations \(y_1,\dots,y_n\) comprise a sequence of \(M\) iid \({\textsf{N}\left( \mu_1,1 \right)}\) random variables followed by a second sequence of \(nM\) iid \({\textsf{N}\left( \mu_2,1 \right)}\) random variables. \(M\), \(\mu_1\) and \(\mu_2\) are unknown. The prior distribution over \(M\) is a discrete uniform distribution on \(\{1,\dots,n1\}\) (there is at least one observation of each component). The prior distribution over \(\mu_i\) (\(i=1,2\)) is \({\textsf{N}\left( 0,10^2 \right)}\). The three parameters are treated as being a priori independent.

Write down the joint density of \(y_1,\dots,y_n,\mu_1,\mu_2\) and \(M\), and obtain the posterior distribution of \(\mu_1,\mu_2\) and \(M\), up to proportionality, in as simple a form as you can.
\[\begin{aligned} p(y_1,\dots,y_n,\mu_1,\mu_2,M) &= p(M) p(\mu_1) p(\mu_2) p(y_1,\dots,y_n\mu_1,\mu_2,M)\\ &= \frac{1}{n1} \mathbb{I}_{\{1,\dots,n1\}}(M) \frac{1}{\sqrt{200\pi}} \exp\left(\frac{\mu_1^2}{2\cdot 10^2}\right) \frac{1}{\sqrt{200\pi}} \exp\left(\frac{\mu_2^2}{2\cdot 10^2}\right) \cdot\\ &\phantom{=} \prod_{i=1}^M \frac{1}{\sqrt{2\pi}} \exp\left(\frac{(y_i\mu_1)^2}{2}\right) \prod_{i=M+1}^n \frac{1}{\sqrt{2\pi}} \exp\left(\frac{(y_i\mu_2)^2}{2}\right)\\ &\propto \mathbb{I}_{\{1,\dots,n1\}}(M) \exp\left(\frac{\mu_1^2+\mu_2^2}{2\cdot 10^2}\right)\exp\left(\sum_{i=1}^M \frac{(y_i\mu_1)^2}{2}  \sum_{i=M+1}^n \frac{(y_i\mu_2)^2}{2}\right)\end{aligned}\]

Find the “full conditional” distributions of \(\mu_1\), \(\mu_2\) and \(M\). (i.e. the conditional distributions of each of these variables given all other variables).
\(M  y_1,\dots,y_n, \mu_1, \mu_2\) has a discrete distribution on \(1,\dots,n1\) with \[{{\mathbb{P}}_{}\left(M=m\right)} \propto \exp\left(\sum_{i=1}^m \frac{(y_i\mu_1)^2}{2}  \sum_{i=m+1}^n \frac{(y_i\mu_2)^2}{2}\right)\] which can be easily (if not cheaply) sampled from.
The two mean parameters can be readily seen to be normally distributed (complete the square in \(\mu_1\) or \(\mu_2\), respectively, in the exponential terms which involve each parameter), and: \[\begin{aligned} p(\mu_1\mu_2,M,y_1,\dots,y_n) &\propto \exp\left(\frac{\mu_1^2}{2\cdot 10^2}\right) \exp\left(\sum_{i=1}^M \frac{(y_i\mu_1)^2}{2} \right)\\ &\propto \exp\left(  \frac{\mu_1^2 + 100 M \mu_{1}^2  200 \mu_{1} \sum_{i=1}^M y_i}{2 \cdot 10^2}\right)\\ &\propto \exp\left(  \frac{\mu_1^2  200 \mu_1 \sum_{i=1}^M y_i / (1+100M)}{2 \cdot 100 / (1+100M)} \right)\\ &\propto \textsf{N} \left(\mu_1 ; \sum_{i=1}^M 100 y_i / (1+100M),100/(1+100M) \right)\end{aligned}\] and similarly for \(\mu_2\), but with the remaining \(nM\) observations: \[\begin{aligned} p(\mu_2\mu_1,M,y_1,\dots,y_n) &= \textsf{N} \left(\mu_2 ; \sum_{i=M+1}^n 100 y_i / (1+100(nM)),100/(1+100(nM)) \right)\end{aligned}\]

Implement a Gibbs sampler making use of these full conditional distributions in order to target the posterior distribution identified in part (b).
sample.mu < function(obs,n.obs) { mean < 100 * sum(obs) / (1+100*n.obs) var < 100 / (1+100 * n.obs) rnorm(1,mean,sqrt(var)) } sample.M < function(obs,n.obs,mu1,mu2) { log.p < c() for(i in 1:(n.obs1)) { log.p[i] < (sum((obs[1:i]mu1)^2) + sum((obs[(i+1):n.obs]mu2)^2)) / 2 } log.p < log.p  max(log.p) p < exp(log.p) p < p / sum(p) sample.int(n=n.obs1,size=1,replace=TRUE,prob=p) } gibbs.ncpm < function(n.it=10000, y, M = length(y)/2, mu1 = 1, mu2 = +1) { n < length(y) Ms < c() mu1s < c() mu2s < c() Ms[1] < sample.M(y,n,mu1,mu2) mu1s[1] < sample.mu(y[1:Ms[1]],Ms[1]) mu2s[1] < sample.mu(y[(Ms[1]+1):n],nMs[1]) for (i in 2:n.it) { Ms[i] < sample.M(y,n,mu1s[i1],mu2s[i1]) mu1s[i] < sample.mu(y[1:Ms[i]],Ms[i]) mu2s[i] < sample.mu(y[(Ms[i]+1):n],nMs[i]) } list(M = Ms, mu1 = mu1s, mu2=mu2s) }
The only potentially surprising feature in the above code is perhaps the simulation of \(M\). The use of logarithms and the subtraction of the maximum is solely for numerical reasons and is generally a good idea – combinations of very large / very small values can otherwise lead to strange behaviour in some circumstances. Google logsumexp for further details…

Simulate some data from the model for various parameter values and test your Gibbs sampler.
This works reasonably well for most reasonable parameter values. Here’s an example:
y < c(rnorm(23, 1,1),rnorm(17,2,1)) g < gibbs.ncpm(1E4,y,10,2,4)
which produces the following histograms for \(M\), \(\mu_1\) and \(\mu_2\), respectively:
Actually, it is a good idea to consider at least bivariate marginals. The
GGally
package makes this relatively easy to do.library('GGally') g$M < as.factor(g$M) gdf < data.frame(g) ggpairs(gdf,cardinality_threshold=40,diag=list(discrete=wrap("barDiag")),lower=list(combo=wrap(ggally_dot)))
produces the following:

How might you extend this algorithm if instead of a changepoint model you had a mixture model in which every observation is drawn from a mixture, i.e.: \[Y_1,\dots,Y_n \overset{\text{iid}}{\sim}p \textsf{N} \left(\cdot ; \mu_1,1 \right) + (1p) \textsf{N} \left(\cdot ; \mu_2,1 \right).\] (The likelihood is now \(\prod_{i=1}^n [p \textsf{N} \left(y_i ; \mu_1,1 \right) + (1p) \textsf{N} \left(y_i ; \mu_2,1 \right)]\), with \(p\), \(\mu_1\), and \(\mu_2\) unknown and \(M\) no longer a parameter of the model.)
Consider the following things:

The prior distribution over \(p\).
For Gibbs sampling to work we require conjugacy, a \({\textsf{U}{\left[0,1\right]}}\) prior would suffice, but is a special case of the Beta distribution, any member of which would lead to an implementable algorithm.

Any other variables you may need to introduce.
As in the examples considered in lectures, we’d need a latent allocation variable for every observation indicating which of the two components it’s treated as coming from in the completed model.

The resulting algorithm.
A standard Gibbs Sampler should work adequately once these prior distributions and latent variables have been specified.

