
Sequential Monte Carlo for SLFV

The problem we have in mind throughout the following is to infer parameters of the model (such as the mutation rate or the radius/impact laws) given data at time \(t=0\) consisting of the locations and genetic types of a finite sample of individuals. We aim to treat the ancestral tree connecting the sample as a latent variable.

Sequential Monte Carlo uses importance sampling to generate a high-dimensional object (such as the ancestral tree of our sample of lineages) by iteratively generating its components, using proposals that depend on earlier components. Each run (particle) of the SMC is also weighted iteratively during its generation, which facilitates dynamic resampling at stopping times: here we measure the coefficient of variation of the weights once all the particles have coalesced to fewer than a specified number of active lineages.
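As a minimal sketch of the resampling rule just described (the function names and the threshold value are our own illustrative choices, not from the text), resampling can be triggered whenever the coefficient of variation of the particle weights exceeds a threshold:

```python
import numpy as np

def coefficient_of_variation(weights):
    """CV of the normalised particle weights; a large CV signals weight degeneracy."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w.std() / w.mean()

def maybe_resample(particles, weights, cv_threshold=1.0, rng=None):
    """Multinomial resampling, triggered only when the CV exceeds a threshold.

    `particles` is any sequence of particle states; returns (particles, weights),
    with weights reset to uniform after a resampling step.
    """
    rng = np.random.default_rng() if rng is None else rng
    if coefficient_of_variation(weights) > cv_threshold:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = [particles[i] for i in idx]
        weights = np.ones(len(particles)) / len(particles)
    return particles, weights
```

In practice the check would run at the stopping times mentioned above (when each particle has coalesced below the lineage-count cutoff) rather than after every event.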

We try to generate a likelihood surface by simulating \(p_t^\Pi (\eta | \theta)\), the probability of an observation \(\eta\) given parameters \(\theta\) and environment \(\Pi\), using the proposal distribution
$$
\mathbb{Q}_\Pi(H_{-i}|H_{-(i-1)}, \theta) \propto \mathbb{P}_\Pi(H_{-(i-1)}|H_{-i}, \theta)
$$ in the Monte Carlo algorithm, where \(H_{-i}\) denotes the configuration of active lineages \(i\) events (spatial or mutation) into the past.
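Sampling from the proposal \(\mathbb{Q}_\Pi\) amounts to normalising the forwards-in-time probabilities of the candidate backward events and drawing one; the normalising constant then enters the importance weight. A sketch (the function name is ours, and the candidate probabilities are assumed to have been computed already):

```python
import numpy as np

def sample_backward_event(forward_probs, rng=None):
    """Draw a backward event from Q(.) ∝ P(.): normalise the forwards-in-time
    probabilities of the candidate events and sample an index.

    Returns (index, Z) where Z is the normalising constant, which is needed
    for the particle's importance weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(forward_probs, dtype=float)
    Z = p.sum()
    idx = rng.choice(len(p), p=p / Z)
    return idx, Z
```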

Due to the computational complexity of these calculations we consider the "quenched" SLFV where we first simulate an environment \(\omega \sim \Pi\), compute a conditional likelihood on this environment, and then estimate the full likelihood by Monte Carlo integration.

It can be shown that the conditional likelihood \(p_{t_1}^{\omega}\) can be written as $$p_{t_1}^{\omega}(\eta \mid \theta) = \mathbb{E}_{\omega, \theta}\left[ \prod_{k=0}^{T} f(t(k), \eta(k), i(k)) \right]$$ where \((t(k), \eta(k), i(k))_{k \in \mathbb{N}}\) is a Markov chain with transitions $$(t, \eta, i) \mapsto \begin{cases} (t - \tilde{t},\ \eta - \{x^j, \kappa^j\} + \{x^j, \beta\},\ i) & \text{with density } \frac{e^{-m|\eta|\tilde{t}}}{m|\eta|^2 f(t, \eta, i)} P_{\beta\kappa^j}\, d\tilde{t} \\ (t_{i+1} - t_i,\ \eta - J + \{x, \kappa\} + (|J| - 1)\{\partial_s, \partial_t\},\ i + 1) & \text{with density } \frac{e^{-m|\eta|t}}{f(t, \eta, i)} \frac{u_i^{k}(1 - u_i)^{n_i - k}}{\pi r_i^2}\, dx \end{cases}$$ and $$f(t, \eta, i) = \frac{1 - e^{-m|\eta|t}}{|\eta|} \sum_{j=1}^{|\eta|} \sum_{\beta=1}^{d} P_{\beta\kappa^j} + e^{-m|\eta|t} \sum_{\kappa=1}^{d} \sum_{l=0}^{n_i^{\kappa}} \binom{n_i}{l} u_i^l (1 - u_i)^{n_i - l}$$ where \(m\) is the mutation rate, \(d\) is the number of types, \(n_i\) is the number of lineages inside \(B_{x_i}(r_i)\), and \(n_i^{\kappa}\) is the number of lineages inside \(B_{x_i}(r_i)\) of type \(\kappa\).
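Since each particle's weight is a product of values of \(f\), the formula for \(f\) can be implemented directly from the quantities defined above. A literal sketch (the function name and argument layout are our own; \(P\) is the \(d \times d\) mutation transition matrix with entries \(P_{\beta\kappa}\)):

```python
import numpy as np
from math import comb, exp

def f_weight(t, kappa, type_counts, n_i, u_i, m, P):
    """f(t, eta, i) from the text, computed term by term.

    kappa       : types of the active lineages in eta, values in {0, ..., d-1}
    type_counts : n_i^kappa, lineages of each type inside the ball B_{x_i}(r_i)
    n_i         : total number of lineages inside B_{x_i}(r_i)
    u_i, m      : event impact and mutation rate
    P           : d x d mutation transition matrix, P[beta, kappa]
    """
    eta_size = len(kappa)
    decay = exp(-m * eta_size * t)
    # mutation term: (1 - e^{-m|eta|t}) / |eta| * sum_j sum_beta P[beta, kappa^j]
    mut = (1.0 - decay) / eta_size * sum(P[:, kj].sum() for kj in kappa)
    # event term: e^{-m|eta|t} * sum_kappa sum_{l=0}^{n_i^kappa} C(n_i, l) u^l (1-u)^{n_i - l}
    coal = decay * sum(
        sum(comb(n_i, l) * u_i**l * (1 - u_i) ** (n_i - l) for l in range(c + 1))
        for c in type_counts
    )
    return mut + coal
```

As a sanity check, at \(t = 0\) the mutation term vanishes and only the event term survives, matching the factor \(e^{-m|\eta|t} = 1\).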

The function \(f\) plays the dual role of being the normalising constant of the transition density of the Markov chain and the ratio of normalising constants of the backwards-in-time proposal distribution \(\mathbb{Q}(\cdot \mid \cdot)\) and the corresponding forwards-in-time sequential distribution \(\mathbb{P}(\cdot \mid \cdot)\).

We can estimate the conditional likelihood using the unbiased importance sampling estimator $$\hat{p}_{t_1}^{\omega}(\eta \mid \theta) = \frac{1}{N}\sum_{j=1}^{N} \prod_{k=0}^{T_j} f(t_j(k), \eta_j(k), i_j(k)).$$ From this, we can then perform a second Monte Carlo step to estimate the unconditional likelihood from \(M\) particles with environments \(\omega_i\) by the unbiased estimator $$\widehat{L}(\theta; \eta) := \frac{1}{M} \sum_{i=1}^{M} \hat{p}_{t_1}^{\omega_i}(\eta \mid \theta),$$ which can be shown to have variance $$\text{Var}(\widehat{L}(\theta; \eta)) \leq \frac{N + \mathbb{E}_{\theta}\left[(d + 1)^{2(E_{\eta} + M_{\eta})}\right]}{NM}.$$
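The two-level structure of the estimator can be sketched as follows. Here `simulate_tree_given_env` and `sample_env` are assumed user-supplied functions (not from the text): the former runs the backwards-in-time Markov chain for one particle under a fixed environment and returns the \(f\)-values collected along the way, the latter draws \(\omega \sim \Pi\).

```python
import numpy as np

def conditional_likelihood_hat(simulate_tree, N, rng):
    """Inner estimator \hat p_{t_1}^omega(eta | theta): average over N particles
    of the product of f-values returned by `simulate_tree` (assumed supplied)."""
    weights = [np.prod(simulate_tree(rng)) for _ in range(N)]
    return float(np.mean(weights))

def likelihood_hat(sample_env, simulate_tree_given_env, M, N, seed=0):
    """Outer Monte Carlo over environments omega_i ~ Pi:
    \hat L(theta; eta) = (1/M) sum_i \hat p_{t_1}^{omega_i}(eta | theta)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(M):
        omega = sample_env(rng)
        total += conditional_likelihood_hat(
            lambda r: simulate_tree_given_env(omega, r), N, rng
        )
    return total / M
```

Both levels are plain averages of unbiased terms, which is why the combined estimator remains unbiased and its variance decays like \(1/(NM)\) up to the constant in the bound above.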

Unfortunately, the number of possible coalescent trees compatible with a sample of realistic size is vast, and we have not been able to test this algorithm on any data set large enough to contain a detectable signal. Hence this summary is presented mainly for theoretical interest and to motivate further refinements, and we refer to the section on the faster Approximate Bayesian Computation for results from test problems.