--- title: "APTS Statistical Modelling" subtitle: "Lecture Discussion 2" date: "April 2021" author: Helen Ogden output: beamer_presentation: includes: in_header: preamble.tex --- ## Lecture discussion 2 - Please mute your microphone and turn off video, unless you are asking a question. - Feel free to ask questions about any of the lecture content so far. - Use the chat to ask questions directly, or to state that you have a question. ## How should we model this data? ```{r, echo = FALSE, fig.width = 5, fig.height = 4} set.seed(1) n <- 200 x <- seq(-2, 2, length.out = n) z <- rnorm(n) beta0 <- 2 beta1 <- 0.25 beta2 <- 0.5 eta <- beta0 + beta1 * x + beta2 * z mu <- exp(eta) y <- rpois(n, mu) plot(y ~ x) ``` ## Fitting a Poisson GLM We can fit the Poisson log-linear model \[Y_i \sim \text{Poisson}(\mu_i), \quad \log \mu_i = \beta_0 + \beta_1 x_i.\] \scriptsize ```{r} mod <- glm(y ~ x, family = "poisson") mod ``` ## Fitting a Poisson GLM \scriptsize ```{r, fig.width = 5, fig.height = 4} summary(mod) ``` ## Checking model fit ```{r, fig.width = 5, fig.height = 4} plot(mod, which = 1) ``` ## Checking model fit Are you happy with the model fit? - Yes, the model seems to fit well. - No, there is some problem with the model fit. - Not sure or need more information. ## Plotting the fitted model Adding the fitted mean plus or minus two standard deviations: ```{r, echo = FALSE, fig.width = 5, fig.height = 4} mu_mod <- function(x, mod) { exp(coef(mod)[1] + coef(mod)[2] * x) } sd_mod <- function(x, mod) { sqrt(mu_mod(x, mod)) } plot(y ~ x, col = "gray") curve(mu_mod(x, mod), add = TRUE, lty = 2) curve(mu_mod(x, mod) + 2 * sd_mod(x, mod), add = TRUE, lty = 3) curve(mu_mod(x, mod) - 2 * sd_mod(x, mod), add = TRUE, lty = 3) ``` ## Quasilikelihood approach Model \[E(Y_i) = \mu_i = \exp(\beta_0 + \beta_1 x_i), \quad \text{Var}(Y_i) = \sigma^2 \mu_i\] ```{r} mod_quasi <- glm(y ~ x, family = "quasipoisson") ``` ## Quasilikelihood approach \scriptsize ```{r} summary(mod_quasi) ``` ## Checking model fit Do the assumptions of the quasilikelihood approach seem reasonable here? - Yes, the assumptions are now reasonable. - No, there is some problem with the assumptions. - Not sure or need more information. ## How were the data generated? In fact \[Y_i \mid x_i, z_i \sim \text{Poisson}(\mu_i), \quad \log \mu_i = \beta_0 + \beta_1 x_i + \beta_2 z_i,\] but we **do not observe** the explanatory variable $z_i$ (in this case $z_i \sim N(0, 1)$). We are trying to model the marginal distribution $Y_i \mid x_i$, with particular interest in the marginal mean $\mu(x) = E(Y | x)$ and the marginal variance $\text{Var}(Y | x)$. ## The marginal mean Plotting the true marginal mean and the estimated marginal mean: ```{r, echo = FALSE, fig.width = 5, fig.height = 4} mu_x_each <- function(x) { integrand <- function(z) { exp(beta0 + beta1 * x + beta2 * z) * dnorm(z) } integrate(integrand, -20, 20)$value } mu_x <- function(x) { sapply(x, mu_x_each) } curve(mu_x, from = -2, to = 2, ylab = "E(Y|x)") curve(mu_mod(x, mod), lty = 2, add = TRUE) legend("bottomright", lty = c(1, 2), legend = c("True mean", "Fitted mean"), bty = "n") ``` ## The marginal mean In this case $\log(\mu(x))$ is approximately linear in $x$: ```{r, echo = FALSE, fig.width = 5, fig.height = 4} curve(log(mu_x(x)), from = -2, to = 2, ylab = "log E(Y|x)") curve(log(mu_mod(x, mod)), lty = 2, add = TRUE) legend("bottomright", lty = c(1, 2), legend = c("True log mean", "Fitted log mean"), bty = "n") ``` ## The marginal variance Plotting the true and estimated marginal variances: ```{r, echo = FALSE, fig.width = 5, fig.height = 4} E_Var <- function(x) { integrand <- function(z) { mu <- exp(beta0 + beta1 * x + beta2 * z) mu * dnorm(z) } integrate(integrand, -20, 20)$value } Var_E <- function(x) { integrand <- function(z) { mu <- exp(beta0 + beta1 * x + beta2 * z) mu^2 * dnorm(z) } Emu2 <- integrate(integrand, -20, 20)$value Emu2 - (mu_x(x))^2 } Var_Y_x_each <- function(x) { Var_E(x) + E_Var(x) } Var_Y_x <- function(x) { sapply(x, Var_Y_x_each) } Var_Y_x_mod <- function(x) { mu_x(x) } sigma2_est <- function(object) { df.r <- object$df.residual sum((object$weights * object$residuals^2)[object$weights > 0])/df.r } curve(Var_Y_x, from = -2, to = 2, ylab = "Var(Y|x)", ylim = c(0, 70)) curve(Var_Y_x_mod, col = 2, lty = 2, add = TRUE) sigma2_hat <- sigma2_est(mod_quasi) curve(sigma2_hat * Var_Y_x_mod(x), col = 4, lty = 2, add = TRUE) legend("topleft", lty = c(1, 2, 2), col = c(1, 2, 4), legend = c("Truth", "Fitted (Poisson)", "Fitted (Quasilikelihood)"), bty = "n") ``` ## Ratio of marginal variance The quasilikelihood approach assumes \[\text{Var}(Y | x) = \sigma^2 \mu(x),\] i.e. that the true marginal variance divided by the Poisson model marginal variance \[r(x) = \frac{\text{Var}(Y | x)}{\mu(x)}\] is constant ($\sigma^2$). ## Ratio of marginal variances Plotting the true $r(x)$ with the fitted ratio $\hat \sigma^2$ from quasilikelihood: ```{r, echo = FALSE, fig.width = 5, fig.height = 4} curve(Var_Y_x(x) / Var_Y_x_mod(x), from = -2, to = 2, ylab = "r(x)") abline(h = sigma2_hat, lty = 2) legend("topleft", lty = c(1, 2), legend = c("True ratio", "Fitted ratio (Quasilikelihood)"), bty = "n") ``` ## Summary: overdispersion and quasilikelihood approaches - Overdispersion relative to Poisson model is **very** common in count data. - Failing to observe an important explanatory variable is one cause of overdispersion. - The assumption of quasilikelihood $\text{Var}(Y | x) = \sigma^2 \mu(x)$ might not be accurate (but a big improvement over original model). Questions on lecture content so far?