---
title: "APTS Statistical Modelling"
subtitle: "Lecture Discussion 2"
date: "April 2021"
author: Helen Ogden
output:
  beamer_presentation:
    includes:
      in_header: preamble.tex
---

## Lecture discussion 2

- Please mute your microphone and turn off video, unless you are asking a question.
- Feel free to ask questions about any of the lecture content so far.
- Use the chat to ask questions directly, or to state that you have a question.

## How should we model this data?

```{r, echo = FALSE, fig.width = 5, fig.height = 4}
set.seed(1)
n <- 200

x <- seq(-2, 2, length.out = n)
z <- rnorm(n)

beta0 <- 2
beta1 <- 0.25
beta2 <- 0.5
eta <- beta0 + beta1 * x + beta2 * z
mu <- exp(eta)
y <- rpois(n, mu)

plot(y ~ x)
```

## Fitting a Poisson GLM

We can fit the Poisson log-linear model
\[Y_i \sim \text{Poisson}(\mu_i), \quad \log \mu_i = \beta_0 + \beta_1 x_i.\]

\scriptsize
```{r}
mod <- glm(y ~ x, family = "poisson")
mod
```
## Fitting a Poisson GLM

\scriptsize
```{r, fig.width = 5, fig.height = 4}
summary(mod)
```

## Checking model fit

```{r, fig.width = 5, fig.height = 4}
plot(mod, which = 1)
```

## Checking model fit

Are you happy with the model fit?

- Yes, the model seems to fit well.
- No, there is some problem with the model fit.
- Not sure or need more information.


## Plotting the fitted model

Adding the fitted mean plus or minus two standard deviations:

```{r, echo = FALSE, fig.width = 5, fig.height = 4}
mu_mod <- function(x, mod) {
  exp(coef(mod)[1] + coef(mod)[2] * x)
}
sd_mod <- function(x, mod) {
    sqrt(mu_mod(x, mod))
}

plot(y ~ x, col = "gray")
curve(mu_mod(x, mod), add = TRUE, lty = 2)
curve(mu_mod(x, mod) + 2 * sd_mod(x, mod), add = TRUE, lty = 3)
curve(mu_mod(x, mod) - 2 * sd_mod(x, mod), add = TRUE, lty = 3)
```

## Quasilikelihood approach

Model \[E(Y_i) = \mu_i = \exp(\beta_0 + \beta_1 x_i), \quad \text{Var}(Y_i) = \sigma^2 \mu_i\]

```{r}
mod_quasi <- glm(y ~ x, family = "quasipoisson")
```

## Quasilikelihood approach

\scriptsize
```{r}
summary(mod_quasi)
```

## Checking model fit

Do the assumptions of the quasilikelihood approach seem reasonable here?

- Yes, the assumptions are now reasonable.
- No, there is some problem with the assumptions.
- Not sure or need more information.

## How were the data generated?

In fact
\[Y_i \mid x_i, z_i \sim \text{Poisson}(\mu_i), \quad \log \mu_i = \beta_0 + \beta_1 x_i + \beta_2 z_i,\]
but we **do not observe** the explanatory variable $z_i$ (in this case $z_i \sim N(0, 1)$).

We are trying to model the marginal distribution $Y_i \mid x_i$, with
particular interest in the marginal mean $\mu(x) = E(Y | x)$ and the marginal variance
$\text{Var}(Y | x)$.

## The marginal mean

Plotting the true marginal mean and the estimated marginal mean:

```{r, echo = FALSE, fig.width = 5, fig.height = 4}
mu_x_each <- function(x) {
    integrand <- function(z) {
        exp(beta0 + beta1 * x + beta2 * z) * dnorm(z)
    }
    integrate(integrand, -20, 20)$value
}
mu_x <- function(x) {
    sapply(x, mu_x_each)
}

curve(mu_x, from = -2, to = 2, ylab = "E(Y|x)")
curve(mu_mod(x, mod), lty = 2, add = TRUE)
legend("bottomright", lty = c(1, 2), legend = c("True mean", "Fitted mean"),
       bty = "n")
```

## The marginal mean

In this case $\log(\mu(x))$ is approximately linear in $x$:

```{r, echo = FALSE, fig.width = 5, fig.height = 4}
curve(log(mu_x(x)), from = -2, to = 2, ylab = "log E(Y|x)")
curve(log(mu_mod(x, mod)), lty = 2, add = TRUE)
legend("bottomright", lty = c(1, 2), legend = c("True log mean", "Fitted log mean"),
       bty = "n")
```

## The marginal variance

Plotting the true and estimated marginal variances:

```{r, echo = FALSE, fig.width = 5, fig.height = 4}
E_Var <- function(x) {
    integrand <- function(z) {
        mu <- exp(beta0 + beta1 * x + beta2 * z)
        mu  * dnorm(z)
    }
    integrate(integrand, -20, 20)$value
}

Var_E <- function(x) {
    integrand <- function(z) {
        mu <- exp(beta0 + beta1 * x + beta2 * z)
        mu^2  * dnorm(z)
    }
    Emu2 <- integrate(integrand, -20, 20)$value
    Emu2 - (mu_x(x))^2
}

Var_Y_x_each <- function(x) {
    Var_E(x) + E_Var(x)
}

Var_Y_x <- function(x) {
    sapply(x, Var_Y_x_each)
}

Var_Y_x_mod <- function(x) {
    mu_x(x)
}


sigma2_est <- function(object) {
    df.r <- object$df.residual
    sum((object$weights * object$residuals^2)[object$weights >  0])/df.r
}


curve(Var_Y_x, from = -2, to = 2, ylab = "Var(Y|x)", ylim = c(0, 70))
curve(Var_Y_x_mod, col = 2, lty = 2, add = TRUE)
sigma2_hat <- sigma2_est(mod_quasi)
curve(sigma2_hat * Var_Y_x_mod(x), col = 4, lty = 2, add = TRUE)
legend("topleft", lty = c(1, 2, 2), col = c(1, 2, 4),
       legend = c("Truth", "Fitted (Poisson)", "Fitted (Quasilikelihood)"),
       bty = "n")
```

## Ratio of marginal variance

The quasilikelihood approach assumes \[\text{Var}(Y | x) = \sigma^2 \mu(x),\]
i.e. that
the true marginal variance divided by the Poisson model marginal
variance \[r(x) = \frac{\text{Var}(Y | x)}{\mu(x)}\]
is constant ($\sigma^2$).

## Ratio of marginal variances

Plotting the true $r(x)$ with the fitted ratio $\hat \sigma^2$ from quasilikelihood:

```{r, echo = FALSE, fig.width = 5, fig.height = 4}
curve(Var_Y_x(x) / Var_Y_x_mod(x), from = -2, to = 2, ylab = "r(x)")
abline(h = sigma2_hat, lty = 2)
legend("topleft", lty = c(1, 2),
       legend = c("True ratio", "Fitted ratio (Quasilikelihood)"),
       bty = "n")
```

## Summary: overdispersion and quasilikelihood approaches

- Overdispersion relative to Poisson model is **very** common in count data.
- Failing to observe an important explanatory variable is one cause of overdispersion.
- The assumption of quasilikelihood $\text{Var}(Y | x) = \sigma^2 \mu(x)$ might not be accurate (but a big improvement over original model).

Questions on lecture content so far?