---
title: "APTS Statistical Modelling"
subtitle: "Lecture Discussion 1"
date: "April 2021"
author: Helen Ogden
output:
  beamer_presentation:
    includes:
      in_header: preamble.tex
---
## Lecture discussion 1
> All models are wrong, but some models are useful
>
> -- *George Box (1919--2013)*
- Please mute your microphone and turn off video, unless you are asking a question.
- Feel free to ask questions about any of the lecture content so far.
- Use the chat to ask questions directly, or to state that you have a question.
- For the two computer labs, if you would like to be assigned to a group, request this on the
Statistical Modelling Teams channel by 13:00 today.
## How should we model this data?
```{r, echo = FALSE, fig.width = 5, fig.height = 4}
set.seed(1)
n <- 100
x <- rnorm(n)
beta0 <- 0.5
beta1 <- 1.2
mu <- beta1 * x + beta0
sigma <- 0.5
a <- sigma * sqrt(12)/2
epsilon <- runif(n, -a, a)
y <- mu + epsilon
plot(y ~ x)
```
## Fitting a linear model
We can fit the normal linear model
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]
```{r}
mod <- lm(y ~ x)
mod
```
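As a quick sketch (not on the original slide), the estimates can be compared with the true values $\beta_0 = 0.5$ and $\beta_1 = 1.2$ used to simulate the data:

```{r}
confint(mod)  # 95% confidence intervals; these will typically cover the true values
sigma(mod)    # estimated error standard deviation (the true value is 0.5)
```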
## Checking model fit
```{r, fig.width = 5, fig.height = 4, echo = FALSE}
plot(mod, which = 1)
```
## Checking model fit
```{r, fig.width = 5, fig.height = 4, echo = FALSE}
plot(mod, which = 2)
```
## Checking model fit
Are you happy with the model fit?
- Yes, the model seems to fit well.
- No, there is some problem with the model fit.
- Not sure or need more information.
## Robustness to model misspecification
In reality, the data were generated from
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim U(-a, a),\]
(with $\beta_0 = 0.5$, $\beta_1 = 1.2$ and $a = 0.87$, chosen so that
$\text{var}(\epsilon_i) = 0.25$.)
We fit the normal linear model
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]
\pause
Will our inference be correct, even though the model is wrong?
- Yes, the inference will be correct because the linear model is robust to misspecification of the error distribution.
- No, the model was incorrectly specified so the inference will be incorrect.
- Not sure or need more information.
## Estimating the mean response
If we are just interested in the mean response, our fitted
mean
\[\hat \mu (x) = \hat \beta_0 + \hat \beta_1 x\]
is a consistent estimator of the true mean $\mu(x)$.
```{r, echo = FALSE, fig.width = 5, fig.height = 4}
plot(y ~ x, col = "gray")
abline(mod, lty = 1)
abline(a = beta0, b = beta1, lty = 2, col = 2)
legend("bottomright", lty = c(1, 2), col = c(1, 2), legend = c("Fitted mean", "True mean"))
```
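A quick sketch of this consistency (not in the lecture; the variable names are illustrative): refitting on a much larger simulated sample, the estimates approach the true values even though the uniform errors are misspecified as normal.

```{r}
# With a larger sample, the least-squares estimates converge to the true
# (beta0, beta1), despite the wrong error distribution
set.seed(2)
n_big <- 1e5
x_big <- rnorm(n_big)
y_big <- 0.5 + 1.2 * x_big + runif(n_big, -0.87, 0.87)
coef(lm(y_big ~ x_big))  # close to (0.5, 1.2)
```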
## Estimating tail probabilities
If we are interested in some aspect other than the mean response,
the model misspecification matters: for example, if we estimate
$P(Y > 1 | x)$, we will no longer get a consistent estimator from
our model.
```{r, echo = FALSE, fig.width = 5, fig.height = 4}
beta0_hat <- coef(mod)[1]
beta1_hat <- coef(mod)[2]
sigma_hat <- sigma(mod)
tail_prob_true <- function(x) {
  punif(1 - beta0 - beta1 * x, min = -a, max = a, lower.tail = FALSE)
}
tail_prob_est <- function(x) {
  pnorm(1 - beta0_hat - beta1_hat * x, sd = sigma_hat, lower.tail = FALSE)
}
tail_prob_limit <- function(x) {
  pnorm(1 - beta0 - beta1 * x, sd = sigma, lower.tail = FALSE)
}
curve(tail_prob_est, from = -2, to = 2, xlab = "x", ylab = "P(Y > 1 | x)")
curve(tail_prob_true, lty = 2, col = 2, add = TRUE)
curve(tail_prob_limit, lty = 3, col = 1, add = TRUE)
legend("bottomright", lty = c(1, 2, 3), col = c(1, 2, 1), legend = c("Fitted", "True", "Limit"))
```
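A minimal sketch of why the estimator is inconsistent (not on the original slide): even plugging in the *true* parameter values, the normal model gives a different tail probability wherever the uniform and normal distribution functions disagree.

```{r}
# Tail probability P(Y > 1 | x = 0) under the true uniform errors versus
# under normal errors with the same mean and variance
beta0 <- 0.5; beta1 <- 1.2
sigma <- 0.5
a <- sigma * sqrt(3)  # so that var(epsilon) = sigma^2
x0 <- 0
p_true <- punif(1 - beta0 - beta1 * x0, min = -a, max = a, lower.tail = FALSE)
p_norm <- pnorm(1 - beta0 - beta1 * x0, sd = sigma, lower.tail = FALSE)
c(true = p_true, normal_limit = p_norm)  # approx 0.21 vs 0.16
```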
## Comparing against a quadratic model
We could compare against a linear model with quadratic dependence on $x$:
\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]
```{r}
mod2 <- lm(y ~ poly(x, 2))
c(AIC(mod), AIC(mod2))
```
\pause
AIC prefers the simpler model, but this should **not** be taken as evidence that
the simpler model is correct: it only compares the candidate models with each other.
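As a sketch of the contrast with model checking (not in the slides; the names are illustrative): with a much larger simulated sample, a normal Q-Q plot of the residuals exposes the short-tailed uniform errors, which the AIC comparison between the two normal-error models cannot do.

```{r}
# The uniform errors have lighter tails than the normal, which shows up
# as an S-shaped Q-Q plot once the sample is large enough
set.seed(3)
n_big <- 1e4
x_big <- rnorm(n_big)
y_big <- 0.5 + 1.2 * x_big + runif(n_big, -0.87, 0.87)
res <- resid(lm(y_big ~ x_big))
qqnorm(res)
qqline(res)
```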
## Summary: impacts of model misspecification
- Robustness to model misspecification is dependent on the quantity of interest.
- Even if a model is very close to the truth (according to KL divergence), it can
still give incorrect answers about some quantities of interest.
- Model selection criteria are not a substitute for model checking.