# Lab session 5

1. Download infant_tidy.rds from the course website and import it. This is a tidied version of the infant data that we used in the Data Handling practical.

Use the complete.cases function (from the stats package) to create a logical vector indicating cases (rows) with complete data (no missing values).

Fit a linear model regressing the infant birth weight, bwt, on gestation and the maternal characteristics wt, ht, parity and race, using the subset argument to fit the model to the complete cases only.

Use update to add the smoking variables smoke, time and number. Compare the two models using anova.
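
The same workflow can be sketched on R's built-in airquality data, which also contains missing values. The data set, the choice of variables and the object names below are stand-ins chosen for illustration; this is not the infant data or a model answer.

```r
# Illustration on the built-in airquality data (an assumption for this sketch;
# the exercise itself uses the infant data).
vars <- c("Ozone", "Temp", "Wind", "Solar.R")

# TRUE for rows with no missing values in any of the model variables
complete <- complete.cases(airquality[, vars])

# Fit the smaller model to the complete cases only
fit1 <- lm(Ozone ~ Temp + Wind, data = airquality, subset = complete)

# Add a term; update() re-uses the original call, including the subset argument
fit2 <- update(fit1, . ~ . + Solar.R)

# F test comparing the two nested models, both fitted to the same rows
anova(fit1, fit2)
```

Restricting both fits to rows that are complete for every variable matters here: if each model dropped its own missing rows, the two fits could use different observations and the comparison would not be valid.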

2. The following code plots a histogram of a variable from the trees example data set, then overlays a density curve:

his <- hist(trees$Girth, freq = FALSE)
dens <- density(trees$Girth)
ymax <- max(his$density, dens$y)
plot(his, freq = FALSE, ylim = c(0, ymax), xlab = "Girth",
     main = "Histogram of Girth")
lines(dens)

This code is provided in the file Practical_5_Starter_Code.R on the course website, along with the other code chunks displayed in this worksheet. Run the code to try it out. Use this starter code to create a function enabling you to make this type of plot for any variable x, with a custom label for the x axis and a custom title.

Use your new function to re-create the plot for the Girth variable from the trees data. Then use it again to create a similar plot for the Height variable. Make sure you can update the label for the x axis and the title using your function!
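
One possible shape for such a function is sketched below. The function name `hist_density` and its argument names are hypothetical, and the sketch computes the histogram with `plot = FALSE` rather than `freq = FALSE`, so that only the final plot is drawn.

```r
# A sketch only; the function and argument names are hypothetical.
hist_density <- function(x, xlab = "x", main = "Histogram of x") {
  his <- hist(x, plot = FALSE)   # compute the bins without drawing them
  dens <- density(x)             # kernel density estimate
  # make the y axis tall enough for both the bars and the curve
  ymax <- max(his$density, dens$y)
  plot(his, freq = FALSE, ylim = c(0, ymax), xlab = xlab, main = main)
  lines(dens)
  invisible(his)                 # return the histogram object quietly
}

hist_density(trees$Girth, xlab = "Girth", main = "Histogram of Girth")
hist_density(trees$Height, xlab = "Height", main = "Histogram of Height")
```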

3. The following code creates a ggplot version of a plot from the R Orientation notes:

library(ggplot2)
mycol <- c("blue", "orange", "green")
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
    geom_point() +
    scale_color_manual(values = mycol) +
    xlab("Petal Length") +
    ylab("Petal Width") +
    theme_light() +
    theme(legend.key = element_blank()) # remove boxes in legend

Use this code to create a function enabling you to make this type of plot for x and y variables from any data set, with any grouping variable and with custom colours. Put the call to library(ggplot2) at the start of your function to ensure this package is loaded when the function is called. Use aes_string to specify the aesthetics. Set the default value for the point colours to the colours used in the starter code.

Use your new function to re-create the plot of Petal.Length vs. Petal.Width. Then use it to create a similar plot for Sepal.Length vs. Sepal.Width. Try changing the colours.
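
A minimal sketch of how such a function might look follows. The name `scatter_by_group` and all argument names are hypothetical. It uses `aes_string` as the exercise asks; recent ggplot2 releases flag `aes_string` as deprecated, so expect a warning there.

```r
# A sketch only; the function and argument names are hypothetical.
scatter_by_group <- function(data, x, y, group,
                             colours = c("blue", "orange", "green"),
                             xlab = x, ylab = y) {
  library(ggplot2)  # loaded inside the function, as the exercise suggests
  # aes_string() takes the variable names as character strings
  ggplot(data, aes_string(x = x, y = y, color = group)) +
    geom_point() +
    scale_color_manual(values = colours) +
    xlab(xlab) +
    ylab(ylab) +
    theme_light() +
    theme(legend.key = element_blank())  # remove boxes in legend
}

p <- scatter_by_group(iris, "Petal.Length", "Petal.Width", "Species",
                      xlab = "Petal Length", ylab = "Petal Width")
p
```

Passing the variable names as strings is what lets the same function work for any data frame; the axis labels default to the variable names unless overridden.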

4. The football data presented in the talk was originally stored in text files with fixed width columns. Some example files are given on the course website: “2008-9.txt” and “2009-10.txt”. Open one of the files (e.g. in your web browser or in a text editor) to look at the format. This is a fixed width format: each column has 3 characters and spaces delimit the columns. It is tricky to read in because the missing values are given as 3 spaces.

This data can be read in with import, but in this case it is easier to use the read_table function from readr:

library(readr)
library(dplyr)
# read one season's file, then name the first column
read_table("2008-9.txt") %>%
    rename(Home = X1)

Since rio depends on readr you will already have readr installed. As there is no column name for the first column, readr names this X1; the code above renames the column to Home, since it represents the home team.

The scores for each game are spread across multiple columns, with one column per away team. Load the tidyr package, then extend the data pipeline above to gather the scores into a column named Score, keyed by a variable named Away for the away team. When the home team and away team are the same, the score is an empty character string "". Continue the pipeline to filter out these values. Finally separate the score into two new variables, Home Score and Away Score.
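
The reshaping steps can be tried out on a small made-up table before applying them to the real files. In the sketch below the team codes, the results, the "home-away" score format and the column names `HomeScore` and `AwayScore` are all assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up results in wide format (one column per away team); the team codes
# and the "home-away" score format are assumptions for this sketch.
wide <- data.frame(
  Home = c("ARS", "CHE", "LIV"),
  ARS  = c("", "2-1", "0-0"),
  CHE  = c("1-1", "", "3-2"),
  LIV  = c("0-2", "1-0", ""),
  stringsAsFactors = FALSE
)

long <- wide %>%
  gather(key = "Away", value = "Score", -Home) %>%   # one row per fixture
  filter(Score != "") %>%                            # drop team vs. itself
  separate(Score, into = c("HomeScore", "AwayScore"),
           sep = "-", convert = TRUE)                # split and make integer
long
```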

Create a new function to run the whole data pipeline, from importing the data to separating the scores. Load readr, tidyr and dplyr inside the function. Write the function with one argument that allows you to change the name of the data file. Test your function on the 2008-9 season data.

Load the purrr package. Use map_df to map a list of the names of the example football files to the argument of your new function, so that each data file is read in, processed and added to a combined data frame. Name the elements of the list passed to map_df, so that you can use the .id argument of map_df to add a column identifying the data for each season.
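
The effect of naming the list elements can be seen in a small sketch. Here `read_season` is a hypothetical stand-in for the pipeline function: it does not read the file, it only returns a one-row data frame so that the example runs without the data.

```r
library(purrr)

# Named list: the element names become the values of the .id column
files <- list("2008-9" = "2008-9.txt", "2009-10" = "2009-10.txt")

# Hypothetical stand-in for the pipeline function; it does not read the file,
# it just returns a one-row data frame so the sketch runs without the data.
read_season <- function(file) {
  data.frame(File = file, stringsAsFactors = FALSE)
}

# Apply the function to each file name and row-bind the results
combined <- map_df(files, read_season, .id = "Season")
combined
```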

## Optional extra - for those of you that are still keen!

1. Load the knitr and forcats packages. If you did not do the extra activity in Practical 1, you may need to install forcats.

We are going to create some frequency tables of variables in the infant data. The smoke variable contains some missing values and we would like to include these in the table. The fct_explicit_na function in forcats will expand the levels of a factor to create a new level for the missing values; see the help file for more detail.

Start a data pipeline with the infant data, then use mutate to create a new factor based on smoke, with an extra level for the missing values. Continuing the pipeline, group by your new factor, then summarise each group by counting the number of values with the function n. End your pipeline with a call to kable, to create a markdown version of the frequency table, in which the first column is left-aligned, the second column is centre-aligned and the columns are given the labels "Smoking history" and "Count".
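
The steps above can be sketched on a toy data frame. The data values and the names `toy`, `smoke_na` and `count` are made up for illustration. Newer forcats versions deprecate `fct_explicit_na` in favour of `fct_na_value_to_level`, so a warning may appear.

```r
library(dplyr)
library(forcats)
library(knitr)

# Toy data standing in for the infant data; the values are made up.
toy <- data.frame(
  smoke = factor(c("never", "now", NA, "never", NA, "now", "never"))
)

tab <- toy %>%
  mutate(smoke_na = fct_explicit_na(smoke)) %>%  # NA becomes its own level
  group_by(smoke_na) %>%
  summarise(count = n()) %>%                     # rows per level
  kable(align = c("l", "c"),
        col.names = c("Smoking history", "Count"))
tab
```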

Create a function from your data pipeline so that you can create a kable for any variable, with a custom label for the category column. Use the rename trick to be able to use a character string to specify the variable to be tabulated. Use your function to recreate the table for the smoking category.

Use pmap to parallel map values for the two arguments to your function as follows:

| var | label |
|---|---|
| "smoke" | "Smoking history" |
| "time" | "Time since quitting" |
| "number" | "Cigarettes/day" |
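
A small sketch shows how pmap pairs up the values. Here `describe_var` is a hypothetical stand-in for the table function, used so that the example runs on its own.

```r
library(purrr)

# Hypothetical stand-in for the table function, so the sketch runs on its own
describe_var <- function(var, label) {
  paste0(label, " [", var, "]")
}

# One list element per argument; the element names match the argument names
args <- list(
  var   = c("smoke", "time", "number"),
  label = c("Smoking history", "Time since quitting", "Cigarettes/day")
)

# Calls describe_var once per row of the table of values
out <- pmap(args, describe_var)
out
```

Because the list elements are named after the function's arguments, the order of the elements in the list does not matter.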