Lab session 5
DRY Practical
Heather Turner
28 September 2016
-
Download `infant_tidy.rds` from the course website and import it. This is a tidied version of the infant data that we used in the Data Handling practical.

Use the `complete.cases` function (from the stats package) to create a logical vector indicating cases (rows) with complete data (no missing values).

Fit a linear model regressing the infant birth weight, `bwt`, on `gestation` and the maternal characteristics `wt`, `ht`, `parity` and `race`, using the `subset` argument to fit the model to the complete cases only.

Use `update` to add the smoking variables `smoke`, `time` and `number`. Compare the two models using `anova`.
-
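As an illustration of the model-fitting workflow in the exercise above, here is a minimal sketch on the built-in `airquality` data rather than the infant data. The choice of variables is an assumption made purely for demonstration.

```r
# Built-in airquality data: Ozone and Solar.R contain missing values
cc <- complete.cases(airquality)

# Base model, fitted to the complete cases only
fit1 <- lm(Ozone ~ Temp, data = airquality, subset = cc)

# Add further predictors; update() re-uses the original call,
# so the subset argument is carried over
fit2 <- update(fit1, . ~ . + Wind + Solar.R)

# F test comparing the two nested models
comparison <- anova(fit1, fit2)
print(comparison)
```

Restricting both fits to the same complete cases matters here: if each model dropped its own missing rows, the two fits could be based on different numbers of observations and `anova` would refuse to compare them.
-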
The following code plots a histogram of a variable from the `trees` example data set, then overlays a density curve:

```r
# compute the histogram and the density estimate
his <- hist(trees$Girth, freq = FALSE)
dens <- density(trees$Girth)
# make the y axis tall enough for both the bars and the curve
ymax <- max(his$density, dens$y)
plot(his, freq = FALSE, ylim = c(0, ymax), xlab = "Girth", main = "Histogram of Girth")
lines(dens)
```

This code is provided in the file `Practical_5_Starter_Code.R` on the course website, along with the other code chunks displayed in this worksheet. Run the code to try it out. Use this starter code to create a function enabling you to make this type of plot for any variable `x`, with a custom label for the x axis and a custom title.

Use your new function to re-create the plot for the `Girth` variable from the `trees` data. Then use it again to create a similar plot for the `Height` variable, making sure you can update the label for the x axis and the title using your function!
-
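The general pattern of turning a plotting script into a function can be sketched on a different plot, so as not to give the exercise away. The function `plot_smooth` and its argument names are hypothetical, and the built-in `cars` data is used only for illustration.

```r
# Hypothetical helper: scatterplot with a lowess smoother overlaid.
# Anything that was hard-coded in the script becomes an argument.
plot_smooth <- function(x, y, xlab = "x", ylab = "y", main = "") {
  sm <- lowess(x, y)                 # smoothed curve through the points
  plot(x, y, xlab = xlab, ylab = ylab, main = main)
  lines(sm)
  invisible(sm)                      # return the smoother without printing it
}

sm <- plot_smooth(cars$speed, cars$dist,
                  xlab = "Speed (mph)",
                  ylab = "Stopping distance (ft)",
                  main = "Stopping distance vs. speed")
```

Giving the labels default values keeps simple calls short, while still allowing every label to be overridden.
-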
The following code creates a ggplot version of a plot from the R Orientation notes:

```r
library(ggplot2)
mycol <- c("blue", "orange", "green")
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_manual(values = mycol) +
  xlab("Petal Length") +
  ylab("Petal Width") +
  theme_light() +
  theme(legend.key = element_blank()) # remove boxes in legend
```

Use this code to create a function enabling you to make this type of plot for x and y variables from any data set, with any grouping variable and with custom colours. Put the call to `library(ggplot2)` at the start of your function to ensure this package is loaded when the function is called. Use `aes_string` to specify the aesthetics. Set the default value for the point colours to the colours used in the starter code.

Use your new function to re-create the plot of `Petal.Length` vs. `Petal.Width`. Then use it to create a similar plot for `Sepal.Length` vs. `Sepal.Width`. Try changing the colours.
-
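To show how `aes_string` differs from `aes`, here is a minimal sketch using the built-in `mtcars` data. The variable names `xvar`, `yvar` and `group` are hypothetical; the point is only that the aesthetics are supplied as character strings, which is what makes them easy to pass in as function arguments.

```r
library(ggplot2)

# Column names (or expressions) held as character strings
xvar <- "wt"
yvar <- "mpg"
group <- "factor(cyl)"

# aes_string() builds the mapping from the strings
p <- ggplot(mtcars, aes_string(x = xvar, y = yvar, color = group)) +
  geom_point()
print(p)
```

In more recent ggplot2 releases `aes_string` is deprecated and may give a warning; the documented replacement is `aes()` with the `.data` pronoun, e.g. `aes(x = .data[[xvar]])`.
-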
The football data presented in the talk was originally stored in text files with fixed width columns. Some example files are given on the course website: “2008-9.txt” and “2009-10.txt”. Open one of the files (e.g. in your web browser or in a text editor) to look at the format. This is a fixed width format: each column has 3 characters and spaces delimit the columns. It is tricky to read in because the missing values are given as 3 spaces.
This data can be read in with `import`, but in this case it is easier to use the `read_table` function from readr:

```r
library(readr)
library(dplyr)
read_table("2008-9.txt") %>%
  rename(Home = X1)
```

Since rio depends on readr you will already have readr installed. As there is no column name for the first column, readr names this `X1`; the above code renames the column as `Home`, since it represents the home team.

The scores for each game are spread across multiple columns, with one column per away team. Load the tidyr package, then extend the data pipeline above to gather the scores into a column named `Score`, keyed by a variable named `Away` for the away team. When the home team and away team are the same, the score is an empty character string `""`. Continue the pipeline to filter out these values. Finally separate the score into two new variables, `Home Score` and `Away Score`.

Create a new function to run the whole data pipeline, from importing the data to separating the scores. Load readr, tidyr and dplyr inside the function. Write the function with one argument that allows you to change the name of the data file. Test your function on the 2008-9 season data.
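The gather, filter and separate steps described above can be sketched on a small made-up table. The team names, the scores, the `"-"` score separator and the new column names `HomeGoals` and `AwayGoals` are all assumptions for illustration; check the real files for the actual layout.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy results (not the course data): rows are home teams,
# columns are away teams, "" marks a team "playing itself"
results <- data.frame(
  Home = c("AAA", "BBB", "CCC"),
  AAA = c("", "1-0", "2-2"),
  BBB = c("3-1", "", "0-0"),
  CCC = c("0-2", "1-1", ""),
  stringsAsFactors = FALSE
)

long <- results %>%
  # one row per home/away pair
  gather(key = "Away", value = "Score", -Home) %>%
  # drop the empty diagonal entries
  filter(Score != "") %>%
  # split "h-a" into two integer columns
  separate(Score, into = c("HomeGoals", "AwayGoals"), sep = "-", convert = TRUE)

print(long)
```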
Load the purrr package. Use `map_df` to map a list of the names of the example football files to the argument of your new function, so that each data file is read in, processed and added to a combined data frame. Name the elements of the list passed to `map_df`, so that you can use the `.id` argument of `map_df` to add a column identifying the data for each season.
Optional extra - for those of you that are still keen!
-
Load the knitr and forcats packages. If you did not do the extra activity in Practical 1, you may need to install forcats.
We are going to create some frequency tables of variables in the `infant` data. The `smoke` variable contains some missing values and we would like to include these in the table. The `fct_explicit_na` function in forcats will expand the levels of a factor to create a new level for the missing values; see the help file for more detail.

Start a data pipeline with the `infant` data, then use `mutate` to create a new factor based on `smoke`, with an extra level for the missing values. Continuing the pipeline, group by your new factor, then summarise each group by counting the number of values with the function `n`. End your pipeline with a call to `kable`, to create a markdown version of the frequency table, in which the first column is left-aligned, the second column is centre-aligned and the columns are given the labels `"Smoking history"` and `"Count"`.

Create a function from your data pipeline so that you can create a kable for any variable, with a custom label for the category column. Use the `rename` trick to be able to use a character string to specify the variable to be tabulated. Use your function to recreate the table for the smoking category.

Use `pmap` to parallel map values for the two arguments to your function as follows:

| var | label |
|---|---|
| "smoke" | "Smoking history" |
| "time" | "Time since quitting" |
| "number" | "Cigarettes/day" |
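As a sketch of how `pmap` walks through several arguments in parallel, the example below uses a hypothetical function `describe` in place of the table-making function. Only the argument names `var` and `label` and their values are taken from the table above.

```r
library(purrr)

# One list element per function argument; names must match the argument names
args <- list(
  var = c("smoke", "time", "number"),
  label = c("Smoking history", "Time since quitting", "Cigarettes/day")
)

# Hypothetical stand-in for the table-making function
describe <- function(var, label) {
  paste0(label, " (column '", var, "')")
}

# Calls describe() once per position: first values together, then second, ...
out <- pmap(args, describe)
print(out)
```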