Lab session 5
DRY Practical
Heather Turner
28 September 2016
-
Download infant_tidy.rds from the course website and import it. This is a tidied version of the infant data that we used in the Data Handling practical.
Use the complete.cases function (from the stats package) to create a logical vector indicating cases (rows) with complete data (no missing values).
Fit a linear model regressing the infant birth weight, bwt, on gestation and the maternal characteristics wt, ht, parity and race, using the subset argument to fit the model to the complete cases only.
Use update to add the smoking variables smoke, time and number. Compare the two models using anova.
-
The following code plots a histogram of a variable from the trees example data set, then overlays a density curve:
    his <- hist(trees$Girth, freq = FALSE)
    dens <- density(trees$Girth)
    ymax <- max(his$density, dens$y)
    plot(his, freq = FALSE, ylim = c(0, ymax), xlab = "Girth", main = "Histogram of Girth")
    lines(dens)
This code is provided in the file Practical_5_Starter_Code.R on the course website, along with the other code chunks displayed in this worksheet. Run the code to try it out. Use this starter code to create a function enabling you to make this type of plot for any variable x, with a custom label for the x axis and a custom title.
Use your new function to re-create the plot for the Girth variable from the trees data. Then use it again to create a similar plot for the Height variable - make sure you can update the label for the x axis and the title using your function!
-
The following code creates a ggplot version of a plot from the R Orientation notes:
    library(ggplot2)
    mycol <- c("blue", "orange", "green")
    ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
        geom_point() +
        scale_color_manual(values = mycol) +
        xlab("Petal Length") +
        ylab("Petal Width") +
        theme_light() +
        theme(legend.key = element_blank()) # remove boxes in legend
Use this code to create a function enabling you to make this type of plot for x and y variables from any data set, with any grouping variable and with custom colours. Put the call to library(ggplot2) at the start of your function to ensure this package is loaded when the function is called. Use aes_string to specify the aesthetics. Set the default value for the point colours to the colours used in the starter code.
Use your new function to re-create the plot of Petal.Length vs. Petal.Width. Then use it to create a similar plot for Sepal.Length vs. Sepal.Width. Try changing the colours.
-
The football data presented in the talk was originally stored in text files with fixed width columns. Some example files are given on the course website: “2008-9.txt” and “2009-10.txt”. Open one of the files (e.g. in your web browser or in a text editor) to look at the format. This is a fixed width format: each column has 3 characters and spaces delimit the columns. It is tricky to read in because the missing values are given as 3 spaces.
This data can be read in with import, but in this case it is easier to use the read_table function from readr:
    library(readr)
    library(dplyr)
    read_table("2008-9.txt") %>%
        rename(Home = X1)
Since rio depends on readr you will already have readr installed. As there is no column name for the first column, readr names this X1; the above code renames the column as Home, since it represents the home team.
The scores for each game are spread across multiple columns, with one column per away team. Load the tidyr package, then extend the data pipeline above to gather the scores into a column named Score, keyed by a variable named Away for the away team. When the home team and away team are the same, the score is an empty character string "". Continue the pipeline to filter out these values. Finally separate the score into two new variables, Home Score and Away Score.
Create a new function to run the whole data pipeline, from importing the data to separating the scores. Load readr, tidyr and dplyr inside the function. Write the function with one argument that allows you to change the name of the data file. Test your function on the 2008-9 season data.
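As a rough sketch of the gather, filter and separate steps, the code below runs them on a small made-up table rather than on the course files. The team names, the scores, the assumption that each score is written as "home-away" (e.g. "1-0"), and the column names HomeScore and AwayScore are all illustrative choices, not taken from the data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for one season's table (invented teams and scores)
scores <- data.frame(
    Home = c("AAA", "BBB", "CCC"),
    AAA = c("", "1-0", "2-2"),
    BBB = c("3-1", "", "0-1"),
    CCC = c("0-0", "2-1", ""),
    stringsAsFactors = FALSE
)

tidy_scores <- scores %>%
    # one row per home/away pairing, scores collected in a single column
    gather(key = "Away", value = "Score", -Home) %>%
    # drop the cells where the home and away team are the same
    filter(Score != "") %>%
    # split "h-a" into two integer columns (assumed score format)
    separate(Score, into = c("HomeScore", "AwayScore"), sep = "-", convert = TRUE)

tidy_scores
```

In your own function the toy data frame would be replaced by the read_table and rename steps, with the file name passed in as the argument.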
Load the purrr package. Use map_df to map a list of the names of the example football files to the argument of your new function, so that each data file is read in, processed and added to a combined data frame. Name the elements of the list passed to map_df, so that you can use the .id argument of map_df to add a column identifying the data for each season.
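The way a named list interacts with the .id argument can be seen in a minimal sketch like the one below. Here read_season is a hypothetical stand-in for your pipeline function: it does not read any file, it just returns a one-row data frame so that the example runs without the course data.

```r
library(purrr)

# Hypothetical stand-in for the function written in the previous step:
# it returns a tiny data frame instead of reading and tidying a file
read_season <- function(file) {
    data.frame(file = file, n_char = nchar(file), stringsAsFactors = FALSE)
}

# The names of the list elements become the values of the .id column
files <- list("2008-9" = "2008-9.txt", "2009-10" = "2009-10.txt")

all_seasons <- map_df(files, read_season, .id = "Season")
all_seasons
```

Each element of files is passed to read_season in turn, and the results are row-bound into one data frame with an extra Season column taken from the list names.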
Optional extra - for those of you that are still keen!
-
Load the knitr and forcats packages. If you did not do the extra activity in Practical 1, you may need to install forcats.
We are going to create some frequency tables of variables in the infant data. The smoke variable contains some missing values and we would like to include these in the table. The fct_explicit_na function in forcats will expand the levels of a factor to create a new level for the missing values; see the help file for more detail.
Start a data pipeline with the infant data, then use mutate to create a new factor based on smoke, with an extra level for the missing values. Continuing the pipeline, group by your new factor, then summarise each group by counting the number of values with the function n. End your pipeline with a call to kable, to create a markdown version of the frequency table, in which the first column is left-aligned, the second column is centre-aligned and the columns are given the labels "Smoking history" and "Count".
Create a function from your data pipeline so that you can create a kable for any variable, with a custom label for the category column. Use the rename trick to be able to use a character string to specify the variable to be tabulated. Use your function to recreate the table for the smoking category.
Use pmap to parallel map values for the two arguments to your function as follows:
    var         label
    "smoke"     "Smoking history"
    "time"      "Time since quitting"
    "number"    "Cigarettes/day"
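The parallel mapping over the two columns above might look something like the following sketch. The function make_table is a hypothetical placeholder that only reports its arguments; in the exercise it would be your kable-producing function, assumed here to take arguments named var and label.

```r
library(purrr)

# Hypothetical stand-in for the table function: it only reports its arguments
make_table <- function(var, label) {
    paste0("table of '", var, "' labelled '", label, "'")
}

# One list element per argument; the element names are matched to the
# argument names of the function, and the vectors are stepped through together
args <- list(
    var = c("smoke", "time", "number"),
    label = c("Smoking history", "Time since quitting", "Cigarettes/day")
)

tables <- pmap(args, make_table)
tables[[1]]
```

pmap returns a list with one result per row of the table above, so with a real table function each element would hold one kable.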