Lab session 1
Data Handling Practical
Heather Turner
12 September 2016
-
Create a new folder for this project. Download infant.xlsx from the course website and put it in this folder. In RStudio, create a new project from the folder you created. Create a new (empty) R script in RStudio and save it as “Infant.R” in the project folder.
-
Use import() from the rio package to import the infant.xlsx data file. Load the dplyr package and convert the imported data frame to a tibble.
-
The data are from a Child Health and Development Study, corresponding to live births of a single male foetus. Since the data are a subset, some variables are redundant and we shall not consider characteristics of the father here. Use select() to create a new data frame without the variables id to outcome, the variable sex, and the variables drace to dwt. Assign the result to infant2. Use colnames() (from the base package) on infant2 to check the result.
In the following steps we will operate on infant2 to obtain the variables we wish to work with. Unless directed otherwise, update the data frame rather than creating a new one. That is to say, operate on infant2 and assign the result also to infant2; in pseudocode,
    infant2 <- operate(infant2, an_argument, another_argument)
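The import, select and reassign pattern described above can be sketched on a small invented data set. This is only an illustration under stated assumptions: the toy columns, their order and their values are made up (the real infant.xlsx has more variables), and a temporary CSV file stands in for the Excel file so that the example is self-contained.

```r
library(dplyr)

# Toy stand-in for the real data: column set, order and values are invented.
toy <- data.frame(
  id = 1:3, outcome = c(1, 1, 1), gestation = c(280, 190, 300),
  sex = c(1, 1, 1), wt = c(120, 999, 135),
  drace = c(0, 8, 7), dwt = c(180, 160, 999)
)

# Write the toy data to a temporary file so that rio::import() has something to read.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Import the file and convert the resulting data frame to a tibble.
toy <- as_tibble(rio::import(path))

# Drop a range of columns, a single column, and a second range in one call.
toy2 <- select(toy, -(id:outcome), -sex, -(drace:dwt))

# Check which columns remain.
colnames(toy2)
```

A minus sign in front of a column name or a bracketed range tells select() to drop those columns and keep everything else.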
-
The variable gestation gives the length of the pregnancy in days. Filter the data to exclude extremely premature babies (gestation less than 28 weeks) and extremely late babies (gestation more than 52 weeks).
-
The value 999 has been used to code an unknown value in the wt variable. The replace() function (in the base package) can be used to replace indicated values in a vector, e.g.
    replace(x, is.na(x), 0)
takes the variable x and replaces the elements where x is missing with zero. Using mutate(), update the wt variable in infant2, using replace() to take the original wt and replace the elements where wt is equal to 999 with NA.
Use plot() (from base graphics) to plot the child’s weight bwt against the mother’s pre-pregnancy weight wt. Use the data argument to use your updated data frame.
-
The cut() function (in the base package) can be used to create a factor from a continuous variable, e.g.
    cut(x, breaks = c(0, 1, 2, 3))
cuts x into categories \(0 < x \le 1\), \(1 < x \le 2\), and \(2 < x \le 3\). The infant birth weight bwt is given in ounces. Using mutate(), create a new factor based on the birth weight variable, by first converting bwt (approximately) to grams through multiplication by 28.35 and then converting the result to a factor with the following categories: \(1500 < x \le 2000\), \(2000 < x \le 2500\), \(2500 < x \le 3000\), \(3000 < x \le 3500\), and \(3500 < x \le 5000\).
-
Re-write the code for steps 3 to 6 using chaining to create a data frame with all the updates in one go. In other words, create a data pipeline starting with the infant data, piping to select (step 3), then filter (step 4), then mutate (steps 5 and 6). Remember that when using the pipe operator you do not need to specify the first argument, as the pipe automatically passes the data to this argument.
-
An infant is categorised as low weight if its birth weight is \(\le 2500\) grams, regardless of gestation. Pipe the data set created in step 7 to group_by to group the data by the weight factor, then pipe to summarise to count the number of infants in each weight category.
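The chain of filter, mutate, group_by and summarise used in the steps above can be sketched on a few invented rows. This is a rough illustration, not a model answer: the values are made up, only three of the real variables appear, and the names bwt_cat and weight_counts are chosen here for the example. The dig.lab argument of cut() is an optional extra that keeps the category labels in plain digits.

```r
library(dplyr)

# Invented rows: gestation in days, wt with 999 coding unknown, bwt in ounces.
toy <- tibble(
  gestation = c(150, 270, 284, 400, 290),
  wt        = c(110, 999, 125, 140, 130),
  bwt       = c(60, 85, 120, 130, 141)
)

weight_counts <- toy %>%
  # Keep pregnancies of between 28 and 52 weeks, expressed in days.
  filter(gestation >= 28 * 7, gestation <= 52 * 7) %>%
  mutate(
    # Recode the 999 placeholder as a missing value.
    wt = replace(wt, wt == 999, NA),
    # Convert ounces to grams (approximately), then bin into weight bands.
    bwt_cat = cut(bwt * 28.35,
                  breaks = c(1500, 2000, 2500, 3000, 3500, 5000),
                  dig.lab = 4)
  ) %>%
  # One row per weight band, with the number of infants in it.
  group_by(bwt_cat) %>%
  summarise(n = n())

weight_counts
```

Each function receives the result of the previous step as its first argument, so the data frame is named only once, at the top of the pipeline.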
Extra time (optional — if you’re keen!)
-
Install the forcats package.
-
The smoke variable is numeric, but the values correspond to categories as follows:
0 = never
1 = smokes now
2 = until current pregnancy
3 = once did, not now
9 = unknown
The fct_collapse() function in forcats collapses factor levels into groups, e.g.
    fct_collapse(race, NULL = "99", white = as.character(0:5), mex = "6", black = "7", asian = "8", mixed = "9")
Naming “99” as NULL means the level will be dropped and the values set to NA. Update the dataset created in step 7, using mutate to first convert smoke into a factor (with levels c("0", "1", "2", "3", "9")) and then collapse the levels into nonsmoker (code 0) and smoker (codes 1 to 3), setting unknown to missing.
-
Pipe the data set created in step 10 to group_by to group the data by both the birth weight factor and the smoking factor, then pipe to summarise to count the number of infants in each crossed category.
We can create a neater table by filtering out the NAs and then using spread from tidyr to spread the counts along columns defined by one of the factors. Look at ?spread to see if you can work out how to do this!
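For anyone who gets stuck on the optional steps, the sketch below shows the general shape of collapsing levels, counting crossed categories and spreading the counts, using invented data. The two-level bwt_cat factor, the values and the object names crossed and wide are assumptions made for this example only. Note that in recent versions of tidyr spread() is superseded by pivot_wider(), although it remains available.

```r
library(dplyr)
library(forcats)
library(tidyr)

# Invented data: a made-up weight factor and numeric smoking codes.
toy <- tibble(
  bwt_cat = factor(c("low", "low", "normal", "normal", "normal", "low")),
  smoke   = c(0, 1, 2, 9, 0, 3)
)

crossed <- toy %>%
  mutate(
    # Convert the numeric codes to a factor with explicit levels.
    smoke = factor(smoke, levels = c("0", "1", "2", "3", "9")),
    # Collapse the levels into two groups; the level named NULL becomes NA.
    smoke = fct_collapse(smoke, NULL = "9", nonsmoker = "0",
                         smoker = c("1", "2", "3"))
  ) %>%
  # Count the rows in each combination of the two factors.
  group_by(bwt_cat, smoke) %>%
  summarise(n = n()) %>%
  ungroup()

# Drop the missing smoking category, then give each smoking level its own column.
wide <- crossed %>%
  filter(!is.na(smoke)) %>%
  spread(smoke, n)

wide
```

In the call to spread(), the first argument after the data names the column whose values become the new column headings, and the second names the column that supplies the cell values.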