Data Handling Practical
12 September 2016
1. Create a new folder for this project. Download infant.xlsx from the course website and put it in this folder. In RStudio, create a new project from the folder you created. Create a new (empty) R script in RStudio and save it as “Infant.R” in the project folder.
2. Use import() from the rio package to import the infant.xlsx data file. Load the dplyr package and convert the imported data frame to a tibble.
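As a rough sketch of the import-then-convert pattern, the snippet below round-trips a few rows of a built-in data set through a temporary CSV file and converts the result to a tibble. The temporary file stands in for infant.xlsx (which is not assumed to be available here), and the object name dat is purely illustrative.

```r
library(rio)    # import() and export()
library(dplyr)  # as_tibble() is re-exported by dplyr

# Write a few rows of a built-in data set to a temporary file
# (a stand-in for the course data file)
tmp <- tempfile(fileext = ".csv")
export(head(iris), tmp)

# import() picks a reader based on the file extension
dat <- import(tmp)

# Convert the plain data frame to a tibble
dat <- as_tibble(dat)
class(dat)
```

For the practical itself, the same two calls would be applied to the downloaded file instead of the temporary one.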
3. The data are from a Child Health and Development Study, corresponding to live births of a single male foetus. Since the data are a subset, some variables are redundant and we shall not consider characteristics of the father here. Use select() to create a new data frame without the variables dwt. Assign the result to infant2. Use colnames() (from the base package) on infant2 to check the result.
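Dropping columns with select() might look like the following minimal sketch. The data frame toy and its values are made up for illustration; only the column name dwt is taken from the exercise.

```r
library(dplyr)

# Made-up data; only the column name dwt comes from the exercise
toy <- data.frame(id  = 1:3,
                  bwt = c(110, 120, 125),
                  dwt = c(180, 175, 190))

# A minus sign drops the named column; several columns can be
# dropped at once with -c(col1, col2)
toy2 <- select(toy, -dwt)

# Check which columns remain
colnames(toy2)
```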
In the following steps we will operate on infant2 to obtain the variables we wish to work with. Unless directed otherwise, update the data frame rather than creating a new one. That is to say, operate on infant2 and assign the result back to infant2; in pseudocode,
infant2 <- operate(infant2, an_argument, another_argument)
4. The variable gestation gives the length of the pregnancy in days. Filter the data to exclude extremely premature babies (gestation less than 28 weeks) and extremely late babies (gestation more than 52 weeks).
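Because gestation is recorded in days while the limits are stated in weeks, the limits need converting before filtering. A small sketch on made-up values (the data frame toy is hypothetical):

```r
library(dplyr)

# Made-up gestation lengths in days
toy <- data.frame(id = 1:5,
                  gestation = c(180, 200, 280, 364, 400))

# Week limits are multiplied by 7 to match the unit of gestation;
# conditions separated by commas must all hold for a row to be kept
toy <- filter(toy, gestation >= 28 * 7, gestation <= 52 * 7)

toy$gestation
```

Here 28 weeks is 196 days and 52 weeks is 364 days, so the rows with 180 and 400 days are dropped.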
5. The value 999 has been used to code an unknown value in the wt variable. The replace() function (in the base package) can be used to replace indicated values in a vector, e.g.
replace(x, is.na(x), 0)
takes the variable x and replaces the elements where x is missing with zero. Using mutate(), update the wt variable, using replace to take the original wt and replace the elements where wt is equal to 999 with NA.
Use plot() (from base graphics) to plot the child’s weight bwt against the mother’s pre-pregnancy weight wt. Use the data argument to supply your updated data frame.
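The recoding and the plot can be tried out on a tiny made-up data frame first. In this sketch the object toy and its values are hypothetical, and the formula interface to plot() is used because it is the form that accepts a data argument.

```r
library(dplyr)

# Made-up weights; 999 stands for "unknown" as in the exercise
toy <- data.frame(wt  = c(120, 999, 135, 110),
                  bwt = c(115, 120, 130, 108))

# replace(values, condition, replacement): overwrite wt where it equals 999
toy <- mutate(toy, wt = replace(wt, wt == 999, NA))

# Formula interface: y ~ x, with the variables looked up in `data`
plot(bwt ~ wt, data = toy)
```

Rows with a missing wt are simply left off the scatter plot.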
6. The cut() function (in the base package) can be used to create a factor from a continuous variable, e.g.
cut(x, breaks = c(0, 1, 2, 3))
cuts x into categories \(0 < x \le 1\), \(1 < x \le 2\), and \(2 < x \le 3\). The infant birth weight bwt is given in ounces. Using mutate, create a new factor based on the birth weight variable, by first converting bwt (approximately) to grams through multiplication by 28.35 and then converting the result to a factor with the following categories: \(1500 < x \le 2000\), \(2000 < x \le 2500\), \(2500 < x \le 3000\), \(3000 < x \le 3500\), and \(3500 < x \le 5000\).
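A minimal sketch of the unit conversion followed by cut(), using made-up birth weights in ounces. The names toy and bwt_cat are illustrative choices, not names prescribed by the exercise.

```r
library(dplyr)

# Made-up birth weights in ounces
toy <- data.frame(bwt = c(60, 80, 100, 120, 140))

# Convert to grams, then bin; by default each interval is open on the
# left and closed on the right, e.g. 1500 < x <= 2000
toy <- mutate(toy,
              bwt_cat = cut(bwt * 28.35,
                            breaks = c(1500, 2000, 2500, 3000, 3500, 5000)))

# One row per category in this toy example
table(toy$bwt_cat)
```

Any value outside the outermost breaks would become NA, which is worth checking for on the real data.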
7. Rewrite the code for steps 3 to 6 using chaining to create a data frame with all the updates in one go. In other words, create a data pipeline starting with the infant data, piping to select (step 3), then filter (step 4), then mutate (steps 5 and 6). Remember that when using the pipe operator you do not need to specify the first argument, as the pipe automatically passes the data to this argument.
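The shape of such a pipeline is sketched below on a made-up data frame. The column names follow the exercise, but toy, toy2, bwt_cat and all the values are hypothetical, and only dwt is dropped here for brevity.

```r
library(dplyr)

# Made-up stand-in for the imported data
toy <- data.frame(id        = 1:5,
                  gestation = c(180, 270, 280, 290, 400),
                  wt        = c(120, 999, 135, 110, 150),
                  bwt       = c(60, 100, 120, 110, 130),
                  dwt       = c(180, 170, 999, 160, 175))

# Each verb receives the result of the previous one as its first argument
toy2 <- toy %>%
  select(-dwt) %>%
  filter(gestation >= 28 * 7, gestation <= 52 * 7) %>%
  mutate(wt = replace(wt, wt == 999, NA),
         bwt_cat = cut(bwt * 28.35,
                       breaks = c(1500, 2000, 2500, 3000, 3500, 5000)))

toy2
```

Writing the steps as one chain avoids the repeated reassignment to an intermediate data frame.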
8. An infant is categorised as low weight if its birth weight is \(\le 2500\) grams, regardless of gestation. Pipe the data set created in step 7 to group_by to group the data by the weight factor, then pipe to summarise to count the number of infants in each weight category.
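Counting rows per group can be sketched as follows, using n() inside summarise(). The weights in grams and the names toy, bwt_cat and counts are made up for illustration.

```r
library(dplyr)

# Made-up birth weights in grams, already binned into a factor
toy <- data.frame(bwt_g = c(1800, 2300, 2400, 3100, 3200, 3300))
toy <- mutate(toy,
              bwt_cat = cut(bwt_g,
                            breaks = c(1500, 2000, 2500, 3000, 3500, 5000)))

# n() returns the number of rows in the current group
counts <- toy %>%
  group_by(bwt_cat) %>%
  summarise(n = n())

counts
```

In this toy example the two lowest categories together hold the low-weight infants.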
Extra time (optional — if you’re keen!)
9. Install the forcats package.
10. The smoke variable is numeric, but its values correspond to categories as follows:
0 = never
1 = smokes now
2 = until current pregnancy
3 = once did, not now
9 = unknown
The fct_collapse function in forcats collapses factor levels into groups, e.g.
fct_collapse(race, NULL = "99", white = as.character(0:5), mex = "6", black = "7", asian = "8", mixed = "9")
Naming “99” as NULL means the level will be dropped and the values set to NA. Update the dataset created in step 7, using mutate to first convert smoke into a factor (with levels c("0", "1", "2", "3", "9")) and then collapse the levels into nonsmoker (code 0) and smoker (codes 1 to 3), setting unknown to missing.
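One way this conversion could look on made-up smoking codes is sketched below. The data frame toy is hypothetical, and the level names nonsmoker and smoker follow the wording of the exercise.

```r
library(dplyr)
library(forcats)

# Made-up smoking codes, including one unknown (9)
toy <- data.frame(smoke = c(0, 1, 2, 3, 9, 0))

toy <- toy %>%
  mutate(
    # Numeric codes are matched against the character levels
    smoke = factor(smoke, levels = c("0", "1", "2", "3", "9")),
    # The level named NULL is dropped, so its values become NA
    smoke = fct_collapse(smoke,
                         NULL      = "9",
                         nonsmoker = "0",
                         smoker    = c("1", "2", "3"))
  )

# useNA = "ifany" also shows the count of missing values
table(toy$smoke, useNA = "ifany")
```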
11. Pipe the data set created in step 10 to group_by to group the data by both the birth weight factor and the smoking factor, then pipe to summarise to count the number of infants in each crossed category.
12. We can create a neater table by filtering out the NAs and then using spread from tidyr to spread the counts along columns defined by one of the factors. Look at ?spread to see if you can work out how to do this!
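For comparison after attempting this, here is one possible sketch on made-up data: count the crossed categories, drop the rows with a missing smoking status, and spread the counts into one column per smoking level. The names toy, counts and wide and the category labels are hypothetical. Newer versions of tidyr mark spread() as superseded by pivot_wider(), although spread() still works.

```r
library(dplyr)
library(tidyr)

# Made-up categories; one smoking status is missing
toy <- data.frame(
  bwt_cat = c("low", "low", "low", "normal", "normal", "normal"),
  smoke   = c("smoker", "smoker", "nonsmoker", "nonsmoker", NA, "nonsmoker")
)

# Count infants in each combination of the two variables
counts <- toy %>%
  group_by(bwt_cat, smoke) %>%
  summarise(n = n()) %>%
  ungroup()

# key: the variable whose values become column names
# value: the variable whose values fill those columns
wide <- counts %>%
  filter(!is.na(smoke)) %>%
  spread(key = smoke, value = n)

wide
```

Combinations that never occur, such as normal-weight smokers in this toy data, appear as NA in the wide table.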