Lab Session 2
Data Handling Practical
Heather Turner
12 September 2017
-
Create a new folder for this practical. Download infant.xlsx from the course website and put it in this folder. In RStudio, create a new project from the folder you created. Create a new (empty) R Markdown file in RStudio and save it as “Infant.Rmd” in the project folder. As you work through the remainder of this practical, create an R chunk for each part and write notes in the .Rmd file on what the code is doing.
-
Use import() from the rio package to import the infant.xlsx data file as a tibble.
-
The data are from a Child Health and Development Study, corresponding to live births of a single male foetus. Since the data are a subset, some variables are redundant and we shall not consider characteristics of the father here. Use select() from dplyr to create a new data frame without the variables id to outcome; sex; and drace to dwt. Assign the result to infant2. Use colnames() (from the base package) on infant2 to check the result.
In the following steps we will operate on infant2 to obtain the variables we wish to work with. Unless directed otherwise, update the data frame rather than creating a new one. That is to say, operate on infant2 and assign the result also to infant2; in pseudocode,
infant2 <- operate(infant2, an_argument, another_argument)
-
The variable gestation gives the length of the pregnancy in days. Filter the data to exclude extremely premature babies (gestation less than 28 weeks) and extremely late babies (gestation more than 52 weeks). Note that these thresholds are stated in weeks, whereas gestation is recorded in days.
-
The value 999 has been used to code an unknown value in the wt variable. The replace() function (in the base package) can be used to replace indicated values in a vector, e.g.
replace(x, is.na(x), 0)
takes the variable x and replaces the elements where x is missing with zero. Using mutate(), update the wt variable in infant2, using replace() to take the original wt and replace the elements where wt is equal to 999 with NA.
Use plot() (from base graphics) to plot the child’s weight bwt against the mother’s pre-pregnancy weight wt. Use the data argument to use your updated data frame.
-
The cut() function (in the base package) can be used to create a factor from a continuous variable, e.g.
cut(x, breaks = c(0, 1, 2, 3))
cuts x into categories 0 < x ≤ 1, 1 < x ≤ 2, and 2 < x ≤ 3. The infant birth weight bwt is given in ounces. Using mutate(), create a new factor based on the birth weight variable, by first converting bwt (approximately) to grams through multiplication by 28.35 and then converting the result to a factor with the following categories: (1500, 2000], (2000, 2500], (2500, 3000], (3000, 3500], and (3500, 5000].
-
Rewrite the code for steps 3 to 6 using chaining to create a data frame with all the updates in one go. In other words, create a data pipeline starting with the infant data, piping to select() (step 3), then filter() (step 4), then mutate() (steps 5 and 6). Remember that when using the pipe operator you do not need to specify the first argument, as the pipe automatically passes the data to this argument.
-
An infant is categorised as low weight if its birth weight is ≤ 2500 grams, regardless of gestation. Pipe the data set created in step 7 to group_by() to group the data by the weight factor, then pipe to summarise() to count the number of infants in each weight category.
-
Try knitting your .Rmd file to give a basic report of your work!
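The data-handling techniques used in the steps above (dropping a range of columns, filtering rows, recoding with replace() and cut(), then counting by group) can be sketched in one pipeline. This is only an illustration on a small made-up tibble, not the infant data: the object and column names (toy, batch, weeks, mother_wt, baby_oz, baby_g_cat) and the break points are assumptions chosen for the example.

```r
library(dplyr)

# Made-up toy data (NOT the infant data set).
# 999 is used here as a code for "unknown" in mother_wt.
toy <- tibble(
  id        = 1:6,
  batch     = c("a", "a", "b", "b", "c", "c"),
  weeks     = c(25, 38, 40, 41, 39, 55),
  mother_wt = c(120, 999, 135, 110, 999, 150),
  baby_oz   = c(60, 100, 115, 125, 80, 130)
)

toy_clean <- toy %>%
  # Drop a contiguous range of columns (id up to batch)
  select(-(id:batch)) %>%
  # Keep only rows whose weeks value lies in [28, 52]
  filter(weeks >= 28, weeks <= 52) %>%
  mutate(
    # Recode the 999 "unknown" code as a missing value
    mother_wt = replace(mother_wt, mother_wt == 999, NA),
    # Convert ounces to grams, then bin into right-closed intervals
    baby_g_cat = cut(baby_oz * 28.35,
                     breaks = c(1500, 2500, 3500, 5000),
                     dig.lab = 4)
  )

# Count the rows in each category of the new factor
toy_counts <- toy_clean %>%
  group_by(baby_g_cat) %>%
  summarise(n = n())

colnames(toy_clean)
toy_counts
```

By default cut() formats break labels to three significant digits, so four-digit breaks such as 1500 would be shown in scientific notation; dig.lab = 4 is used here to keep the labels in the form (1500,2500]. Each verb after the first takes the piped data as its first argument, which is why no data frame is named inside select(), filter(), or mutate().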
Extra time (optional — if you’re keen!)
-
Install the forcats package.
-
The smoke variable is numeric, but the values correspond to categories as follows:
0 = never
1 = smokes now
2 = until current pregnancy
3 = once did, not now
9 = unknown
The fct_collapse() function in forcats collapses factor levels into groups, e.g.
fct_collapse(race, NULL = "99", white = as.character(0:5), mex = "6", black = "7", asian = "8", mixed = "9")
Naming “99” as NULL means the level will be dropped and the values set to NA. Update the dataset created in step 7, using mutate() to first convert smoke into a factor (with levels c("0", "1", "2", "3", "9")) and then collapse the levels into nonsmoker (code 0) and smoker (codes 1 to 3), setting unknown to missing.
-
Pipe the data set created in step 11 to group_by() to group the data by both the birth weight factor and the smoking factor, then pipe to summarise() to count the number of infants in each crossed category.
We can create a neater table by filtering out the NAs and then using spread() from tidyr to spread the counts along columns defined by one of the factors. Look at ?spread to see if you can work out how to do this!
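As a rough illustration of the extra steps, the sketch below collapses the levels of a coded variable, counts a two-way grouping, and spreads the counts into columns. It uses a made-up tibble rather than the infant data; the names toy, size, status, no, and yes are assumptions for the example, and the NULL = "9" argument follows the fct_collapse() usage described above.

```r
library(dplyr)
library(forcats)
library(tidyr)

# Made-up toy data (NOT the infant data set).
# status is a numeric code; 9 stands for "unknown".
toy <- tibble(
  size   = factor(c("small", "small", "large", "large", "large", "small")),
  status = c(0, 1, 2, 9, 0, 3)
)

toy_table <- toy %>%
  mutate(
    # Convert the numeric codes to a factor with explicit levels
    status = factor(status, levels = c("0", "1", "2", "3", "9")),
    # Merge codes 1 to 3, rename code 0, and drop code 9 (values become NA)
    status = fct_collapse(status,
                          NULL = "9",
                          no   = "0",
                          yes  = c("1", "2", "3"))
  ) %>%
  # Count rows in each combination of the two factors
  group_by(size, status) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  # Remove the rows for the missing category before reshaping
  filter(!is.na(status)) %>%
  # One column per status level, filled with the counts
  spread(key = status, value = n)

toy_table
```

Here key names the factor whose levels become the new column names and value names the column holding the counts; the remaining factor (size) identifies the rows of the resulting table.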