Lab session 1

Create a new folder for this project. Download infant.xlsx from the course website and put it in this folder. In RStudio, create a new project from the folder you created. Create a new (empty) R script in RStudio and save as “Infant.R” in the project folder.
Use import() from the rio package to import the infant.xlsx data file. Load the dplyr package and convert the imported data frame to a tibble.
The data are from a Child Health and Development Study, corresponding to live births of a single male foetus. Since the data are a subset, some variables are redundant, and we shall not consider characteristics of the father here. Use select() to create a new data frame without the variables id to outcome, sex, and drace to dwt. Assign the result to infant2. Use colnames() (from the base package) on infant2 to check the result.
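As a rough illustration of dropping a range of columns and a single column with select(), here is a minimal sketch on a toy data frame. The column names a to f are invented for the example and are not from the infant data.

```r
library(dplyr)

# Toy data frame; the column names a to f are invented for illustration
toy <- tibble(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15, f = 16:18)

# A minus sign drops columns: here the range a to b and the single column d
toy2 <- select(toy, -(a:b), -d)

# Check which columns remain
colnames(toy2)
```

The same pattern should carry over to any set of adjacent columns, since a range such as a:b refers to column positions in the data frame.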

In the following steps we will operate on infant2 to obtain the variables we wish to work with. Unless directed otherwise, update the data frame rather than creating a new one. That is to say, operate on infant2 and assign the result also to infant2; in pseudocode,

infant2 <- operate(infant2, an_argument, another_argument)
The variable gestation gives the length of the pregnancy in days. Filter the data to exclude extremely premature babies (gestation less than 28 weeks) and extremely late babies (gestation more than 52 weeks).
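When limits are quoted in different units from the data, it can help to convert the limits first. The sketch below assumes a toy column called days (an invented name) and keeps only the rows between two limits given in weeks.

```r
library(dplyr)

# Toy data; 'days' is an invented column holding a duration in days
toy <- tibble(id = 1:5, days = c(150, 196, 280, 364, 400))

# Convert the week limits to days so they match the units of the column
lower <- 28 * 7  # 196 days
upper <- 52 * 7  # 364 days

# Conditions separated by commas are combined with "and";
# rows below the lower limit or above the upper limit are dropped
toy <- filter(toy, days >= lower, days <= upper)

toy$days
```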
The value 999 has been used to code an unknown value in the wt variable. The replace() function (in the base package) can be used to replace indicated values in a vector, e.g.

replace(x, is.na(x), 0)

takes the variable x and replaces the elements where x is missing with zero. Using mutate(), update the wt variable in infant2, using replace() to take the original wt and replace the elements where wt is equal to 999 with NA.
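A minimal sketch of replace() used inside mutate(), on a toy data frame. The column height and the code -1 for "unknown" are assumptions made up for this example.

```r
library(dplyr)

# Toy data; 'height' is an invented column in which -1 codes "unknown"
toy <- tibble(id = 1:4, height = c(160, -1, 172, -1))

# replace(x, condition, value): overwrite the flagged elements with NA,
# assigning back to the same column name so the variable is updated in place
toy <- mutate(toy, height = replace(height, height == -1, NA))

toy$height
```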

Use plot() (from base graphics) to plot the child’s weight bwt against the mother’s pre-pregnancy weight wt. Supply the variables as a formula (y ~ x) so that you can use the data argument to point plot() at your updated data frame.
The cut() function (in the base package) can be used to create a factor from a continuous variable, e.g.

cut(x, breaks = c(0, 1, 2, 3))

cuts x into categories \(0 < x \le 1\), \(1 < x \le 2\), and \(2 < x \le 3\). The infant birth weight bwt is given in ounces. Using mutate(), create a new factor based on the birth weight variable, by first converting bwt (approximately) to grams through multiplication by 28.35 and then converting the result to a factor with the following categories: \(1500 < x \le 2000\), \(2000 < x \le 2500\), \(2500 < x \le 3000\), \(3000 < x \le 3500\), and \(3500 < x \le 5000\).
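To show the convert-then-cut pattern without using the infant data, here is a small sketch on a toy column of lengths in inches. The names len_in and len_cat and the break points are invented for illustration.

```r
library(dplyr)

# Toy data; 'len_in' is an invented column of lengths in inches
toy <- tibble(len_in = c(10, 12, 20, 30))

# Convert to centimetres (1 inch = 2.54 cm), then cut into the
# right-closed intervals (25, 50] and (50, 100]
toy <- mutate(toy, len_cat = cut(len_in * 2.54, breaks = c(25, 50, 100)))

# The factor levels are labelled by their intervals
levels(toy$len_cat)

# Number of observations falling in each interval
table(toy$len_cat)
```

Values outside the range of the breaks would become NA, so it is worth checking that the breaks cover the data.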
Rewrite the code for steps 3 to 6 using chaining to create a data frame with all the updates in one go. In other words, create a data pipeline starting with the infant data, piping to select (step 3), then filter (step 4), then mutate (steps 5 and 6). Remember that when using the pipe operator (%>%) you do not need to specify the first argument, as the pipe automatically passes the data to this argument.
An infant is categorised as low weight if its birth weight is \(\le 2500\) grams, regardless of gestation. Pipe the data set created in step 7 to group_by to group the data by the weight factor, then pipe to summarise to count the number of infants in each weight category.
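The chaining and counting steps can be sketched together on toy data. The names score, band and n below are invented, and the sketch uses the %>% pipe loaded with dplyr.

```r
library(dplyr)

# Toy data; 'score' is an invented numeric variable
toy <- tibble(score = c(3, 12, 18, 25, 7, 14))

counts <- toy %>%
  # bin the scores into right-closed intervals
  mutate(band = cut(score, breaks = c(0, 10, 20, 30))) %>%
  # one group per level of the new factor
  group_by(band) %>%
  # n() counts the rows in each group
  summarise(n = n())

counts
```

Each verb receives the result of the previous one as its first argument, which is why no data frame is named inside mutate(), group_by() or summarise().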

Extra time (optional — if you’re keen!)

Install the forcats package.
The smoke variable is numeric, but the values correspond to categories as follows:

0 = never

1 = smokes now

2 = until current pregnancy

3 = once did, not now

9 = unknown

The fct_collapse function in forcats collapses factor levels into groups, e.g.
```
fct_collapse(race, NULL = "99", white = as.character(0:5),
             mex = "6", black = "7", asian = "8", mixed = "9")
```
Naming “99” as NULL means the level will be dropped and the values set to NA. Update the dataset created in step 7, using mutate to first convert smoke into a factor (with levels c("0", "1", "2", "3", "9")) and then collapse the levels into nonsmoker (code 0) and smoker (codes 1 to 3), setting unknown to missing.
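The convert-then-collapse pattern can be tried out on a toy vector first. This is a sketch only: the coding (1 and 2 = low, 3 and 4 = high, 9 = unknown) is invented, and it assumes that fct_collapse() treats a group named NULL as described above.

```r
library(forcats)

# Toy numeric codes; the coding scheme here is invented for illustration
code <- c(1, 3, 2, 9, 4, 1)

# Convert to a factor with explicit levels, given as character strings
f <- factor(code, levels = c("1", "2", "3", "4", "9"))

# Collapse levels into groups; the level named NULL is dropped,
# so its values become NA
f2 <- fct_collapse(f, NULL = "9", low = c("1", "2"), high = c("3", "4"))

levels(f2)
table(f2, useNA = "ifany")
```

Setting useNA = "ifany" in table() makes the missing values visible, which is a quick way to confirm that the unknown code was dropped rather than kept as a level.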
Pipe the data set created in step 10 to group_by to group the data by both the birth weight factor and the smoking factor, then pipe to summarise to count the number of infants in each crossed category.

We can create a neater table by filtering out the NAs and then using spread() from tidyr to spread the counts along columns defined by one of the factors. Look at ?spread to see if you can work out how to do this!
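If you get stuck, the following sketch shows the general shape of the approach on toy counts. The names size, colour and n are invented. Note also that in current versions of tidyr, spread() is superseded by pivot_wider(), although spread() is still available.

```r
library(dplyr)
library(tidyr)

# Toy counts for two crossed factors; all names here are invented
counts <- tibble(
  size   = c("small", "small", "large", "large", "large"),
  colour = c("red", "blue", "red", "blue", NA),
  n      = c(4, 2, 1, 5, 3)
)

wide <- counts %>%
  filter(!is.na(colour)) %>%       # drop rows where the key is missing
  spread(key = colour, value = n)  # one column per colour, filled with n

wide
```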

Warwick Data Science Institute

Data Handling Practical

Heather Turner

12 September 2016