# Summarising Data In R

R provides a number of simple summary functions for use on R objects. Typical examples are `length()`, `mean()`, `median()`, `max()`, `min()`, `range()` and `summary()`. For example:

    > x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
    > length(x)
    [1] 15
    > mean(x)
    [1] 5.4
    > median(x)
    [1] 5
    > min(x)
    [1] 1
    > max(x)
    [1] 9
    > range(x)
    [1] 1 9

On simple vectors, `summary()` gives the minimum and maximum, the median, the mean, and the first and third quartiles (which bound the interquartile range):

    > summary(x)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        1.0     3.5     5.0     5.4     7.5     9.0
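The individual numbers reported by `summary()` can also be computed on their own. The following is a minimal sketch, assuming the same vector `x` as above; `quantile()` and `IQR()` are base R functions that the example so far has not used.

```r
x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)

# First and third quartiles, using R's default quantile method
q <- quantile(x, c(0.25, 0.75))
q

# Interquartile range: the distance between the third and first quartiles
IQR(x)
```

The two quartiles should match the `1st Qu.` and `3rd Qu.` entries printed by `summary(x)`.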

The `summary()` command can also be used on data frames and other more complicated objects. For example, here is the Santa Claus example once again:

    > summary(Santa)
      Believe             Age           Gender      Presents    Behaviour
     Mode :logical   Min.   : 4.00   female:23   Min.   : 3.0   bad :26
     FALSE:25        1st Qu.: 5.00   male  :27   1st Qu.:20.0   good:24
     TRUE :25        Median : 7.00               Median :26.5
                     Mean   : 6.86               Mean   :27.0
                     3rd Qu.: 9.00               3rd Qu.:33.5
                     Max.   :10.00               Max.   :57.0

Other useful commands for matrices and data frames include `dim()`, which returns the dimensions, and `rownames()` and `colnames()`, which find out (or set) the row and column names:

    > dim(Santa)
    [1] 50  5
    > colnames(Santa)
    [1] "Believe"   "Age"       "Gender"    "Presents"  "Behaviour"
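Setting names works by assigning to the same functions. Here is a small sketch using a hypothetical data frame called `toy` (it is made up for illustration and is not the `Santa` data):

```r
# Hypothetical toy data frame, not the Santa data
toy <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))

dim(toy)                                        # number of rows, then columns

colnames(toy) <- c("Score", "Label")            # replace the column names
rownames(toy) <- c("first", "second", "third")  # replace the row names

toy["second", "Score"]                          # select by row and column name
```

Once the names are set, rows and columns can be selected by name as well as by position.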

## Standard Deviation & Variance

You get the standard deviation of a vector `x` with `sd(x)`, or its variance with `var(x)`. For example, using the ages in the `Santa` data:

    > sd(Santa$Age)
    [1] 2.11901
    > var(Santa$Age)
    [1] 4.490204

We can also check that squaring the standard deviation gives the variance:

    > sd(Santa$Age) ^ 2
    [1] 4.490204
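It may help to see what `var()` computes. The sketch below works the sample variance out by hand, reusing the vector `x` from the first example rather than the `Santa` data; the variable names `n` and `v` are just illustrative.

```r
x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
v <- sum((x - mean(x))^2) / (n - 1)

all.equal(v, var(x))        # the hand calculation agrees with var()
all.equal(sqrt(v), sd(x))   # and its square root agrees with sd()
```

Note the divisor `n - 1`: `var()` and `sd()` give the sample versions of these statistics, not the population versions.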

What if we wanted to break up the data by gender?

    > table(Santa$Gender)

    female   male
        23     27

On the page about manipulating data you were shown that a vector of logical values can be used to select elements from a matrix or data frame. We can use this to extract only the "male" data.

    > Santa$Gender == "male"
     [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
    [10] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
    [19] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
    [28] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
    [37]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
    [46]  TRUE FALSE FALSE  TRUE  TRUE
    > Santa[Santa$Gender == "male",]
       Believe Age Gender Presents Behaviour
    1    FALSE   9   male       25   naughty
    2     TRUE   5   male       20      nice
    4     TRUE   4   male       34   naughty
    ...
    46    TRUE   4   male       34   naughty
    49    TRUE  10   male       57      nice
    50   FALSE   4   male        3   naughty

In particular, we can get the boys' ages like this:

    > Santa[Santa$Gender == "male","Age"]
     [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
    [19]  7  8  4  8  9  6  4 10  4

Or like this, which you might find clearer:

    > Santa$Age[Santa$Gender == "male"]
     [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
    [19]  7  8  4  8  9  6  4 10  4
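To try the two forms of selection without the `Santa` data to hand, here is a self-contained sketch. The data frame `kids` and the variable `is_male` are hypothetical names invented for this illustration.

```r
# Hypothetical toy data, standing in for the Santa data frame
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8),
  Gender = factor(c("male", "male", "female", "male", "female"))
)

is_male <- kids$Gender == "male"   # logical vector, one value per row
is_male

sum(is_male)            # TRUE counts as 1, so this is the number of males

kids[is_male, "Age"]    # select the rows first, then the Age column
kids$Age[is_male]       # select the column first, then its elements
```

Both forms return the same vector; storing the logical vector in a variable first can make longer expressions easier to read.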

Now that we have a vector of the boys' ages, it is easy to get their standard deviation and a quick summary of the distribution:

    > sd(Santa[Santa$Gender == "male","Age"])
    [1] 2.038099
    > summary(Santa[Santa$Gender == "male","Age"])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      4.000   4.500   6.000   6.333   8.000  10.000

For the girls this becomes:

    > sd(Santa[Santa$Gender == "female","Age"])
    [1] 2.086092
    > summary(Santa[Santa$Gender == "female","Age"])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      4.000   5.500   8.000   7.478   9.000  10.000

As you can see, the range and the standard deviation are about the same for the two groups, but the girls seem to be older: their median age is 8, compared with 6 for the boys.

That code is a bit long. How about this instead?

    > tapply(Santa$Age, Santa$Gender, sd)
      female     male
    2.086092 2.038099

Or, using `summary()`, we get:

    > tapply(Santa$Age, Santa$Gender, summary)
    $female
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      4.000   5.500   8.000   7.478   9.000  10.000

    $male
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      4.000   4.500   6.000   6.333   8.000  10.000

This uses the "table apply" function, `tapply()`, to apply the function `sd()` or `summary()` to the vector `Santa$Age` after it has been broken up into groups by the factor vector `Santa$Gender` (!).
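One way to picture what `tapply()` does is to carry out the splitting and the applying as two separate steps. The sketch below does this with `split()` and `sapply()` from base R, on a hypothetical data frame `kids` invented for this illustration; it shows an equivalent calculation, not how `tapply()` is implemented internally.

```r
# Hypothetical toy data, standing in for the Santa data frame
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8, 10),
  Gender = factor(c("male", "male", "female", "male", "female", "female"))
)

# tapply(): group Age by Gender and apply mean() to each group
by_tapply <- tapply(kids$Age, kids$Gender, mean)
by_tapply

# The same calculation in two explicit steps
groups   <- split(kids$Age, kids$Gender)  # a list with one vector per level
by_split <- sapply(groups, mean)          # apply mean() to each vector
by_split
```

Both results are labelled by the levels of the factor, so the group means can be looked up by name, for example `by_split["male"]`.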