Summarising Data In R
R provides a number of simple summary functions for use on R objects. Typical functions are length(), mean(), median(), max(), min(), range() and summary(). For example:
> x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
> length(x)
[1] 15
> mean(x)
[1] 5.4
> median(x)
[1] 5
> min(x)
[1] 1
> max(x)
[1] 9
> range(x)
[1] 1 9
On simple vectors, summary() gives the minimum and maximum, the median, the mean, and the first and third quartiles:
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     5.0     5.4     7.5     9.0
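The quartiles that summary() reports can also be obtained directly. Here is a minimal sketch using the same x as above. It relies on quantile() and IQR(), which are standard R functions but are not otherwise covered on this page.

```r
# Same vector as in the example above
x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)

# Minimum, 1st quartile, median, 3rd quartile and maximum in one call
q <- quantile(x)
print(q)

# The interquartile range is the 3rd quartile minus the 1st quartile
iqr <- IQR(x)
print(iqr)
```

Note that summary() shows the two quartiles themselves, while IQR() returns the distance between them.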
The summary() command can also be used on data frames and other more complicated objects. For example, here is the Santa Claus data once again:
> summary(Santa)
  Believe             Age           Gender      Presents     Behaviour
 Mode :logical   Min.   : 4.00   female:23   Min.   : 3.0   bad :26
 FALSE:25        1st Qu.: 5.00   male  :27   1st Qu.:20.0   good:24
 TRUE :25        Median : 7.00               Median :26.5
                 Mean   : 6.86               Mean   :27.0
                 3rd Qu.: 9.00               3rd Qu.:33.5
                 Max.   :10.00               Max.   :57.0
Other useful commands for matrices and data frames include dim(), which reports the dimensions, and rownames() and colnames(), which find out (or set) the row and column names:
> dim(Santa)
[1] 50  5
> colnames(Santa)
[1] "Believe"   "Age"       "Gender"    "Presents"  "Behaviour"
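The example above only reads the names. Setting them works by assigning to the same functions. The sketch below uses a small made-up data frame called toy (hypothetical, not part of the Santa data) so that it can be run on its own.

```r
# A small made-up data frame, used only for illustration
toy <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))

dim(toy)        # number of rows, then number of columns
colnames(toy)   # the current column names

# Assigning to colnames() or rownames() replaces the names
colnames(toy) <- c("Count", "Label")
rownames(toy) <- c("first", "second", "third")
print(toy)
```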
Standard Deviation & Variance
You get the standard deviation of a vector x with sd(x), or its variance with var(x). For example, using the ages in the Santa data:
> sd(Santa$Age)
[1] 2.11901
> var(Santa$Age)
[1] 4.490204
We can also check that squaring the standard deviation gives the variance:
> sd(Santa$Age) ^ 2
[1] 4.490204
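To see where these numbers come from, the variance can be computed by hand. This is a minimal sketch on the vector x from the start of the page rather than the Santa data. The names manual_var and manual_sd are made up for the illustration. Both var() and sd() use the sample formula, which divides by n - 1.

```r
x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# all.equal() compares numbers while allowing for floating point rounding
print(all.equal(manual_var, var(x)))
print(all.equal(manual_sd, sd(x)))
```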
What if we wanted to break up the data by gender?
> table(Santa$Gender)
female   male
    23     27
On the page about manipulating data you were shown that a vector of logical values could be used to select elements from a matrix or data frame. We can use this to extract only the "male" data.
> Santa$Gender == "male"
 [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[10] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[19] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[28] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[46]  TRUE FALSE FALSE  TRUE  TRUE
> Santa[Santa$Gender == "male",]
   Believe Age Gender Presents Behaviour
1    FALSE   9   male       25   naughty
2     TRUE   5   male       20      nice
4     TRUE   4   male       34   naughty
...
46    TRUE   4   male       34   naughty
49    TRUE  10   male       57      nice
50   FALSE   4   male        3   naughty
In particular, we can get the boys' ages like this:
> Santa[Santa$Gender == "male","Age"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
Or like this, which you might find clearer:
> Santa$Age[Santa$Gender == "male"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
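If you do not have the Santa data to hand, the same selection pattern can be tried on a small made-up data frame. In this sketch the data frame kids and its values are hypothetical. It also shows that logical conditions can be combined with &.

```r
# A small made-up data frame, used only for illustration
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8),
  Gender = factor(c("male", "male", "female", "male", "female"))
)

# One TRUE or FALSE per row
is_male <- kids$Gender == "male"

# Two equivalent ways of selecting the boys' ages
boys_a <- kids[is_male, "Age"]
boys_b <- kids$Age[is_male]
print(identical(boys_a, boys_b))

# Conditions can be combined with & (and) or | (or)
older_boys <- kids$Age[is_male & kids$Age > 4]
print(older_boys)
```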
Now that we have a vector of the boys' ages, it is easy to get their standard deviation and a quick summary of the distribution:
> sd(Santa[Santa$Gender == "male","Age"])
[1] 2.038099
> summary(Santa[Santa$Gender == "male","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.500   6.000   6.333   8.000  10.000
For the girls this becomes:
> sd(Santa[Santa$Gender == "female","Age"])
[1] 2.086092
> summary(Santa[Santa$Gender == "female","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.500   8.000   7.478   9.000  10.000
As you can see, the range and the standard deviation are about the same for the two groups, but the girls seem to be older (median age 8 versus 6).
That code is a bit long. How about this instead?
> tapply(Santa$Age, Santa$Gender, sd)
  female     male
2.086092 2.038099
or using summary() we get:
> tapply(Santa$Age, Santa$Gender, summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.500   8.000   7.478   9.000  10.000

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.500   6.000   6.333   8.000  10.000
This used the "table apply" function, tapply(), to apply the function sd() or summary() to the vector Santa$Age when broken up into a table using the factor vector Santa$Gender (!).
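One way to picture what tapply() does is as two steps: split the vector into one piece per factor level, then apply the function to each piece. The sketch below shows this on made-up data. The vectors ages and gender are hypothetical, and split() followed by sapply() is only an illustration of the idea, not a description of how tapply() is implemented internally.

```r
# Made-up data, used only for illustration
ages   <- c(9, 5, 7, 4, 8, 10)
gender <- factor(c("male", "male", "female", "male", "female", "female"))

# One call: mean age for each level of the factor
by_gender <- tapply(ages, gender, mean)
print(by_gender)

# The same idea in two steps: split into groups, then apply to each group
pieces  <- split(ages, gender)
by_hand <- sapply(pieces, mean)
print(by_hand)
```

Any function that takes a vector can be used in place of mean, including sd and summary as shown above.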