Summarising Data In R

R provides a number of simple summary functions that can be applied to R objects. Typical ones are length(), mean(), median(), max(), min(), range() and summary(). For example:

> x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
> length(x)
[1] 15
> mean(x)
[1] 5.4
> median(x)
[1] 5
> min(x)
[1] 1
> max(x)
[1] 9
> range(x)
[1] 1 9

On simple vectors, summary() gives the minimum and maximum, the median, the mean, and the first and third quartiles:

> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     3.5     5.0     5.4     7.5     9.0
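
The quartiles that summary() reports can also be obtained one at a time. The following is a minimal sketch that reuses the vector x from above; quantile() and IQR() are standard base R functions, and the variable names q and iqr_by_hand are only illustrative. The interquartile range is the third quartile minus the first:

```r
x <- c(5, 3, 3, 6, 7, 9, 2, 4, 1, 9, 7, 9, 4, 4, 8)

# First and third quartiles (the "1st Qu." and "3rd Qu." entries of summary())
q <- quantile(x, c(0.25, 0.75))
print(q)

# Interquartile range: third quartile minus first quartile
iqr_by_hand <- unname(q[2] - q[1])
print(iqr_by_hand)

# IQR() computes the same quantity directly
print(IQR(x))
```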

The summary() command can also be used on data frames and other more complicated objects. For example, here is the Santa Claus example once again:

> summary(Santa)
  Believe             Age           Gender      Presents    Behaviour
 Mode :logical   Min.   : 4.00   female:23   Min.   : 3.0   bad :26  
 FALSE:25        1st Qu.: 5.00   male  :27   1st Qu.:20.0   good:24  
 TRUE :25        Median : 7.00               Median :26.5            
                 Mean   : 6.86               Mean   :27.0            
                 3rd Qu.: 9.00               3rd Qu.:33.5            
                 Max.   :10.00               Max.   :57.0

Other useful commands for matrices and data frames include dim(), which gives the dimensions, and rownames() and colnames(), which get (or set) the row and column names:

> dim(Santa)
[1] 50  5
> colnames(Santa)
[1] "Believe"   "Age"       "Gender"    "Presents"  "Behaviour"
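
Setting names works by assigning to the same functions. The sketch below uses a small made-up data frame called kids (it is not the Santa data, and its values are purely illustrative) to show both reading the dimensions and renaming rows and columns:

```r
# Hypothetical toy data frame, used only for illustration
kids <- data.frame(a = c(9, 5, 4), b = c(25, 20, 34))

print(dim(kids))    # number of rows, then number of columns
print(nrow(kids))   # rows only
print(ncol(kids))   # columns only

# Assigning to colnames() or rownames() replaces the existing names
colnames(kids) <- c("Age", "Presents")
rownames(kids) <- c("first", "second", "third")
print(kids)
```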

Standard Deviation & Variance

You get the standard deviation of a vector x with sd(x), or its variance with var(x). For example, using the ages in the Santa data:

> sd(Santa$Age)
[1] 2.11901
> var(Santa$Age)
[1] 4.490204

We can also check that squaring the standard deviation gives the variance:

> sd(Santa$Age) ^ 2
[1] 4.490204
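
Both functions compute the sample versions, which divide by n - 1 rather than n. As a rough check, the sketch below recomputes them by hand for the vector x from the start of this page; the names var_by_hand and sd_by_hand are only illustrative:

```r
x <- c(5, 3, 3, 6, 7, 9, 2, 4, 1, 9, 7, 9, 4, 4, 8)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(var_by_hand)

# all.equal() compares numbers while allowing for floating-point rounding
print(all.equal(var_by_hand, var(x)))
print(all.equal(sd_by_hand, sd(x)))
```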

What if we wanted to break up the data by gender?

> table(Santa$Gender)

female   male 
    23     27

On the page about manipulating data you were shown that a vector of logical values can be used to select elements from a matrix or data frame. We can use this to extract only the "male" data.

> Santa$Gender == "male"
 [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[10] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[19] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[28] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[46]  TRUE FALSE FALSE  TRUE  TRUE
> Santa[Santa$Gender == "male",]
   Believe Age Gender Presents Behaviour
1    FALSE   9   male       25   naughty
2     TRUE   5   male       20      nice
4     TRUE   4   male       34   naughty
...
46    TRUE   4   male       34   naughty
49    TRUE  10   male       57      nice
50   FALSE   4   male        3   naughty

In particular, we can get the boys' ages like this:

> Santa[Santa$Gender == "male","Age"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4

Or like this, which you might find clearer:

> Santa$Age[Santa$Gender == "male"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
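
The same selection can be tried on a small scale. The sketch below uses a made-up data frame, kids, with the same column names as two of the Santa columns (the values are invented for illustration), and also shows subset(), a base R function that does the filtering by name:

```r
# Hypothetical toy data in the same shape as two of the Santa columns
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8, 10),
  Gender = c("male", "male", "female", "male", "female", "female")
)

# Logical vector with one TRUE/FALSE entry per row
is_male <- kids$Gender == "male"
print(is_male)

# Three equivalent ways to pull out the ages of the male rows
a1 <- kids[is_male, "Age"]
a2 <- kids$Age[is_male]
a3 <- subset(kids, Gender == "male")$Age
print(a1)

# sum() of a logical vector counts the TRUE values
print(sum(is_male))
```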

Now that we have a vector of the boys' ages, it is trivial to get their standard deviation and a quick summary of their distribution:

> sd(Santa[Santa$Gender == "male","Age"])
[1] 2.038099
> summary(Santa[Santa$Gender == "male","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.000   4.500   6.000   6.333   8.000  10.000

For the girls this becomes:

> sd(Santa[Santa$Gender == "female","Age"])
[1] 2.086092
> summary(Santa[Santa$Gender == "female","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.000   5.500   8.000   7.478   9.000  10.000

As you can see, the range and the standard deviation are about the same for the two groups, but the girls seem to be older.

That code is a bit long. How about this instead?

> tapply(Santa$Age, Santa$Gender, sd)
  female     male 
2.086092 2.038099

Or, using summary() instead of sd():

> tapply(Santa$Age, Santa$Gender, summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.000   5.500   8.000   7.478   9.000  10.000 

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.000   4.500   6.000   6.333   8.000  10.000

This uses the "table apply" function, tapply(), which splits the vector Santa$Age into groups according to the factor Santa$Gender and applies the function sd() or summary() to each group.
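
To see what tapply() does step by step, the sketch below repeats the calculation on a made-up data frame, kids (invented values, not the Santa data), first with tapply() and then with the grouping written out using split() and sapply(). The base R function aggregate() is shown as one further alternative that returns a data frame:

```r
# Hypothetical toy data, used only for illustration
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8, 10),
  Gender = factor(c("male", "male", "female", "male", "female", "female"))
)

# One call: group Age by Gender and take the mean of each group
by_tapply <- tapply(kids$Age, kids$Gender, mean)
print(by_tapply)

# The same thing in two steps: split() returns a list with one
# age vector per gender, and sapply() applies mean() to each element
groups <- split(kids$Age, kids$Gender)
by_split <- sapply(groups, mean)
print(by_split)

# aggregate() gives the same figures as a data frame
print(aggregate(Age ~ Gender, data = kids, FUN = mean))
```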
