# Summarising Data In R

R provides a number of simple summary functions for use on R objects. Typical ones are length(), mean(), median(), max(), min(), range() and summary(). For example:

```
> x <- c(5,3,3,6,7,9,2,4,1,9,7,9,4,4,8)
> length(x)
[1] 15
> mean(x)
[1] 5.4
> median(x)
[1] 5
> min(x)
[1] 1
> max(x)
[1] 9
> range(x)
[1] 1 9
```

On simple vectors, summary() gives the minimum and maximum, the median, the mean, and the first and third quartiles (from which the interquartile range follows):

```
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     5.0     5.4     7.5     9.0
```
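
As a small sketch of how the quartiles reported by summary() relate to the interquartile range, the base R functions quantile() and IQR() can be applied to the same vector. This assumes R's default quantile method, which is also what summary() uses:

```r
# Same vector as in the example above
x <- c(5, 3, 3, 6, 7, 9, 2, 4, 1, 9, 7, 9, 4, 4, 8)

# First and third quartiles, as reported by summary()
q <- quantile(x, probs = c(0.25, 0.75))
print(q)

# Interquartile range: the distance between the two quartiles
iqr <- IQR(x)
print(iqr)

# The same value computed by hand from the quartiles
iqr_by_hand <- unname(q[2] - q[1])
print(iqr_by_hand)
```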

The summary() command can also be used on data frames and other more complicated objects. For example, here is the Santa Claus example once again:

```
> summary(Santa)
  Believe             Age           Gender      Presents    Behaviour
 Mode :logical   Min.   : 4.00   female:23   Min.   : 3.0   bad :26
 FALSE:25        1st Qu.: 5.00   male  :27   1st Qu.:20.0   good:24
 TRUE :25        Median : 7.00               Median :26.5
                 Mean   : 6.86               Mean   :27.0
                 3rd Qu.: 9.00               3rd Qu.:33.5
                 Max.   :10.00               Max.   :57.0
```

Other useful commands for matrices and data frames include dim(), which reports the dimensions, and rownames() and colnames(), which report (or set) the row and column names:

```
> dim(Santa)
[1] 50  5
> colnames(Santa)
[1] "Believe"   "Age"       "Gender"    "Presents"  "Behaviour"
```

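The "or set" part can be illustrated with a minimal sketch. The matrix m and its row and column labels below are made up for this illustration and are not part of the Santa data:

```r
# A small hypothetical matrix, filled column by column
m <- matrix(1:6, nrow = 2)

# Dimensions: 2 rows, 3 columns
print(dim(m))

# Assigning to colnames() and rownames() sets the names
colnames(m) <- c("a", "b", "c")
rownames(m) <- c("first", "second")
print(m)

# Once set, the names can be used for indexing
value <- m["second", "b"]
print(value)
```
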
## Standard Deviation & Variance

You get the standard deviation of a vector x with sd(x), or its variance with var(x). For example, using the ages in the Santa data:

```
> sd(Santa$Age)
[1] 2.11901
> var(Santa$Age)
[1] 4.490204
```

We can also check that squaring the standard deviation gives the variance:

```
> sd(Santa$Age) ^ 2
[1] 4.490204
```
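
To make the relationship concrete, here is a short sketch that computes the sample variance by hand and compares it with var() and sd(). It reuses the vector x from the start of this page rather than the Santa data, and relies on the fact that R's var() divides by n - 1 rather than n:

```r
x <- c(5, 3, 3, 6, 7, 9, 2, 4, 1, 9, 7, 9, 4, 4, 8)

n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Both comparisons should print TRUE
print(isTRUE(all.equal(manual_var, var(x))))
print(isTRUE(all.equal(manual_sd, sd(x))))
```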

What if we wanted to break up the data by gender?

```
> table(Santa$Gender)

female   male
    23     27
```

On the page about manipulating data you were shown that a vector of logical values can be used to select elements from a matrix or data frame. We can use this to extract only the "male" data.

```
> Santa$Gender == "male"
 [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[10] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[19] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[28] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[46]  TRUE FALSE FALSE  TRUE  TRUE
> Santa[Santa$Gender == "male",]
   Believe Age Gender Presents Behaviour
1    FALSE   9   male       25   naughty
2     TRUE   5   male       20      nice
4     TRUE   4   male       34   naughty
...
46    TRUE   4   male       34   naughty
49    TRUE  10   male       57      nice
50   FALSE   4   male        3   naughty
```

In particular, we can get the boys' ages like this:

```
> Santa[Santa$Gender == "male","Age"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
```

Or like this, which you might find clearer:

```
> Santa$Age[Santa$Gender == "male"]
 [1]  9  5  4  4  6  5  7  5  5  4  8  8  9  4 10  7  5  6
[19]  7  8  4  8  9  6  4 10  4
```
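
The two indexing styles can be tried out on a small stand-in data frame. The data frame kids below is hypothetical (it is not the Santa data), and the sketch only shows that both forms select the same elements:

```r
# Hypothetical stand-in for the Santa data: five children
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8),
  Gender = c("male", "male", "female", "male", "female")
)

# Logical vector: TRUE where the row is a boy
is_male <- kids$Gender == "male"

# Summing a logical vector counts the TRUE values
n_boys <- sum(is_male)
print(n_boys)

# Style 1: index the data frame by row (logical) and column (name)
ages_a <- kids[is_male, "Age"]

# Style 2: pull out the column first, then index it
ages_b <- kids$Age[is_male]

print(ages_a)
print(identical(ages_a, ages_b))
```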

Now that we have a vector of the boys' ages, it is trivial to get their standard deviation and a quick summary of the distribution:

```
> sd(Santa[Santa$Gender == "male","Age"])
[1] 2.038099
> summary(Santa[Santa$Gender == "male","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.500   6.000   6.333   8.000  10.000
```

For the girls this becomes:

```
> sd(Santa[Santa$Gender == "female","Age"])
[1] 2.086092
> summary(Santa[Santa$Gender == "female","Age"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.500   8.000   7.478   9.000  10.000
```

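As you can see, the range and the standard deviation are about the same for the two groups, but the girls seem to be older.

Rather than subsetting each group by hand, we can compute a statistic for every group at once with tapply():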

```
> tapply(Santa$Age, Santa$Gender, sd)
  female     male
2.086092 2.038099
```

Or, using summary(), we get:

```
> tapply(Santa$Age, Santa$Gender, summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.500   8.000   7.478   9.000  10.000

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   4.500   6.000   6.333   8.000  10.000
```

This uses the "table apply" function, tapply(), which applies a function (here sd() or summary()) to the vector Santa\$Age after splitting it into groups according to the factor Santa\$Gender.
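
As a rough sketch of what tapply() does, the following compares it with an explicit split-then-apply using split() and sapply(). The data frame kids is hypothetical (not the Santa data), and mean() is used only to keep the numbers easy to check by eye:

```r
# Hypothetical stand-in for the Santa data: five children
kids <- data.frame(
  Age    = c(9, 5, 7, 4, 8),
  Gender = factor(c("male", "male", "female", "male", "female"))
)

# One step: apply mean() to Age within each level of Gender
by_group <- tapply(kids$Age, kids$Gender, mean)
print(by_group)

# Two steps: split Age into a list of per-group vectors, then apply mean()
groups   <- split(kids$Age, kids$Gender)
by_split <- sapply(groups, mean)
print(by_split)

# Both approaches give the same per-group values
same_values <- isTRUE(all.equal(as.vector(by_group), unname(by_split)))
print(same_values)
```

The grouping vector does not have to be a factor already; tapply() converts it with as.factor() if needed.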