Plotting the Iris Data
Did you know R has a built-in graphics demonstration? Type demo(graphics) at the prompt, and it produces a series of images (and shows you the code that generates them). This page was inspired by the eighth and ninth demo examples.
First I introduce the Iris data and draw some simple scatter plots, then show how to create plots like this:
In the follow-on page I then have a quick look at using linear regressions and linear models to analyse the trends.
The Data
The iris dataset (included with R) contains four measurements for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica). On this page there are photos of the three species, and some notes on classification based on sepal area versus petal area.
We can inspect the data in R like this:
> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
...
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica
The iris variable is a data.frame. It's like a matrix, but the columns may be of different types, and we can access the columns by name:
> class(iris)
[1] "data.frame"
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> iris$Petal.Length
  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1
You can also get the petal lengths by iris[,"Petal.Length"] or iris[,3] (treating the data frame like a matrix/array).
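The three ways of pulling out the column can be checked against each other. This is a minimal sketch using only base R; the variable names by_dollar, by_name and by_index are just illustrative.

```r
# Three equivalent ways to extract the petal lengths from the built-in data frame
by_dollar <- iris$Petal.Length
by_name   <- iris[, "Petal.Length"]
by_index  <- iris[, 3]

# All three return the same numeric vector, one value per flower
identical(by_dollar, by_name)
identical(by_dollar, by_index)
length(by_dollar)

# str() gives a compact overview of the column types in the data frame
str(iris)
```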
Simple Scatter Plots
Let's do a simple scatter plot of petal width against petal length:
> plot(iris$Petal.Length, iris$Petal.Width, main="Edgar Anderson's Iris Data")
It's interesting to mark or colour the points by species. We could use the pch argument (plotting character) for this. Consulting the help (see ?points), we might use pch=21 for filled circles, pch=22 for filled squares, pch=23 for filled diamonds, and pch=24 or pch=25 for up- or down-pointing triangles. Passing a single value would change every point in the same way, so the trick is to create a vector mapping the species to, say, 23, 24 or 25 and use that as the pch argument:
> plot(iris$Petal.Length, iris$Petal.Width, pch=c(23,24,25)[unclass(iris$Species)], main="Edgar Anderson's Iris Data")
This works by using c(23,24,25) to create a vector, and then selecting elements 1, 2 or 3 from it. How? unclass(iris$Species) turns the species column from a vector of categories (a "factor" in R terminology) into a vector of ones, twos and threes, one integer code per flower:
> c(23,24,25)[unclass(iris$Species)]
  [1] 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
 [25] 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
 [49] 23 23 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
 [73] 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
 [97] 24 24 24 24 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25
[121] 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25
[145] 25 25 25 25 25 25
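To see the mapping without scrolling through 150 values, the factor levels and their integer codes can be inspected directly. This is a small sketch in base R. The names codes, shapes, shape_by_code and shape_by_name are illustrative, and the named-vector lookup is an alternative I am suggesting, not something the graphics demo uses.

```r
# The factor stores each species as an integer code; levels() gives the labels
levels(iris$Species)

# as.integer() yields the same codes as unclass(), without the levels attribute
codes <- as.integer(iris$Species)
table(codes)

# Indexing a three-element vector by the codes picks one symbol per flower
shape_by_code <- c(23, 24, 25)[codes]

# Alternative (assumption, not from the demo): look symbols up by species name,
# which does not depend on the order of the factor levels
shapes <- c(setosa = 23, versicolor = 24, virginica = 25)
shape_by_name <- unname(shapes[as.character(iris$Species)])

# Both approaches give the same plotting symbols
identical(shape_by_code, shape_by_name)
```

The name-based lookup is a little more verbose, but it keeps working if the factor levels are ever reordered.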
We can use the same trick to generate a vector of colours, and use this on our scatter plot:
> plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data")
With different colours it's even clearer that the three species have very different petal sizes.
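Since the colours now carry meaning, a legend helps the reader. The following is a minimal sketch using legend() from base graphics; the variable name species_cols, the axis labels and the "topleft" placement are my own choices, not part of the original demo.

```r
# One fill colour per species, in the order of the factor levels
species_cols <- c("red", "green3", "blue")

plot(iris$Petal.Length, iris$Petal.Width,
     pch = 21, bg = species_cols[unclass(iris$Species)],
     xlab = "Petal length", ylab = "Petal width",
     main = "Edgar Anderson's Iris Data")

# levels() returns the species names in the same order as the integer codes,
# so the legend entries line up with the colours used above
lg <- legend("topleft", legend = levels(iris$Species),
             pch = 21, pt.bg = species_cols, title = "Species")
```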
Draftsman's or Pairs Scatter Plots
How do the other variables behave? We could generate each plot individually, but there is a quicker way, using the pairs command on the first four columns:
> pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
This type of image is also called a Draftsman's display - it shows the possible two-dimensional projections of multidimensional data (in this case, four dimensional). An actual engineer might use this to represent three dimensional physical objects.
It looks like most of the variables could be used to predict the species, except that sepal length and width alone would make it tricky to distinguish Iris versicolor from virginica (green and blue).
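The overlap can also be checked numerically, at least roughly. The sketch below, in base R, tabulates the minimum and maximum of a measurement for each species; the helper name range_by_species is hypothetical, and comparing ranges is only a crude stand-in for looking at the plots.

```r
# Minimum (row 1) and maximum (row 2) of one measurement for each species
range_by_species <- function(column) {
  sapply(split(iris[[column]], iris$Species), range)
}

# Sepal measurements: the versicolor and virginica ranges overlap
sepal_length <- range_by_species("Sepal.Length")
sepal_width  <- range_by_species("Sepal.Width")

# Petal length: setosa sits apart from the other two species
petal_length <- range_by_species("Petal.Length")

sepal_length
sepal_width
petal_length
```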
This is starting to get complicated, but we can write our own function to draw something else in the upper panels, such as the absolute value of Pearson's correlation coefficient:
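> panel.pearson <- function(x, y, ...) {
    # par("usr") holds the panel's plot limits as c(x1, x2, y1, y2)
    usr <- par("usr")
    horizontal <- (usr[1] + usr[2]) / 2
    vertical <- (usr[3] + usr[4]) / 2
    # Print the absolute Pearson correlation at the centre of the panel
    text(horizontal, vertical, format(abs(cor(x, y)), digits=2))
  }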
> pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = c("red","green3","blue")[unclass(iris$Species)], upper.panel=panel.pearson)
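The numbers drawn in the upper panels can be cross-checked against the correlation matrix. This is a short sketch, with cor_matrix as an illustrative name. Note that panel.pearson prints abs(cor(x, y)), so a negative correlation appears without its sign.

```r
# Pearson correlations between the four measurements
# (cor() uses method = "pearson" by default)
cor_matrix <- cor(iris[1:4])
round(cor_matrix, 2)

# The panel function shows absolute values, so compare against abs()
round(abs(cor_matrix), 2)
```

If the sign matters for your analysis, dropping abs() from the panel function would display it.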
Here is another variation with some different options, showing only the upper panels and using alternative captions on the diagonal:
> pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)
There are some more complicated examples (without pictures) of Customized Scatterplot Ideas over at the California Soil Resource Lab. You might also want to look at the function splom in the lattice package...
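As a starting point for the lattice route, here is a minimal sketch following the pattern used in the splom help page. It assumes the lattice package is installed (it normally ships with R as a recommended package); the auto.key legend and the variable name p are my additions.

```r
library(lattice)

# Scatter plot matrix of the four measurements, grouped (coloured) by species
p <- splom(~iris[1:4], groups = Species, data = iris,
           auto.key = list(columns = 3),
           main = "Edgar Anderson's Iris Data")

# lattice functions return a "trellis" object; printing it draws the plot
print(p)
```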