Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.
The data set is available in R as anscombe, but it requires a little reshaping to be useful.
anscombe2 <- with(anscombe, data.frame(
  x = c(x1, x2, x3, x4),
  y = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
Note the use of gl to autogenerate factor levels.
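As a quick check of what gl is doing here, the short sketch below calls it with the same arguments on its own. The variable names n_rows and group are just for illustration.

```r
# gl(n, k) generates a factor with n levels, each repeated k times in a block.
n_rows <- nrow(anscombe)   # the built-in anscombe data has 11 rows
group <- gl(4, n_rows)     # eleven 1s, then eleven 2s, and so on

levels(group)   # "1" "2" "3" "4"
table(group)    # 11 observations in each level
```

Because the x and y columns were stacked in the order 1 to 4, the blocks of levels line up with the dataset each row came from.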
So we have four sets of x-y data, from which we can easily calculate summary statistics using ddply from the plyr package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.
library(plyr)
# One row of summary statistics per group
(stats <- ddply(anscombe2, .(group), summarize,
  mean = mean(y),
  std_dev = sd(y),
  correlation = cor(x, y),
  # coefficients holds the intercept first, then the slope for x
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect = lm(y ~ x)$coefficients[2]
))

  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
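If you don't have plyr installed, the same table can be sketched in base R. This is one possible approach rather than the method used in the post; the helper summarise_group and the result name stats_base are made up for this example.

```r
# Rebuild the long-format data so this block runs on its own.
anscombe2 <- with(anscombe, data.frame(
  x = c(x1, x2, x3, x4),
  y = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

# Hypothetical helper: the same five statistics for one group's rows.
summarise_group <- function(d) {
  fit <- coef(lm(y ~ x, data = d))   # intercept first, then slope
  c(
    mean = mean(d$y),
    std_dev = sd(d$y),
    correlation = cor(d$x, d$y),
    lm_intercept = unname(fit[1]),
    lm_x_effect = unname(fit[2])
  )
}

# split() gives one data frame per group; sapply() returns a matrix with
# one column per group, so transpose it to get one row per group.
stats_base <- t(sapply(split(anscombe2, anscombe2$group), summarise_group))
round(stats_base, 3)
```

Fitting the regression once per group and reusing its coefficients also avoids calling lm twice for the same data.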
Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)
library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)
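If you'd like a numeric hint to go with the plot, a few less routine summaries are worth a look as well. The sketch below is only a suggestion: the choice of median, maximum and number of distinct x values is mine, and extra is a made-up name. I'll leave the output for you to discover.

```r
# Rebuild the long-format data so this block runs on its own.
anscombe2 <- with(anscombe, data.frame(
  x = c(x1, x2, x3, x4),
  y = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

# Per-group summaries that are less commonly reported than mean and sd.
extra <- data.frame(
  group = levels(anscombe2$group),
  median_y = as.vector(tapply(anscombe2$y, anscombe2$group, median)),
  max_y = as.vector(tapply(anscombe2$y, anscombe2$group, max)),
  distinct_x = as.vector(tapply(
    anscombe2$x, anscombe2$group,
    function(v) length(unique(v))   # how many different x values occur
  ))
)
extra
```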
Each dataset is really different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.
A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.
useR has been exhilarating and exhausting. Now it’s finished, I wanted to share my highlights.
10. My inner twelve-year-old schoolgirl swooning and fainting with excitement every time I chatted with a member of R-core.
9. Patrick Burns declaring that his company consists of himself and his two cats. And that one of the cats keeps changing the settings on his mail reader to spite him.
8. Søren Højsgaard and Robert Goudie both patiently answering my bazillion stupid questions on Bayesian networks.
7. Audience members high-fiving each other during my talk.
6. Peter Baker coming up with ideas for a don’t-repeat-yourself workflow, so I can spend more time doing analysis that matters.
5. Peter Baker coming up with ideas for a don’t-repeat-yourself workflow. Oh, wait.
4. Jason and Tobias from OpenAnalytics talking about lab automation, so I can get those darned chemists off my back. (Just kidding, chemists; I love your data.)
3. Blogist Tal Galili volunteering as my speaking coach. (The big secret: talking really slowly to yourself before a presentation stops you overloading on adrenalin before you get up.)
2. Jonathan Rougier getting the audience to go “ah” every time he mentioned donkeys. And getting me to understand what the point of nomograms is.
1. Ben French teaching me that there is a third joke about stats. (I’ll tell you the other two another time.) It goes like this:
An engineer, a chemist and a statistician are staying in a hotel. (It’s a triple room, very cosy.) The first night they are there, a fire breaks out and wakes them up. The engineer gets up, grabs the fire extinguisher and puts out the fire. Later, they’re all woken by another fire (very dodgy health and safety). The chemist thinks “the fire reaction requires oxygen”, grabs a fire blanket and smothers the flames until they go out. Even later, the engineer and the chemist are woken by the statistician lighting a series of fires in the corner. “What are you doing?”, they cry in unison. “Increasing the sample size”, replies the statistician.