## More useless statistics

Over at the ExploringDataBlog, Ron Pearson just wrote a post about cases where means are useless. In fact, it's possible to calculate a whole load of statistics on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are taking an intro stats course, you will see this example soon) is Anscombe's quartet.

The data set is available in R as `anscombe`, but it requires a little reshaping to be useful.

```r
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
```

Note the use of `gl` to autogenerate the factor levels.
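As a small illustration (mine, not from the original post), `gl(n, k)` returns a factor with `n` levels, each repeated `k` times in order:

```r
gl(2, 3)
# [1] 1 1 1 2 2 2
# Levels: 1 2

# So in anscombe2 above, gl(4, nrow(anscombe)) labels the four stacked
# x-y sets: eleven 1s, then eleven 2s, 3s and 4s.
```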

So we have four sets of x-y data, from which we can easily calculate summary statistics using `ddply` from the `plyr` package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.

```r
library(plyr)
(stats <- ddply(anscombe2, .(group), summarize,
  mean         = mean(y),
  std_dev      = sd(y),
  correlation  = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect  = lm(y ~ x)$coefficients[2]
))
```

```
  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
```
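As an aside (my own check, not part of the original post), the deception extends to the x columns too: their means and variances agree exactly across all four groups.

```r
# Means and variances of the x columns, group by group
with(anscombe, c(mean(x1), mean(x2), mean(x3), mean(x4)))  # all exactly 9
with(anscombe, c(var(x1), var(x2), var(x3), var(x4)))      # all exactly 11
```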

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

```r
library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)
```

Each dataset is *really* different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.

The mean and the standard deviation are calculated on the y values only, and the slope and intercept estimates come from a different model. Let's repeat the visualisation:

```r
p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
p
```

Now do:

```r
p + stat_abline()
```

but now do:

```r
p + stat_smooth(method = "lm")
```

Not the same stuff, if you ask me.

einar

einar,

Thanks for the comment. I think you meant

```r
p + stat_abline(intercept = 3, slope = 0.5)
```

which gives the same line as when you call stat_smooth.
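To make the connection explicit (my sketch, reusing the data from the post): a pooled regression recovers those two numbers, which is why a single parameterised `stat_abline` matches the per-panel `stat_smooth(method = "lm")` fits.

```r
# Rebuild anscombe2 so this snippet runs on its own
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

# Every group's fit (and the pooled fit) lands near intercept = 3 and
# slope = 0.5, so one stat_abline(intercept = 3, slope = 0.5) draws
# essentially the same line in every panel.
coefs <- coef(lm(y ~ x, data = anscombe2))
round(coefs, 2)  # approximately 3 and 0.5
```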

My point was that we routinely calculate a bunch of summary statistics to describe data, but there are many different datasets that conform to those statistics. To truly understand your data, it is often better to visualise it.

I do like this post. I particularly like the way that you make us work for the answer by running the code. This is definitely going in my “you should all use R and plot things” talk.