
## More useless statistics

Over at the ExploringDataBlog, Ron Pearson just wrote a post about cases where means are useless. In fact, it's possible to calculate a whole load of statistics on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are taking an intro stats course, you will see this example soon) is the Anscombe quartet.

The data set is available in R as `anscombe`, but it requires a little reshaping to be useful.

```
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
```

Note the use of `gl` to autogenerate factor levels.
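In case you haven't met it before, here's a quick sketch of what `gl` ("generate levels") gives us here; `anscombe` has 11 rows, so `gl(4, 11)` labels the four stacked datasets:

```r
# gl(n, k) generates a factor with n levels, each repeated k times in a block.
g <- gl(4, 11)
length(g)  # 44 values in total (4 levels x 11 rows)
levels(g)  # "1" "2" "3" "4"
table(g)   # each level appears 11 times
```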

So we have four sets of x-y data, which we can easily calculate summary statistics from using `ddply` from the `plyr` package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.

```
library(plyr)
(stats <- ddply(anscombe2, .(group), summarize,
  mean         = mean(y),
  std_dev      = sd(y),
  correlation  = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect  = lm(y ~ x)$coefficients[2]
))

  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
```
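The `plyr` call isn't essential, by the way. As a sketch, the same per-group summaries can be computed with base R's `by` (assuming the reshaped `anscombe2` from above):

```r
# Reshape the built-in anscombe data as before, then summarise each group
# using base R only: by() splits on group, do.call(rbind, ...) stacks the rows.
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
stats <- do.call(rbind, by(anscombe2, anscombe2$group, function(d) {
  coefs <- coef(lm(y ~ x, data = d))
  data.frame(
    mean         = mean(d$y),
    std_dev      = sd(d$y),
    correlation  = cor(d$x, d$y),
    lm_intercept = coefs[1],
    lm_x_effect  = coefs[2]
  )
}))
round(stats, 2)  # near-identical rows for all four groups
```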

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

```
library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)
```

Each dataset is really different: the statistics we routinely calculate don't fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.

1. 22nd August, 2011 at 23:58

The mean and the std dev are computed on the y values only; the slope and intercept estimates come from a different model. Let's repeat the visualisation:

p <- ggplot(anscombe2, aes(x, y)) +
geom_point() +
facet_wrap(~ group)
p

now do:
p + stat_abline()

but now do:
p + stat_smooth(method = "lm")

not the same stuff if you ask me.

einar

• 23rd August, 2011 at 10:43 am

einar,

Thanks for the comment. I think you meant

p + stat_abline(intercept = 3, slope = 0.5)

which gives the same line as when you call stat_smooth.

My point was that we routinely calculate a bunch of summary statistics to describe data, but there are many different datasets that conform to those statistics. To truly understand your data, it is often better to visualise it.

2. 27th August, 2011 at 8:12 am

I do like this post. I particularly like the way that you make us work for the answer by running the code. This is definitely going in my “you should all use R and plot things” talk.
