Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!
I spent six years working in the statistical modeling team at the UK’s Health and Safety
Laboratory. A large part of my job was working with the laboratory’s chemists, looking
at occupational exposure to various nasty substances to see if an industry was adhering
to safe limits. The laboratory gets sent tens of thousands of blood and urine samples
each year (and sometimes more exotic fluids like sweat or saliva), and has its own team
of occupational hygienists who visit companies and collect yet more samples.
The sample collection process is known as “biological monitoring.” This is because when
the occupational hygienists get home and their partners ask “How was your day?,” “I’ve
been biological monitoring, darling” is more respectable to say than “I spent all day
getting welders to wee into a vial.”
In 2010, I was lucky enough to be given a job swap with James, one of the chemists.
James’s parlour trick is that, after running many thousands of samples, he can tell the
level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This
skill was only revealed to me after we’d spent an hour playing “guess the creatinine level”
and James had suggested that “we make it more interesting.” I’d lost two packets of fig
rolls before I twigged that I was onto a loser.
The principle of the job swap was that I would spend a week in the lab assisting with
the experiments, and then James would come to my office to help out generating the
statistics. In the process, we’d both learn about each other’s working practices and find
ways to make future projects more efficient.
In the laboratory, I learned how to pipette (harder than it looks), and about the methods
used to ensure that the numbers spat out of the mass spectrometer4 were correct. So as
well as testing urine samples, within each experiment you need to test blanks (distilled
water, used to clean out the pipes, and also to check that you are correctly measuring
zero), calibrators (samples of a known concentration for calibrating the instrument5),
and quality controllers (samples with a concentration in a known range, to make sure
the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining
and recalibrating to ensure its accuracy.
Just knowing that these things have to be done to get sensible answers out of the ma?
chinery was a small revelation. Before I’d gone into the job swap, I didn’t really think
about where my data came from; that was someone else’s problem. From my point of
view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they
were a mistake; otherwise they were simply “right.” Afterwards, my view is more
nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an
approximation of the truth. This measurement error is important to remember, though
for health and safety purposes, there’s a nice feature. Values can be out by an order of
magnitude at the extreme low end for some tests, but we don’t need to worry so much
about that. It’s the high exposures that cause health problems, and measurement error
is much smaller at the top end.
I’ve just answered my hundred billionth question on Stack Overflow that goes something like
I want to calculate some statistic for lots of different groups.
Although these questions provide a steady stream of easy points, its such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.
First off, you need to data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for and another column containing the group to which that value belongs. A good example is the InsectSprays dataset, built into R.
head(InsectSprays) count spray 1 10 A 2 7 A 3 20 A 4 14 A 5 14 A 6 12 A
These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.
First, we split the
count column by the
(count_by_spray <- with(InsectSprays, split(count, spray)))
Secondly, we apply the statistic to each element of the list. Lets use the
(mean_by_spray <- lapply(count_by_spray, mean))
Finally, (if possible) we recombine the list as a vector.
This procedure is such a common thing that there are many functions to speed up the process.
vapply do the last two steps together.
sapply(count_by_spray, mean) vapply(count_by_spray, mean, numeric(1))
We can do even better than that however.
by all provide a one-function solution to these S-A-C problems.
with(InsectSprays, tapply(count, spray, mean)) with(InsectSprays, by(count, spray, mean)) aggregate(count ~ spray, InsectSprays, mean)
plyr package also provides several solutions, with a choice of output format.
ddply takes a data frame and returned another data frame, which is what you’ll want most of the time.
dlply takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.
ddply(InsectSprays, .(spray), summarise, mean.count = mean(count)) dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))
You can read much more on this type of problem and the
plyr solution in The Split-Apply-Combine Strategy for Data Analysis, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.
One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the
ave function (which provides
mean as the default function).
with(InsectSprays, ave(count, spray))
Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.
The data set is available in R as
anscombe, but it requires a little reshaping to be useful.
anscombe2 <- with(anscombe, data.frame( x = c(x1, x2, x3, x4), y = c(y1, y2, y3, y4), group = gl(4, nrow(anscombe)) ))
Note the use of
gl to autogenerate factor levels.
So we have four sets of x-y data, which we can easily calculate summary statistics from using
ddply from the
plyr package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.
library(plyr) (stats <- ddply(anscombe2, .(group), summarize, mean = mean(y), std_dev = sd(y), correlation = cor(x, y), lm_intercept = lm(y ~ x)$coefficients, lm_x_effect = lm(y ~ x)$coefficients )) group mean std_dev correlation lm_intercept lm_x_effect 1 1 7.500909 2.031568 0.8164205 3.000091 0.5000909 2 2 7.500909 2.031657 0.8162365 3.000909 0.5000000 3 3 7.500000 2.030424 0.8162867 3.002455 0.4997273 4 4 7.500909 2.030579 0.8165214 3.001727 0.4999091
Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)
library(ggplot2) (p <- ggplot(anscombe2, aes(x, y)) + geom_point() + facet_wrap(~ group) )
Each dataset is really different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.
A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.