Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!
I spent six years working in the statistical modeling team at the UK’s Health and Safety
Laboratory. A large part of my job was working with the laboratory’s chemists, looking
at occupational exposure to various nasty substances to see if an industry was adhering
to safe limits. The laboratory gets sent tens of thousands of blood and urine samples
each year (and sometimes more exotic fluids like sweat or saliva), and has its own team
of occupational hygienists who visit companies and collect yet more samples.
The sample collection process is known as “biological monitoring.” This is because when
the occupational hygienists get home and their partners ask “How was your day?,” “I’ve
been biological monitoring, darling” is more respectable to say than “I spent all day
getting welders to wee into a vial.”
In 2010, I was lucky enough to be given a job swap with James, one of the chemists.
James’s parlour trick is that, after running many thousands of samples, he can tell the
level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This
skill was only revealed to me after we’d spent an hour playing “guess the creatinine level”
and James had suggested that “we make it more interesting.” I’d lost two packets of fig
rolls before I twigged that I was onto a loser.
The principle of the job swap was that I would spend a week in the lab assisting with
the experiments, and then James would come to my office to help out generating the
statistics. In the process, we’d both learn about each other’s working practices and find
ways to make future projects more efficient.
In the laboratory, I learned how to pipette (harder than it looks), and about the methods
used to ensure that the numbers spat out of the mass spectrometer4 were correct. So as
well as testing urine samples, within each experiment you need to test blanks (distilled
water, used to clean out the pipes, and also to check that you are correctly measuring
zero), calibrators (samples of a known concentration for calibrating the instrument5),
and quality controllers (samples with a concentration in a known range, to make sure
the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining
and recalibrating to ensure its accuracy.
Just knowing that these things have to be done to get sensible answers out of the ma?
chinery was a small revelation. Before I’d gone into the job swap, I didn’t really think
about where my data came from; that was someone else’s problem. From my point of
view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they
were a mistake; otherwise they were simply “right.” Afterwards, my view is more
nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an
approximation of the truth. This measurement error is important to remember, though
for health and safety purposes, there’s a nice feature. Values can be out by an order of
magnitude at the extreme low end for some tests, but we don’t need to worry so much
about that. It’s the high exposures that cause health problems, and measurement error
is much smaller at the top end.
I’m writing a book on R for O’Reilly, and I need interesting datasets for the examples. Any data that you provide will get you a mention in the book and in the publicity material, so it’s a great opportunity to publicise your work or your organisation.
Datasets from any area or industry are suitable; the only constraint is that it can be analysed with a few pages of R code to provide a result that a general reader might go “ooh”. There’s a chapter on data cleaning, so even dirty data is suitable!
All the data will be provided in an R package to accompany the book, so you need to be willing to make it publically available. I can help you anonymise the data, or strip out commercially sensitive parts if you require.
If you can provide anything, or you know someone who might be able to, then drop me an email at richierocks AT gmail DOT com. Thanks.
EDIT: There are some (quite) frequently asked questions already! Here are the answers; you can use your Jeopardy! skills to guess the questions.
1. The book is called “Learning R”, and it’s a fairly gentle introduction to the language, covering both how you program in R, and how you analyse data.
2. If you provide data, then yes, you can have an PDF of the pre-release version to make sure I haven’t done something silly with your dataset.