Posts Tagged ‘data’

A little Christmas Present for you

25th December, 2012 Leave a comment

Here’s an excerpt from my chapter “Blood, sweat and urine” from The Bad Data Handbook. Have a lovely Christmas!

I spent six years working in the statistical modeling team at the UK’s Health and Safety
Laboratory. A large part of my job was working with the laboratory’s chemists, looking
at occupational exposure to various nasty substances to see if an industry was adhering
to safe limits. The laboratory gets sent tens of thousands of blood and urine samples
each year (and sometimes more exotic fluids like sweat or saliva), and has its own team
of occupational hygienists who visit companies and collect yet more samples.
The sample collection process is known as “biological monitoring.” This is because when
the occupational hygienists get home and their partners ask “How was your day?,” “I’ve
been biological monitoring, darling” is more respectable to say than “I spent all day
getting welders to wee into a vial.”
In 2010, I was lucky enough to be given a job swap with James, one of the chemists.
James’s parlour trick is that, after running many thousands of samples, he can tell the
level of creatinine in someone’s urine with uncanny accuracy, just by looking at it. This
skill was only revealed to me after we’d spent an hour playing “guess the creatinine level”
and James had suggested that “we make it more interesting.” I’d lost two packets of fig
rolls before I twigged that I was onto a loser.

The principle of the job swap was that I would spend a week in the lab assisting with
the experiments, and then James would come to my office to help out generating the
statistics. In the process, we’d both learn about each other’s working practices and find
ways to make future projects more efficient.
In the laboratory, I learned how to pipette (harder than it looks), and about the methods
used to ensure that the numbers spat out of the mass spectrometer4 were correct. So as
well as testing urine samples, within each experiment you need to test blanks (distilled
water, used to clean out the pipes, and also to check that you are correctly measuring
zero), calibrators (samples of a known concentration for calibrating the instrument5),
and quality controllers (samples with a concentration in a known range, to make sure
the calibration hasn’t drifted). On top of this, each instrument needs regular maintaining
and recalibrating to ensure its accuracy.
Just knowing that these things have to be done to get sensible answers out of the ma?
chinery was a small revelation. Before I’d gone into the job swap, I didn’t really think
about where my data came from; that was someone else’s problem. From my point of
view, if the numbers looked wrong (extreme outliers, or otherwise dubious values) they
were a mistake; otherwise they were simply “right.” Afterwards, my view is more
nuanced. Now all the numbers look like, maybe not quite a guess, but certainly only an
approximation of the truth. This measurement error is important to remember, though
for health and safety purposes, there’s a nice feature. Values can be out by an order of
magnitude at the extreme low end for some tests, but we don’t need to worry so much
about that. It’s the high exposures that cause health problems, and measurement error
is much smaller at the top end.


Make your data famous!

30th October, 2012 7 comments

I’m writing a book on R for O’Reilly, and I need interesting datasets for the examples. Any data that you provide will get you a mention in the book and in the publicity material, so it’s a great opportunity to publicise your work or your organisation.

Datasets from any area or industry are suitable; the only constraint is that it can be analysed with a few pages of R code to provide a result that a general reader might go “ooh”. There’s a chapter on data cleaning, so even dirty data is suitable!

All the data will be provided in an R package to accompany the book, so you need to be willing to make it publically available. I can help you anonymise the data, or strip out commercially sensitive parts if you require.

If you can provide anything, or you know someone who might be able to, then drop me an email at richierocks AT gmail DOT com. Thanks.

EDIT: There are some (quite) frequently asked questions already! Here are the answers; you can use your Jeopardy! skills to guess the questions.
1. The book is called “Learning R”, and it’s a fairly gentle introduction to the language, covering both how you program in R, and how you analyse data.
2. If you provide data, then yes, you can have an PDF of the pre-release version to make sure I haven’t done something silly with your dataset.

Tags: , , ,

Anonymising data

23rd August, 2011 7 comments

There are only three known jokes about statistics in the whole universe, so to complete the trilogy (see here and here for the other two), listen up:

Three statisticians are on a train journey to a conference, and they get chatting to three epidemiologists who are also going to the same place. The epidemiologists are complaining about the ridiculous cost of train tickets these days. At this, one of the statisticians pipes up “it’s actually quite reasonable if use our method – we’ve just got one ticket between the three of us”.

The epidemiologists are amazed. “But how do you get away with that?”, they cried in unison.

“Watch and learn” replied a statistician.

A few minutes later, the inspector’s voice was heard down the carriage. At that, the statisticians bundled themselves into the toilet. The inspector knocked on the door. “Tickets please”, she said, and the statisticians passed their single ticket under the door. The inspector stamped it and returned it, and the statisticians made it to the conference.

On the way back, the statisticians again met the epidemiologists. This time, the epidemiologists proudly displayed their single ticket. “Aha”, said a statistician. “This time we have no tickets.” Again the epidemiologists were amazed, but they had little time to ponder it because the inspector was coming down the carriage. The epidemiologists dashed off into the toilet, and soon enough there was a knock on the door. “Tickets please”, they heard, and passed their ticket under the door. The statisticians took the ticket and went off to their own toilet!

The moral of the story being “never use a statistical technique that you don’t understand”.

All this preamble goes by way of saying: data anonymisation isn’t something that I know a great deal about, but I had some ideas and wanted to get feedback from you.

Any personal data of any importance needs to respect the privacy of the people it represents. Data containing financial or medical details in particular should not be exposed for public consumption (at least if you want people to continue providing you with their data). Anonymising data is an important concept in achieving this privacy.

While this is something you need to think about through the whole data lifecycle (from creating it, to storing it – probably in a database – through analysing it, and possibly publishing it) this post focuses on the analysis phase. At this stage, you data is probably in a data frame form, with some identifying columns that need to be anonymised, and some useful values that need to be preserved. Here’s some made-up data, in this case pacman scores of the Avengers.

pacman <- data.frame(
  id                = LETTERS[c(1, 2, 2, 2, 3, 4, 5, 6)],
  first_name        = c("Steve","Tony", 3), "Natasha", "Clint", "Bruce", "Thor"),
  last_name         = c("Rogers","Stark", 3), "Romanoff", "Barton", "Banner", NA),
  alias             = c("Captain America","Iron Man", 3), "Black Widow", 
                        "Hawkeye", "The Hulk", "Thor"),
  gender            = rep(c("Male", "Female", "Male"), times = c(4, 1, 3)),
  pacman_score      = c(round(rlnorm(7, 9, 3), -1), 3333360),
  stringsAsFactors  = FALSE
cols_to_anon <- c("first_name", "last_name", "alias") 

(Naturally, Thor has godlike pacman abilities and achieves a perfect score.) There are two main ways of making data anonymous: removing or obfuscating the personal information, or aggregating it so you only provide summary data.

R has endless ways of aggregating data, tapply and the plyr package should be enough to get you started. This aggregation should be done as late in the day as possible, since summary data is in general less useful than raw data. The rest of the post focuses on removing or obfuscated personal info.

Method 1: Strip personal info columns

If you have an ID column, then the first obvious solution is it simply strip out the columns that reveal identifying information.

  first_name <- NULL
  last_name <- NULL
  alias <- NULL
Method 2: Create an ID column

If there is no ID column, or you don’t want to reveal it (since it gives information about your database, you need an alternative. You can create such an ID column by combining the identifying data into a single factor, then using the underlying integer code as an ID.

simple_id <- function(data, cols_to_anon)
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
pacman$method2_id <- simple_id(pacman, cols_to_anon)  

This is easy, but has the disadvantage that when your dataset is inevitably updated (by adding or removing rows), regenerating the ids will assign different numbers to your rows. It would be useful if you got the same answer for a row regardless of the state of the rest of your dataset.

Method 3: Use digest package to create the ids

The digest package creates hashes of values, which does exactly this.

anonymise <- function(data, cols_to_anon, algo = "sha256")
  if(!require(digest)) stop("digest package is required") 
  to_anon <- subset(data, select = cols_to_anon)
  unname(apply(to_anon, 1, digest, algo = algo))

pacman$method3_id <- anonymise(pacman, cols_to_anon)

(Try adding, deleting or reordering rows to check that you get the same IDs.) This is good enough for most purposes, but for high security cases it’s important to note two caveats. The description of the digest package notes that

this package is not meant to be deployed for cryptographic purposes for which more comprehensive (and widely tested) libraries such as OpenSSL should be used.

Secondly, applying a cryptocraphic hash to the actual values leaves them vulnerable to a rainbow table attack. A rainbow table is a table of all possible strings and their hashes. The attack means that (as long as the string is in the table) breaking the encryption just means looking up the hash in a table. The defense against this is to add some random junk, called “salt”, to the strings that you are encrypting. If you add enough junk, it will be longer than the values in the rainbow table, so you’ve escaped.

generate_salt <- function(data, cols_to_anon, n_chars = 20)
  index <- simple_id(data, cols_to_anon)
  n_indicies <- length(unique(index))   
  chars <- rawToChar(as.raw(32:126), multiple = TRUE)
  x <- replicate(n_indicies, paste(sample(chars, n_chars, replace = TRUE), collapse = ""))

pacman$salt <- generate_salt(pacman, cols_to_anon)
pacman$method4_id <- anonymise(pacman, c(cols_to_anon, "salt")) 

Of course, there’s a problem with this that you may have spotted. Salt is randomly generated, so if you update your dataset, as we discussed above, then you’ll get different salt. (Setting the random seed doesn’t help if you are generating different amounts of salt.) At this point, you might as well just use method 1 or 2, since they are easier.

So the problem of how to create truly secure anonymous data in R isn’t completely solved, for me at least. Let me know in the comments if you have any better ideas.