Anonymising data

There are only three known jokes about statistics in the whole universe, so to complete the trilogy (see here and here for the other two), listen up:

Three statisticians are on a train journey to a conference, and they get chatting to three epidemiologists who are also going to the same place. The epidemiologists are complaining about the ridiculous cost of train tickets these days. At this, one of the statisticians pipes up “it’s actually quite reasonable if you use our method – we’ve just got one ticket between the three of us”.

The epidemiologists are amazed. “But how do you get away with that?”, they cried in unison.

“Watch and learn” replied a statistician.

A few minutes later, the inspector’s voice was heard down the carriage. At that, the statisticians bundled themselves into the toilet. The inspector knocked on the door. “Tickets please”, she said, and the statisticians passed their single ticket under the door. The inspector stamped it and returned it, and the statisticians made it to the conference.

On the way back, the statisticians again met the epidemiologists. This time, the epidemiologists proudly displayed their single ticket. “Aha”, said a statistician. “This time we have no tickets.” Again the epidemiologists were amazed, but they had little time to ponder it because the inspector was coming down the carriage. The epidemiologists dashed off into the toilet, and soon enough there was a knock on the door. “Tickets please”, they heard, and passed their ticket under the door. The statisticians took the ticket and went off to their own toilet!

The moral of the story being “never use a statistical technique that you don’t understand”.

All this preamble goes by way of saying: data anonymisation isn’t something that I know a great deal about, but I had some ideas and wanted to get feedback from you.

Any personal data of any importance needs to respect the privacy of the people it represents. Data containing financial or medical details in particular should not be exposed for public consumption (at least if you want people to continue providing you with their data). Anonymising data is an important concept in achieving this privacy.

While this is something you need to think about through the whole data lifecycle (from creating it, to storing it – probably in a database – through analysing it, and possibly publishing it) this post focuses on the analysis phase. At this stage, your data is probably in data frame form, with some identifying columns that need to be anonymised, and some useful values that need to be preserved. Here’s some made-up data, in this case pacman scores of the Avengers.

pacman <- data.frame(
  id                = LETTERS[c(1, 2, 2, 2, 3, 4, 5, 6)],
  first_name        = c("Steve", rep.int("Tony", 3), "Natasha", "Clint", "Bruce", "Thor"),
  last_name         = c("Rogers", rep.int("Stark", 3), "Romanoff", "Barton", "Banner", NA),
  alias             = c("Captain America", rep.int("Iron Man", 3), "Black Widow", 
                        "Hawkeye", "The Hulk", "Thor"),
  gender            = rep(c("Male", "Female", "Male"), times = c(4, 1, 3)),
  pacman_score      = c(round(rlnorm(7, 9, 3), -1), 3333360),
  stringsAsFactors  = FALSE
)
cols_to_anon <- c("first_name", "last_name", "alias") 

(Naturally, Thor has godlike pacman abilities and achieves a perfect score.) There are two main ways of making data anonymous: removing or obfuscating the personal information, or aggregating it so you only provide summary data.

R has endless ways of aggregating data; tapply and the plyr package should be enough to get you started. This aggregation should be done as late in the day as possible, since summary data is in general less useful than raw data. The rest of the post focuses on removing or obfuscating personal info.
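For completeness, here is a minimal sketch of the aggregation route using base R only. The scores data frame and its team column are made up for this example; they are not part of the pacman data.

```r
# Hypothetical data for illustration (not the pacman data frame)
scores <- data.frame(
  team  = c("A", "A", "A", "B", "B", "B"),
  score = c(100, 250, 400, 50, 150, 700),
  stringsAsFactors = FALSE
)

# Mean score per team; tapply returns a named array
mean_by_team <- tapply(scores$score, scores$team, mean)
mean_by_team  # A: 250, B: 300

# The same summary as a data frame, which is often easier to publish
summary_by_team <- aggregate(score ~ team, data = scores, FUN = mean)
```

Only the per-team summaries would be released; the individual rows stay private.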

Method 1: Strip personal info columns

If you have an ID column, then the first obvious solution is to simply strip out the columns that reveal identifying information.

within(pacman, 
{
  first_name <- NULL
  last_name <- NULL
  alias <- NULL
})
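
If the columns to drop are already listed in a character vector, the same result can be had without naming each column by hand. This is only a sketch; the people data frame is a cut-down stand-in invented for this example.

```r
# Hypothetical cut-down data for illustration
people <- data.frame(
  id         = c("A", "B", "C"),
  first_name = c("Steve", "Tony", "Natasha"),
  last_name  = c("Rogers", "Stark", "Romanoff"),
  score      = c(100, 250, 400),
  stringsAsFactors = FALSE
)
cols_to_anon <- c("first_name", "last_name")

# Keep every column whose name is not in cols_to_anon
stripped <- people[, !(names(people) %in% cols_to_anon), drop = FALSE]
names(stripped)  # "id" "score"
```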

Method 2: Create an ID column

If there is no ID column, or you don’t want to reveal it (since it gives information about your database), you need an alternative. You can create such an ID column by combining the identifying data into a single factor, then using the underlying integer code as an ID.

simple_id <- function(data, cols_to_anon)
{
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
  as.integer(factor(ids))
}
pacman$method2_id <- simple_id(pacman, cols_to_anon)  

This is easy, but has the disadvantage that when your dataset is inevitably updated (by adding or removing rows), regenerating the ids will assign different numbers to your rows. It would be useful if you got the same answer for a row regardless of the state of the rest of your dataset.
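
To make the problem concrete, here is a small sketch. simple_id is repeated so that the block runs on its own, and the people data frame is a made-up stand-in for the real data.

```r
# simple_id() as defined above, repeated so this block is self-contained
simple_id <- function(data, cols_to_anon)
{
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
  as.integer(factor(ids))
}

# Hypothetical cut-down data for illustration
people <- data.frame(
  first_name = c("Steve", "Tony", "Natasha"),
  last_name  = c("Rogers", "Stark", "Romanoff"),
  stringsAsFactors = FALSE
)
cols <- c("first_name", "last_name")

ids_before <- simple_id(people, cols)

# Add one new person whose name sorts before everyone else's
new_row <- data.frame(first_name = "Bruce", last_name = "Banner", stringsAsFactors = FALSE)
updated <- rbind(people, new_row)
ids_after <- simple_id(updated, cols)

ids_before      # 2 3 1
ids_after[1:3]  # 3 4 2: the same three people now have different IDs
```

The IDs come from the alphabetical order of the factor levels, so a single new name near the start of the alphabet shifts everyone after it.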

Method 3: Use digest package to create the ids

The digest package creates hashes of values, which does exactly this.

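anonymise <- function(data, cols_to_anon, algo = "sha256")
{
  # Fail early with a clear message if digest is not installed
  if(!requireNamespace("digest", quietly = TRUE)) stop("digest package is required")
  to_anon <- subset(data, select = cols_to_anon)
  # Hash each row of identifying values; the same values always give the same hash
  unname(apply(to_anon, 1, digest::digest, algo = algo))
}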

pacman$method3_id <- anonymise(pacman, cols_to_anon)
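
Here is a minimal sketch showing that each row's hash depends only on that row's own values. It assumes the digest package is installed; hash_rows and people are names invented for this example, with hash_rows doing the same job as anonymise.

```r
library(digest)

# Same approach as anonymise(): hash each row of the identifying columns
hash_rows <- function(data, cols)
{
  unname(apply(data[, cols, drop = FALSE], 1, digest, algo = "sha256"))
}

# Hypothetical cut-down data for illustration
people <- data.frame(
  first_name = c("Steve", "Tony", "Natasha"),
  last_name  = c("Rogers", "Stark", "Romanoff"),
  stringsAsFactors = FALSE
)
cols <- c("first_name", "last_name")

ids_original <- hash_rows(people, cols)

# Drop Tony and put Natasha before Steve, then hash again
shuffled <- people[c(3, 1), ]
ids_shuffled <- hash_rows(shuffled, cols)

# TRUE if the remaining people keep the hashes they had before
same_ids <- identical(ids_shuffled, ids_original[c(3, 1)])
```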

(Try adding, deleting or reordering rows to check that you get the same IDs.) This is good enough for most purposes, but for high security cases it’s important to note two caveats. The description of the digest package notes that

this package is not meant to be deployed for cryptographic purposes for which more comprehensive (and widely tested) libraries such as OpenSSL should be used.

Secondly, applying a cryptographic hash to the actual values leaves them vulnerable to a rainbow table attack. A rainbow table is a precomputed table of likely strings and their hashes. The attack means that (as long as the string is in the table) reversing the hash just means looking it up in a table. The defense against this is to add some random junk, called “salt”, to the strings that you are hashing. If you add enough junk, the salted string will be longer than the values in the rainbow table, so you’ve escaped.
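
To see why the lookup is so easy, here is a toy sketch of the attack. It assumes the digest package is installed, and it simplifies things by hashing plain strings with serialize = FALSE rather than whole rows as anonymise does; hash_string, published and candidates are all made up for the example.

```r
library(digest)

# Hash one string directly (serialize = FALSE hashes the characters themselves)
hash_string <- function(x) digest(x, algo = "sha256", serialize = FALSE)

# The "published" data: unsalted hashes of two names
published <- vapply(
  c("Tony Stark", "Natasha Romanoff"),
  hash_string, character(1), USE.NAMES = FALSE
)

# The attacker's table: hashes of names they suspect might be in the data
candidates <- c("Steve Rogers", "Tony Stark", "Peter Parker", "Natasha Romanoff")
lookup <- setNames(candidates, vapply(candidates, hash_string, character(1)))

# Recovering the names is just a table lookup
recovered <- unname(lookup[published])
recovered  # "Tony Stark" "Natasha Romanoff"
```

Adding salt to each string before hashing defeats this lookup, because the attacker's candidates no longer match what was actually hashed.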

generate_salt <- function(data, cols_to_anon, n_chars = 20)
{
  # One salt per distinct individual, so repeated rows share the same salt
  index <- simple_id(data, cols_to_anon)
  n_indices <- length(unique(index))
  # Printable ASCII characters, from space to tilde
  chars <- rawToChar(as.raw(32:126), multiple = TRUE)
  x <- replicate(n_indices, paste(sample(chars, n_chars, replace = TRUE), collapse = ""))
  x[index]
}

pacman$salt <- generate_salt(pacman, cols_to_anon)
pacman$method4_id <- anonymise(pacman, c(cols_to_anon, "salt")) 

Of course, there’s a problem with this that you may have spotted. Salt is randomly generated, so if you update your dataset, as we discussed above, then you’ll get different salt. (Setting the random seed doesn’t help if you are generating different amounts of salt.) At this point, you might as well just use method 1 or 2, since they are easier.
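
The following sketch shows the problem with fixing the seed. simple_id and generate_salt are repeated so the block runs on its own, and the people data frame is a made-up stand-in for the real data.

```r
# Both functions as defined above, repeated so this block is self-contained
simple_id <- function(data, cols_to_anon)
{
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
  as.integer(factor(ids))
}

generate_salt <- function(data, cols_to_anon, n_chars = 20)
{
  index <- simple_id(data, cols_to_anon)
  n_indices <- length(unique(index))
  chars <- rawToChar(as.raw(32:126), multiple = TRUE)
  x <- replicate(n_indices, paste(sample(chars, n_chars, replace = TRUE), collapse = ""))
  x[index]
}

# Hypothetical cut-down data for illustration
people <- data.frame(
  first_name = c("Steve", "Tony", "Natasha"),
  last_name  = c("Rogers", "Stark", "Romanoff"),
  stringsAsFactors = FALSE
)
cols <- c("first_name", "last_name")

set.seed(42)
salt_before <- generate_salt(people, cols)

# Add one person, reset the seed, and generate the salt again
new_row <- data.frame(first_name = "Bruce", last_name = "Banner", stringsAsFactors = FALSE)
updated <- rbind(people, new_row)
set.seed(42)
salt_after <- generate_salt(updated, cols)

# Steve is row 1 in both data frames; compare his salt before and after
steve_salt_unchanged <- salt_before[1] == salt_after[1]
```

Even with the same seed, the new row shifts which random string each person is assigned, so Steve's salt (and therefore his hashed ID) changes.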

So the problem of how to create truly secure anonymous data in R isn’t completely solved, for me at least. Let me know in the comments if you have any better ideas.

  1. fropome
    23rd August, 2011 at 19:13 pm

    I don’t think that this would often be an issue where I work. Usually at least some of the data to be analysed would itself allow individuals to be identified. As such, datasets should be stored in secure locations while this analysis is conducted.
    I’m not sure why you need to create key values from scratch each time. Why not just order the dataset randomly, give each row an incremental number key (1,2,3 etc) and then a new row would get the next number? If you don’t want people to be able to identify these new rows, then of course you’d have to create keys from scratch, but they’d also have to be new.

    Have I missed your point somewhere?

    • 23rd August, 2011 at 22:05 pm

      Maybe I’m just overthinking things and getting muddled. One of the benefits of anonymising is that you need to worry less about secure storage. But my thought process started when I was trying to make a dataset with confidential info publicly available. I started by removing names, then added an ID column that simply numbered each person – as per method 2, and as you just suggested. Then I deleted some rows, and it bugged me a little that my numbering system had gaps.

      I realised that using digest would solve this, but it causes other problems, and the more I read up on cryptography, the harder it seems.

  2. jtt
    24th August, 2011 at 7:07 am

    Like fropome said, anonymizing data is seldom enough, if the data are sensitive, i.e., medical or survey data, where individuals can possibly be identified using, e.g., geographic information (address, postal office, etc.). National institutes that create official statistics do ponder these issues quite a lot, and there are some implementations of the methodology in R, also. See, for example, packages sdcMicro and sdcTable. I now realize that this comment might be slightly off-topic, though.

    • 24th August, 2011 at 11:53 am

      Thanks for the pointer to those packages. I agree that geographic data can make it tricky to truly anonymise data, though I think there are still many datasets where full anonymisation is possible.

  3. Jens
    25th August, 2011 at 13:55 pm

    Richie, you rock. However, there’s a fourth joke on (clinical) statisticians.

    The wife of a statistician has twins. The statistician is delighted and rings the minister, who is delighted, too.
    “Bring them to church on Sunday and we’ll baptize them,” said the minister.
    “Baptize only one.” replied the statistician. “We’ll keep the other as a control.”

  4. 26th August, 2011 at 2:15 am

    Earlier this week I had a situation where I had to anonymise User IDs for a statistical analysis. In fact, I had to give my anonymising code to the client to apply to the User IDs before I could see the data. I wish I had seen your post then because at first I generated UUIDs with the Ruuid package on Bioconductor. The Ruuid package has been deprecated and will soon be removed from Bioconductor; it’s not even on CRAN.

    I ended up using an MD5 hash with the digest package, but I was unaware of applying salt.

    Thanks for the nice write up.
