
Archive for August, 2011

Nomograms everywhere!

At useR!, Jonty Rougier talked about nomograms, a once popular visualisation that has fallen by the wayside with the rise of computers. I’d seen a few before, but hadn’t understood how they worked or why you’d want to use them. Anyway, since that talk I’ve been digging around in biology books from the 60s and 70s, and it seems they are full of them. So for those of you who haven’t seen the talk, here’s how they work.

A basic nomogram consists of three scales. By reading off known values from two of the scales, you can estimate a third one. Here’s an example I found in the ICRP’s reference manual.

It’s difficult to measure people’s skin surface area, but height and bodyweight are very straightforward. To use the nomogram, you place a ruler (or other straight edge) across the height* and weight scales and read off the point where the ruler crosses the surface area scale. I’m 177cm tall and weigh 72kg, so according to this, my estimated skin surface area is 1.89m².

Of course nowadays, the standard way to solve this sort of problem is to write a function. Jonty suggested that the main modern use of nomograms is in fieldwork situations, where computers aren’t handily available. (His case study was Kenyan vets trying to estimate the weight of donkeys from their height and girth.)
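For example, here’s what such a function might look like, using the Du Bois formula for body surface area. (I don’t know which formula this particular nomogram encodes, so treat this as an illustrative sketch rather than a reproduction.)

```
# Du Bois formula: estimate body surface area (m^2) from height (cm)
# and weight (kg). Illustrative only; the nomogram may use a
# different formula.
surface_area <- function(height, weight)
{
  0.007184 * height ^ 0.725 * weight ^ 0.425
}
surface_area(177, 72)  # about 1.89
```

Reassuringly, this agrees with the value I read off the nomogram.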

Altman and Dittmer’s Respiration and Circulation has many more pretty nomograms. I was particularly impressed by those on blood pH, reproduced below for your pleasure.

Your homework is to dig out a pre-1980 textbook and hunt for more nomograms.

*Gruesomely, the fact that the scale is labelled “length” rather than “height” makes me suspect that the bodies that provided the data were in a permanent lying down position – that is, they were corpses.

Anonymising data

23rd August, 2011

There are only three known jokes about statistics in the whole universe, so to complete the trilogy (see here and here for the other two), listen up:

Three statisticians are on a train journey to a conference, and they get chatting to three epidemiologists who are also going to the same place. The epidemiologists are complaining about the ridiculous cost of train tickets these days. At this, one of the statisticians pipes up “it’s actually quite reasonable if you use our method – we’ve just got one ticket between the three of us”.

The epidemiologists are amazed. “But how do you get away with that?”, they cried in unison.

“Watch and learn” replied a statistician.

A few minutes later, the inspector’s voice was heard down the carriage. At that, the statisticians bundled themselves into the toilet. The inspector knocked on the door. “Tickets please”, she said, and the statisticians passed their single ticket under the door. The inspector stamped it and returned it, and the statisticians made it to the conference.

On the way back, the statisticians again met the epidemiologists. This time, the epidemiologists proudly displayed their single ticket. “Aha”, said a statistician. “This time we have no tickets.” Again the epidemiologists were amazed, but they had little time to ponder it because the inspector was coming down the carriage. The epidemiologists dashed off into the toilet, and soon enough there was a knock on the door. “Tickets please”, they heard, and passed their ticket under the door. The statisticians took the ticket and went off to their own toilet!

The moral of the story being “never use a statistical technique that you don’t understand”.

All this preamble goes by way of saying: data anonymisation isn’t something that I know a great deal about, but I had some ideas and wanted to get feedback from you.

Any personal data of any importance needs to respect the privacy of the people it represents. Data containing financial or medical details in particular should not be exposed for public consumption (at least if you want people to continue providing you with their data). Anonymising data is an important concept in achieving this privacy.

While this is something you need to think about through the whole data lifecycle (from creating it, to storing it – probably in a database – through analysing it, and possibly publishing it), this post focuses on the analysis phase. At this stage, your data is probably in a data frame, with some identifying columns that need to be anonymised, and some useful values that need to be preserved. Here’s some made-up data, in this case pacman scores of the Avengers.

```
pacman <- data.frame(
  id               = LETTERS[c(1, 2, 2, 2, 3, 4, 5, 6)],
  first_name       = c("Steve", rep.int("Tony", 3), "Natasha", "Clint", "Bruce", "Thor"),
  last_name        = c("Rogers", rep.int("Stark", 3), "Romanoff", "Barton", "Banner", NA),
  alias            = c("Captain America", rep.int("Iron Man", 3), "Black Widow",
                       "Hawkeye", "The Hulk", "Thor"),
  gender           = rep(c("Male", "Female", "Male"), times = c(4, 1, 3)),
  pacman_score     = c(round(rlnorm(7, 9, 3), -1), 3333360),
  stringsAsFactors = FALSE
)
cols_to_anon <- c("first_name", "last_name", "alias")
```

(Naturally, Thor has godlike pacman abilities and achieves a perfect score.) There are two main ways of making data anonymous: removing or obfuscating the personal information, or aggregating it so you only provide summary data.

R has endless ways of aggregating data; `tapply` and the `plyr` package should be enough to get you started. This aggregation should be done as late in the day as possible, since summary data is in general less useful than raw data. The rest of the post focuses on removing or obfuscating personal info.
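For instance, a minimal sketch of the aggregation route using base R’s `aggregate` (made-up numbers, obviously; `tapply` or `plyr`’s `ddply` work just as well):

```
# Mean pacman score by gender: the summary reveals nothing about
# which individual achieved which score.
scores <- data.frame(
  gender       = c("Male", "Male", "Female"),
  pacman_score = c(1000, 3000, 5000)
)
(means <- aggregate(pacman_score ~ gender, data = scores, FUN = mean))
```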

Method 1: Strip personal info columns

If you have an ID column, then the first obvious solution is to simply strip out the columns that reveal identifying information.

```
within(pacman,
{
  first_name <- NULL
  last_name <- NULL
  alias <- NULL
})
```
Method 2: Create an ID column

If there is no ID column, or you don’t want to reveal it (since it gives information about your database), you need an alternative. You can create such an ID column by combining the identifying data into a single factor, then using the underlying integer code as an ID.

```
simple_id <- function(data, cols_to_anon)
{
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
  as.integer(factor(ids))
}
pacman$method2_id <- simple_id(pacman, cols_to_anon)
```

This is easy, but has the disadvantage that when your dataset is inevitably updated (by adding or removing rows), regenerating the IDs will assign different numbers to your rows. It would be useful to get the same answer for a row regardless of the state of the rest of your dataset.
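To see the problem in action, here’s a toy demonstration with hypothetical names, repeating the `simple_id` definition so the chunk stands alone:

```
# The same person gets a different ID once a new row appears,
# because the factor levels are re-derived each time.
simple_id <- function(data, cols_to_anon)
{
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
  as.integer(factor(ids))
}
d1 <- data.frame(name = c("Tony", "Steve"))
d2 <- data.frame(name = c("Bruce", "Tony", "Steve"))  # one row added
id_before <- simple_id(d1, "name")  # Tony gets ID 2 (levels are alphabetical)
id_after  <- simple_id(d2, "name")  # Tony now gets ID 3
```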

Method 3: Use digest package to create the ids

The `digest` package creates hashes of values, which does exactly this.

```
anonymise <- function(data, cols_to_anon, algo = "sha256")
{
  if(!require(digest)) stop("digest package is required")
  to_anon <- subset(data, select = cols_to_anon)
  unname(apply(to_anon, 1, digest, algo = algo))
}

pacman$method3_id <- anonymise(pacman, cols_to_anon)
```

(Try adding, deleting or reordering rows to check that you get the same IDs.) This is good enough for most purposes, but for high security cases it’s important to note two caveats. The description of the digest package notes that

this package is not meant to be deployed for cryptographic purposes for which more comprehensive (and widely tested) libraries such as OpenSSL should be used.

Secondly, applying a cryptographic hash to the actual values leaves them vulnerable to a rainbow table attack. A rainbow table is a table of possible strings and their hashes. As long as a string appears in the table, breaking the encryption just means looking up its hash. The defence against this is to add some random junk, called “salt”, to the strings that you are encrypting. If you add enough junk, the salted strings will be longer than anything in the rainbow table, so you’ve escaped.

```
generate_salt <- function(data, cols_to_anon, n_chars = 20)
{
  index <- simple_id(data, cols_to_anon)
  n_indices <- length(unique(index))
  chars <- rawToChar(as.raw(32:126), multiple = TRUE)
  x <- replicate(n_indices, paste(sample(chars, n_chars, replace = TRUE), collapse = ""))
  x[index]
}

pacman$salt <- generate_salt(pacman, cols_to_anon)
pacman$method4_id <- anonymise(pacman, c(cols_to_anon, "salt"))
```

Of course, there’s a problem with this that you may have spotted. Salt is randomly generated, so if you update your dataset, as we discussed above, then you’ll get different salt. (Setting the random seed doesn’t help if you are generating different amounts of salt.) At this point, you might as well just use method 1 or 2, since they are easier.
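One possible compromise, and I should stress this is my speculation rather than an established recipe: keep a single secret string outside the dataset and append it to each value before hashing. The `anonymise_with_secret` function below is hypothetical, not part of `digest`; the IDs it produces are stable across updates, but the scheme is only as strong as your ability to keep the secret safe.

```
# Speculative sketch: hash each value combined with a fixed secret
# kept outside the dataset. Stable across updates, but only as
# secure as the secret itself.
library(digest)
anonymise_with_secret <- function(values, secret)
{
  sapply(values, function(x) digest(paste(x, secret)), USE.NAMES = FALSE)
}
id1 <- anonymise_with_secret(c("Tony Stark", "Steve Rogers"), "wibble")
id2 <- anonymise_with_secret(c("Bruce Banner", "Tony Stark"), "wibble")
id1[1] == id2[2]  # TRUE: the same person always gets the same ID
```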

So the problem of how to create truly secure anonymous data in R isn’t completely solved, for me at least. Let me know in the comments if you have any better ideas.

More useless statistics

22nd August, 2011

Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.

The data set is available in R as `anscombe`, but it requires a little reshaping to be useful.

```
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))
```

Note the use of `gl` to autogenerate factor levels.
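If you haven’t met `gl` before, a quick illustration:

```
# gl(n, k) creates a factor with n levels, each repeated k times, so
# gl(4, nrow(anscombe)) labels each stacked copy of the data as a group.
gl(2, 3)
# [1] 1 1 1 2 2 2
# Levels: 1 2
```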

So we have four sets of x-y data, which we can easily calculate summary statistics from using `ddply` from the `plyr` package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.

```
library(plyr)
(stats <- ddply(anscombe2, .(group), summarize,
  mean         = mean(y),
  std_dev      = sd(y),
  correlation  = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect  = lm(y ~ x)$coefficients[2]
))

  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
```

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

```
library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)
```

Each dataset is really different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.

useR2011 highlights

18th August, 2011

useR has been exhilarating and exhausting. Now it’s finished, I wanted to share my highlights.

10. My inner twelve year old schoolgirl swooning and fainting with excitement every time I chatted with a member of R-core.

9. Patrick Burns declaring that his company consists of himself and his two cats. And that one of the cats keeps changing the settings on his mail reader to spite him.

8. Søren Højsgaard and Robert Goudie both patiently answering my bazillion stupid questions on Bayesian networks.

7. Audience members high-fiving each other during my talk.

6. Peter Baker coming up with ideas for a don’t-repeat-yourself workflow, so I can spend more time doing analysis that matters.

5. Peter Baker coming up with ideas for a don’t-repeat-yourself workflow. Oh, wait.

4. Jason and Tobias from OpenAnalytics talking about lab automation, so I can get those darned chemists off my back. (Just kidding, chemists; I love your data.)

3. Blogist Tal Galili volunteering as my speaking coach. (The big secret: talking really slowly to yourself before a presentation stops you overloading on adrenalin before you get up.)

2. Jonathan Rougier getting the audience to go “ah” every time he mentioned donkeys. And getting me to understand what the point of nomograms is.

1. Ben French teaching me that there is a third joke about stats. (I’ll tell you the other two another time.) It goes like this:

An engineer, a chemist and a statistician are staying in a hotel. (It’s a triple room, very cosy.) The first night they are there, a fire breaks out and wakes them up. The engineer gets up, grabs the fire extinguisher and puts out the fire. Later, they’re all woken by another fire (very dodgy health and safety). The chemist thinks “the fire reaction requires oxygen”, grabs a fire blanket and smothers the flames until they go out. Even later, the engineer and the chemist are woken by the statistician lighting a series of fires in the corner. “What are you doing?”, they cry in unison. “Increasing the sample size”, replies the statistician.

useR2011 Easy interactive ggplots talk

17th August, 2011

I’m talking tomorrow at useR! on making ggplots interactive with the gWidgets GUI framework. For those of you at useR, here is the code and data, so you can play along on your laptops. For everyone else, I’ll make the slides available in the next few days so you can see what you missed. Note that for confidentiality reasons, I’ve added some random junk to the sample values, so please don’t use this data for actual research. (If you are interested in chemical exposure data, contact the lab.)

Chromium data. Nickel data. Code for examples 1 to 4. Once we introduce example 5, the rest of the code needs a little reworking.

Update: The permissions on the code files were blocking downloads from people who aren’t me. My bad. It should be fixed now.

Another update: By popular demand, here are the code chunks.

```
library(ggplot2)
library(gWidgetstcltk)

p <- ggplot(chromium, aes(air, bm)) +
  geom_point()

win_ctrls <- gwindow("Plot controls 1-4")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
  Linear = "identity",
  Log    = "log10"
)

frm_scale_trans_y <- gframe(
  "Y scale transformation",
  container = grp_ctrls,
  expand = TRUE
)

cmb_scale_trans_y <- gcombobox(
  names(available_scales),
  container = frm_scale_trans_y,
  handler = function(h, ...)
  {
    scale_trans_y <- available_scales[svalue(h$obj)]
    p <<- p +
      scale_y_continuous(
        trans = scale_trans_y
      )
    print(p)
  }
)

frm_scale_trans_x <- gframe(
  "X scale transformation",
  container = grp_ctrls,
  expand = TRUE
)

cmb_scale_trans_x <- gcombobox(
  names(available_scales),
  container = frm_scale_trans_x,
  handler = function(h, ...)
  {
    scale_trans_x <- available_scales[svalue(h$obj)]
    p <<- p +
      scale_x_continuous(
        trans = scale_trans_x
      )
    print(p)
  }
)

#2 Changing facets
facet_choices <- list(
  None = ". ~ .",
  "RPE" = ". ~ rpe",
  "Welding type" = ". ~ welding.type",
  "RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
  "Faceting",
  container = grp_ctrls,
  expand = TRUE
)

cmb_facets <- gcombobox(
  names(facet_choices),
  container = frm_facets,
  handler = function(h, ...)
  {
    facet_formula <- facet_choices[[svalue(h$obj)]]
    p <<- p +
      facet_grid(facet_formula)
    print(p)
  }
)

#3 Changing the dataset
frm_datasets <- gframe(
  "Datasets",
  container = grp_ctrls,
  expand = TRUE
)

cmb_datasets <- gcombobox(
  c("chromium", "nickel"),
  container = frm_datasets,
  handler = function(h, ...)
  {
    p <<- p %+% get(svalue(h$obj))
    print(p)
  }
)

#4 Changing the title
frm_title <- gframe(
  "Plot title",
  container = grp_ctrls,
  expand = TRUE
)

txt_title <- gedit(
  p$options$title,
  container = frm_title,
  handler = function(h, ...)
  {
    p <<- p + opts(title = svalue(txt_title))
    print(p)
  }
)
```
```
library(ggplot2)
library(gWidgetstcltk)

p <- ggplot(chromium, aes(air, bm)) +
  geom_blank()

print_p <- function()
{
  pp <- get("p", envir = globalenv())
  if(svalue(chk_points)) pp <- pp + geom_point()
  if(svalue(chk_smooth)) pp <- pp + geom_smooth()
  print(pp)
}

win_ctrls <- gwindow("Plot controls 1-5")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
  Linear = "identity",
  Log    = "log10"
)

frm_scale_trans_y <- gframe(
  "Y scale transformation",
  container = grp_ctrls,
  expand = TRUE
)

cmb_scale_trans_y <- gcombobox(
  names(available_scales),
  container = frm_scale_trans_y,
  handler = function(h, ...)
  {
    scale_trans_y <- available_scales[svalue(h$obj)]
    p <<- p +
      scale_y_continuous(
        trans = scale_trans_y
      )
    print_p()
  }
)

frm_scale_trans_x <- gframe(
  "X scale transformation",
  container = grp_ctrls,
  expand = TRUE
)

cmb_scale_trans_x <- gcombobox(
  names(available_scales),
  container = frm_scale_trans_x,
  handler = function(h, ...)
  {
    scale_trans_x <- available_scales[svalue(h$obj)]
    p <<- p +
      scale_x_continuous(
        trans = scale_trans_x
      )
    print_p()
  }
)

#2 Changing facets
facet_choices <- list(
  None = ". ~ .",
  "RPE" = ". ~ rpe",
  "Welding type" = ". ~ welding.type",
  "RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
  "Faceting",
  container = grp_ctrls,
  expand = TRUE
)

cmb_facets <- gcombobox(
  names(facet_choices),
  container = frm_facets,
  handler = function(h, ...)
  {
    facet_formula <- facet_choices[[svalue(h$obj)]]
    p <<- p +
      facet_grid(facet_formula)
    print_p()
  }
)

#3 Changing the dataset
frm_datasets <- gframe(
  "Datasets",
  container = grp_ctrls,
  expand = TRUE
)

cmb_datasets <- gcombobox(
  c("chromium", "nickel"),
  container = frm_datasets,
  handler = function(h, ...)
  {
    p <<- p %+% get(svalue(h$obj))
    print_p()
  }
)

#4 Changing the title
frm_title <- gframe(
  "Plot title",
  container = grp_ctrls,
  expand = TRUE
)

txt_title <- gedit(
  p$options$title,
  container = frm_title,
  handler = function(h, ...)
  {
    p <<- p + opts(title = svalue(txt_title))
    print_p()
  }
)

#5 Changing geoms
frm_geoms <- gframe(
  "Geoms",
  container = grp_ctrls,
  expand = TRUE
)

chk_points <- gcheckbox(
  "Points",
  container = frm_geoms,
  handler = function(h, ...) print_p()
)
chk_smooth <- gcheckbox(
  "Smooth curve",
  container = frm_geoms,
  handler = function(h, ...) print_p()
)
```

Stop! (In the name of a sensible interface)

12th August, 2011

In my last post I talked about using the number of lines in a function as a guide to whether you need to break it down into smaller pieces. There are many other useful metrics for the complexity of a function, most notably cyclomatic complexity, which counts the number of different routes that code can take. It’s non-trivial to calculate such a measure, and it seems that nothing is currently available to calculate it for R functions. (The internet is currently on the case.) For now, we’ll use an easier, simpler measure of the complexity of a function: how many times `if`, `ifelse` or `switch` is called.

Let’s take a look at how complex the contents of base R are. First, as in the previous post, we need to retrieve all the functions. Since I seem to be trying to do this regularly, I’m wrapping the code into a function.

```
get_all_fns <- function(pattern = ".+")
{
  fn_names <- apropos(pattern)
  fns <- lapply(fn_names, get)
  names(fns) <- fn_names
  Filter(is.function, fns)
}
fns <- get_all_fns()
```

As before, we use `deparse` to turn the function’s body into an array of strings to examine. This time, we are looking for calls to `if`, `ifelse` or `switch`.

```
get_complexity <- function(fn)
{
  body_lines <- deparse(body(fn))
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  body_lines <- body_lines[grepl(rx, body_lines)]
  length(body_lines)
}
complexity <- sapply(fns, get_complexity)
```

Let’s take a look at the distribution of this complexity measure.

```
library(ggplot2)
hist_complexity <- ggplot(data.frame(complexity = complexity), aes(complexity)) +
  geom_histogram(binwidth = 3)
hist_complexity
```

Zero cases is the most common, which is nice to see, but we have some serial offenders over on the right hand side of the plot. Let’s see who the culprits are.

```
head(sort(complexity, decreasing = TRUE))
       library          arima    help.search       read.DIF         coplot [<-.data.frame
            84             81             71             66             65             63
```

Hmm, it’s the same set of functions from the monster-function list before. This is to be expected in some ways, though it would be nice to have another measure that picks out different dubious functions. One such measure that springs to mind is the number of exceptions that can be thrown. This is quite a subtle measure to interpret, since in general, code should “fail early and fail often”. That is, you want lots of exceptions to catch any problems, and you want them to be thrown as soon as possible, so you don’t waste time calculating things that were going to fail anyway. So more possible exceptions is generally better, but if that many things can go wrong, then your function is probably too complicated.

Finding the number of possible exceptions works exactly the same as our previous example, only this time we look for calls to `stop` and `stopifnot`.

```
get_n_exceptions <- function(fn)
{
  body_lines <- deparse(body(fn))
  flow <- c("stop", "stopifnot")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  body_lines <- body_lines[grepl(rx, body_lines)]
  length(body_lines)
}
n_exceptions <- sapply(fns, get_n_exceptions)
```

Once again we examine the distribution …

```
hist_exceptions <- ggplot(data.frame(n_exceptions = n_exceptions), aes(n_exceptions)) +
  geom_histogram(binwidth = 1)
hist_exceptions
```

and it seems that most code contains no exception throwing code. This is acceptable for non-user facing functions, since user input is the biggest cause of problems.

```
head(sort(n_exceptions, decreasing = TRUE))
      read.DIF        library [<-.data.frame          arima         arima0        glm.fit
            17             16             15             14             13             13
```

The function with the most potential exceptions to throw is `read.DIF`. File handling is notoriously problematic, so that’s fair enough. Load the `survival` package for a better example. The `Surv` function lets you define a censored vector, and it has an interface that’s either really clever or stupidly complicated. You can specify the censoring in many different ways, so the error checking gets rather complicated, and it takes 20 calls to `stop` to prevent disaster.

So when you are writing a function and you see the 20th call to `stop`, that’s a hint that you may need to stop (if you want a sensible interface).

Monster functions (Raaargh!)

It’s widely considered good programming practice to have lots of little functions rather than a few big functions. The reasons behind this are simple. When your program breaks, it’s much nicer to debug a five line function than a five hundred line function. Additionally, by breaking up your code into little chunks, you often find that some of those chunks are reusable in other contexts, saving you re-writing code in your next project. The process of breaking your code down into these smaller chunks is called refactoring.

The concept of a line of code is surprisingly fluid in R. Since you can add whitespace more or less where you like, the same code can take up one line in your editor or hundreds, if you so choose. Assuming that most programmers will write in a reasonably standard way, we can get a rough idea of how many lines there are in an R function by calling `deparse` on its body. `deparse` is less scary than it sounds. Parsing means turning a load of text into something meaningful; thus deparsing means turning something meaningful into a load of text. `deparse` essentially works like `as.character` for expressions. (Actually, you can call `as.character` on expressions, but the results are often dubious.)
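To see what this looks like in practice:

```
# deparse gives one string per line of the pretty-printed body,
# regardless of how the function was originally laid out.
f <- function(x)
{
  y <- x + 1
  y * 2
}
deparse(body(f))
# [1] "{"              "    y <- x + 1" "    y * 2"      "}"
```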

A very interesting question is “how much of base R could do with refactoring into smaller pieces?”. To answer this, our first task is to get all the functions.

```
fn_names <- apropos(".+")
fns <- lapply(fn_names, get)
names(fns) <- fn_names
fns <- Filter(is.function, fns)
```

`apropos` finds all the functions on your search path (i.e., from all the packages that have been loaded). Try this code with a freshly loaded version of R, and again with all your packages loaded. The function below will do that for you.

```
load_all_packages <- function()
{
  invisible(sapply(
    rownames(installed.packages()),
    require,
    character.only = TRUE
  ))
}
```

The number of lines in each function is very straightforward to get from here.

```
n_lines_in_body <- function(fn)
{
  length(deparse(body(fn)))
}
n_lines <- sapply(fns, n_lines_in_body)
```

Let’s take a look at the distribution of those lengths.

```
library(ggplot2)
hist_line_count <- ggplot(data.frame(n_lines = n_lines), aes(n_lines)) +
  geom_histogram(binwidth = 5)
hist_line_count
```

So about half the functions are five lines or less, which is all well and good. Notice that the x-axis extends all the way over to 400 though, so there clearly are some monsters in there.

```
head(sort(n_lines, decreasing = TRUE))
      library         arima   help.search        coplot loadNamespace       plot.lm
          409           328           320           316           305           299
```

So `library` is the number one culprit for being overlong and complicated. In fairness to it, though, it does mostly consist of sub-functions, so a lot of refactoring has clearly been done on it; it’s just that the individual pieces are contained within it rather than elsewhere. `arima` is more of a mess; it looks like the code is so old that no-one dares touch it any more. None of these functions are really bad, though. To see a package that really needs some refactoring work, load up `Hmisc` and rerun this analysis. Now eight of the top ten longest functions come from that package. A quick challenge for you: hunting through other packages, can you find a function that beats Hmisc’s `transcan` at 591 lines?
