Archive for August, 2010

Sweet bar chart o’ mine

30th August, 2010 2 comments

Last week I was asked to visualise some heart rate data from an experiment. The experimentees were clothed in protective suits and made to do a bunch of exercises while various physiological parameters were measured. Including “deep body temperature”. Gross. The heart rates were taken every five minutes over the two and a half hour period. Here’s some R code to make fake data for you to play with. The heart rates rise as the workers are made to do exercise, and fall again during the cooling down period, but it’s a fairly noisy series.
interval <- 5
heart_data <- data.frame(
   time = seq(0, 150, interval)
)
n_data <- nrow(heart_data)
frac_n_data <- floor(.7 * n_data)
heart_data$rate = runif(n_data, 50, 80) +
   c(seq(0, 50, length.out = frac_n_data),
     seq(50, 0, length.out = n_data - frac_n_data))
heart_data$lower <- heart_data$rate - runif(n_data, 10, 30)
heart_data$upper <- heart_data$rate + runif(n_data, 10, 30)

The standard way of displaying a time series (that is, a numeric variable that changes over time) is with a line plot. Here’s the ggplot2 code for such a plot.

plot_base <- ggplot(heart_data, aes(time, rate))
plot_line <- plot_base + geom_line()

Line plot of heart rate data

Using a line isn't always appropriate, however. If you have missing data, or the data are irregular or infrequent, then it is misleading to join the points with a line: other things are happening during the times that you have no data for. ggplot2 automatically breaks lines at missing values (as represented by NA values), but in the case of irregular or infrequent data you don't want any lines at all. In that case, using points rather than lines is the best option, effectively creating a scatterplot.
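Here's a quick way to see the NA behaviour for yourself (a toy series of my own, not the experiment's data):

```r
library(ggplot2)

# a small series with two readings lost in the middle
d <- data.frame(
   time = seq(5, 50, 5),
   rate = c(70, 74, NA, NA, 90, 95, 88, 80, 75, 72)
)
p <- ggplot(d, aes(time, rate)) + geom_line()
p  # the line breaks where the NAs are, leaving two separate runs
```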

plot_point <- plot_base + geom_point()

Scatter plot of heart rate data

Since heart rate can change dramatically over the course of five minutes, the data generated by the experiment should be considered infrequent, and so I opted for the scatterplot approach.

The experimenters, however, wanted a bar chart.

plot_bar <- plot_base +
+++geom_bar(aes(factor(time), rate), alpha = 0.7) +
+++opts(axis.text.x = theme_text(size = 8))

Bar chart of heart rate data

I hadn't considered this use of a bar chart before, so it was interesting to think about the pros and cons relative to using points. First up, the bar chart does successfully communicate the numeric values, and the fact that they are discrete. The big difference is that the bars are forced to stretch down to zero, squeezing the data into a small range near the top of the plot. Whether or not you think this is a good thing depends upon the questions you want to answer about the heart rates.

If you want to be able to say “the maximum heart rate was twice as fast as the minimum heart rate”, then bars are great for this. Comparing lengths is what bars are made for. If on the other hand, you want to focus on the relative differences between data (“how much does the heart rate go up by when the subject did some step-ups?”), then points make more sense, since you are zoomed in to the range of the data.
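If you want a zero baseline but don't want bars, one compromise (my suggestion, not part of the original analysis) is to keep the points and simply widen the y limits:

```r
library(ggplot2)

# toy data standing in for the heart rates
d <- data.frame(time = seq(5, 150, 5), rate = runif(30, 60, 120))

p_zoomed <- ggplot(d, aes(time, rate)) + geom_point()
p_zero <- p_zoomed + expand_limits(y = 0)  # zero baseline, no bars
```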

There are a couple of other downsides to using a bar chart. Bars have a much lower data-ink ratio than points. Further, if we want to add a confidence region to the plot, it gets very busy with bars. Compare

plot_point_region <- plot_point +
   geom_segment(
      aes(x = time, xend = time, y = lower, yend = upper),
      size = 2, alpha = .4)

plot_bar_region <- plot_bar +
   geom_segment(
      aes(x = as.numeric(factor(time)),
          xend = as.numeric(factor(time)),
          y = lower,
          yend = upper),
      size = 2, colour = "grey30")

Bar chart of heart rate data, with confidence

Scatter plot of heart rate data, with confidence

The big deal-breaker for me is that a bar chart seems semantically wrong. Bar charts are typically used to visualise a numeric variable split over several categories. That isn't the case here: time is not categorical.

Something about this analysis was bugging me though, and I started wondering “Is it ever appropriate to use bars in a time series?”. Last night, as I was watching Guns ‘N’ Roses headline the Leeds Festival, the answer came to me. GNR were at least an order of magnitude more awesome than expected, but damn, some of those power ballads go on a long time, which allowed my mind to wander. Here’s their set list, with song lengths. (Solos and instrumentals omitted, and I wasn’t standing there with a stopwatch so data are taken from the album versions.)

songs <- c(
   "Chinese Democracy",
   "Welcome To The Jungle",
   "It's So Easy",
   "Mr. Brownstone",
   "Sorry",
   "Live And Let Die",
   "This I Love",
   "Rocket Queen",
   "Street Of Dreams",
   "You Could Be Mine",
   "Sweet Child O' Mine",
   "November Rain",
   "Knockin' On Heaven's Door",
   "Nightrain",
   "Paradise City"
)

albums <- c(
   "Appetite for Destruction",
   "G 'N' R Lies",
   "Use Your Illusion I",
   "Use Your Illusion II",
   "\"The Spaghetti Incident?\"",
   "Chinese Democracy"
)

gnr <- data.frame(
   song = ordered(songs, levels = songs),
   length = c(283, 274, 203, 229, 374, 184, 334, 373, 286, 344, 355, 544, 336, 269, 406),
   album = ordered(albums[c(6, 1, 1, 1, 6, 3, 6, 1, 6, 4, 1, 3, 4, 1, 1)], levels = albums)
)

plot_gnr <- ggplot(gnr, aes(song, length, fill = album)) +
   geom_bar() +
   opts(axis.text.x = theme_text(angle = 90, hjust = 1))

Guns 'N' Roses set list

Here we have a "categorical time series". The data are ordered in time, but form discrete chunks. As a bonus, the album colouring tells you which tunes have stood the test of time. In this case, the band's debut Appetite for Destruction was played even more than the current miracle-it-arrived-at-all Chinese Democracy. G 'N' R Lies and "The Spaghetti Incident?", by contrast, didn't feature at all.


World fame awaits

28th August, 2010 Leave a comment

After accepting the paper last November, Occupational and Environmental Medicine have finally published the paper that I co-authored: Trends in blood lead levels in UK workers, 1995 to 2007. The gist of it is that lead is nasty, toxic stuff and we wanted to know how much of it workers get exposed to. Alongside lead, we looked at haemoglobin levels (which supposedly decrease with increased lead levels) and zinc protoporphyrin (ZPP for short, which supposedly increases in cases of lead poisoning). The main analysis is a linear mixed effects model, which is a variation on a bog-standard regression, letting us quantify how much of the variation in lead levels was due to differences within a given person, and how much was due to differences between people. My favourite bit of the paper is Figure 2B. Given a squint and a dirty mind, it's full-on Rorschach rude.

zpp vs lead, penis-based correlation

Oh (de)bugger!

26th August, 2010 1 comment

By number of questions asked, R passed MATLAB for the first time on Stack Overflow today. Thus it seems an appropriate time to write my first R-based post.

This post concerns what to do when your R code goes pear-shaped. Back in June there were a couple of very good videos on R debugging that came out of an R meetup in New York. Jay Emerson talked about basic debugging functions like print and browser, and Harlan Harris talked about more advanced techniques like trace and debug. These R meetups sound like a great idea, but I suspect that we don't have the critical mass of R users here in Buxton, UK. I digress …

There are two obvious cases where you need to debug things. If you are lucky, you have the case of an error being thrown. It sounds like it should be the worst case, but at least you get to know where the problem is occurring; the more difficult situation is simply getting the wrong answer.

When I get an error, then traceback() is my usual instinctive response to find the location of the error. If the ggplot2 package is loaded, then tr() provides a way to save a few keystrokes. This function isn’t infallible though, and seems to have particular trouble with code in try blocks. To see this, compare

throw_error <- function() stop("!!!")
throw_error()
traceback()

throw_error_in_try <- function() try(stop("!!!"))
throw_error_in_try()
traceback() # Same as before; the new error did not register
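Though traceback misses errors swallowed by try, all is not lost: the object returned by try keeps hold of the condition, so you can still dig out the message. (This is a workaround of mine, not something from the meetup videos.)

```r
# try() returns a "try-error" object carrying the original condition
res <- try(stop("!!!"), silent = TRUE)
class(res)                                # "try-error"
conditionMessage(attr(res, "condition"))  # "!!!"
```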

In the ‘hard’ case, where you have a wrong answer rather than an error, or where traceback has let you down, you’ll have to step through your code to hunt down the problem. This entails using the debug function, or its graphical equivalent mtrace (from the debug package). I don’t want to spend time on those functions here (another post perhaps), so if you’re desperate to know how they work I recommend Harlan’s video tutorial.

After I’ve found the function that is causing the problem, my next step is usually to stick a call to browser in my code and rerun it. This lets me explore my environment and test out alternative code. The following example is just a wrapper for sum. The nesting gives us a call stack to look at.

first <- function(...) second(...)
second <- function(...) sum(...)
first(2, 4, "6")

The call to traceback tells us where the error is.

2: second(...)
1: first(2, 4, "6")

Let’s suppose that the error wasn’t as obvious as this. (That’s the character input, for those of you asleep at the back.) The traceback output has shown us that the error occurred in the function second. Technically, it occurred in sum, but the contents of that are C code rather than R code and traceback is smart enough to know it is of no use to us. As I described above, the next step is to call browser, just before the problem occurs.

second <- function(...)
{
   browser()
   sum(...)
}

Now when we rerun the code, execution halts at the browser() line, and we get the chance to dig about in the function's environment. In this case, since all the arguments are contained in the ellipsis, we can see everything with list(...). The more common circumstance is to have at least some named arguments in your function call; in those cases, ls.str() is the quickest way to see what is going on. Here, list(...) returns

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] "6"

The real strength of the browser function is that if the problem wasn’t obvious now, we could go on to execute additional code from within the function environment. Nevertheless, there are some limitations with the approach. The first issue is that we need to be able to get at the source code. There are two main cases where we can’t do this: firstly, when external code is called and secondly, when the code is tucked away in a package. The external case is sadly quite insoluble. I don’t know of any easy ways to debug C or Fortran code from R. For debugging in packages, if you have an error being thrown, setting options(error = recover) is an excellent alternative to adding browser statements. If there is no error thrown, then debug and trace are the best solutions, but they are beyond the scope of this post.
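A minimal sketch of the recover workflow, assuming an interactive session (the error-triggering call is left commented out so the sketch can sit in a script harmlessly):

```r
old <- options(error = recover)  # from now on, errors drop into a frame browser

f <- function(x) g(x)
g <- function(x) x + "a"  # non-numeric argument: calling f() would trigger recover()
# f(1)  # uncomment interactively: pick a frame number, then browse as usual

options(old)  # restore the previous error handler when you're done
```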

Another issue is that the problem with our code may not be caused in the same function that the error is thrown in. It could occur higher up the call stack. In that sort of situation, you need a way to see what is happening in all the functions that you’ve called. And that is what I’m going to talk about in part two of this post.


A shortcut to success

23rd August, 2010 2 comments

The MATLAB user interface is laid out in such a way that you spend a fair amount of time with one hand on your mouse. When that happens, toolbar shortcuts are a great way to run chunks of code that you use regularly. The trouble is, in order to set them up, you need to do lots of clicking to interact with the MATLAB UI. This has two big problems. Every time you install a new copy of MATLAB, you need to set up your shortcuts again. Click, click, click. Sigh. It also means that shortcuts are difficult to share with your colleagues/friends/that hot chick you want to impress with your scientific coding skillz.

The solution to this is to generate the shortcuts programmatically. Write your shortcut script once, and you can reuse the code to your heart's content. The trouble is, MATLAB doesn't ship with any tools to manipulate shortcuts. All the details of your shortcuts (that's the toolbar shortcuts, your help browser favourites, and any other shortcuts you care to define) are stored in the file shortcuts.xml in your preferences directory (as returned by prefdir).

There are two ways to write methods to add and remove shortcuts. You can use some XML tools, of which my favourite is the xml_io_tools package by Jaroslaw Tuszynski. The other alternative is to delve into the mostly undocumented MATLAB UI functions. These are largely built out of Java Swing widgets, and while the documentation is sketchy to non-existent, it is possible to utilise their features. I don't want to dwell on the technicalities of these widgets; if you want to know more then Yair Altman's Undocumented Matlab blog is the best place to look.

My shortcut tools package uses this second method. It contains methods for adding, moving and removing toolbar shortcuts and browser favourites, and a few other utility functions. I'm going to discuss two possible use cases for the toolbox, one really simple and another that's slightly more advanced.

First of all, we need some code to be called by the shortcut.  This is a very simple function to tidy up your workspace:

function tidy()
%TIDY Clear workspace and command window; close plots.
close all hidden;
clear all;
clc;

MATLAB secretly ships with a set of icons; I’m using a cross, which is available from R2009a onwards. Feel free to use any icon you like, or leave the icon argument blank to use the standard icon. Adding a shortcut is as easy as calling AddShortcut.

iconPath = fullfile( ...
   matlabroot, ...
   'toolbox', ...
   'shared', ...
   'dastudio', ...
   'resources', ...
   'TTE_delete.gif');   % the cross icon; any icon file in this folder will do
AddShortcut('Tidy', 'tidy();', iconPath);

You should see the shortcut appear in your toolbar. If you decide you don’t want this shortcut, you can remove it with

RemoveShortcuts([], 'Tidy');

(The first parameter requests the category of shortcut, which by default is 'Toolbar Shortcuts'. We don’t want to change this, so we can just leave it empty.)

An idea that was suggested to me by Iram Weinstein was to have different shortcuts available when you work on different projects. This requires a little more setting up, but when you can script it, it’s easy enough to do.  Let’s imagine that you like to work on both cellular automatons and 3D graphics.  Some shortcuts you use will be needed by both projects, but others will be specific to each research area.

First we need to create some categories to contain the shortcuts; one for each project and one to contain the shortcuts common to all your projects.

project1 = 'Cellular Automaton Project';
project2 = '3D Graphics Project';
common = 'Common';

Then we add shortcuts to each category, using AddShortcut, like before. You’ll probably want several shortcuts per project, but for now let’s stick to one shortcut for each.

AddShortcut('Life', 'life();', [], project1);
AddShortcut('Teapot', 'teapotdemo();', [], project2);
AddShortcut('Tidy', 'tidy();', iconPath, common);
AddShortcut('Cellular', 'SetupCellularProject();', [], common);
AddShortcut('Graphics', 'SetupGraphicsProject();', [], common);

In the common section, at the very least you need a shortcut to initialize each project. All that is left is to write the callback functions for those shortcuts.

function SetupCellularProject()
%SETUPCELLULARPROJECT Sets up the Cellular Automaton Project.

% Get rid of work from other projects.
tidy();

% Clear the existing toolbar.
ClearShortcuts();   % (function name from the shortcut tools package)

% Copy the shortcuts from the common category and the project category
% to the toolbar. (The toolbar is the default target location.)
CopyShortcuts('Common');
CopyShortcuts('Cellular Automaton Project');

% Other setup code goes here. You'll probably want a call to cd here to
% move to the project directory, and perhaps addpath or load as well.

SetupGraphicsProject will contain similar contents. Encapsulating this setup code in a function means that you can call it from the command line to get going with your project as well. As a lazy typist, I tend to use shorter function names for these, or at least have a short alias.

That’s it. You can now update your shortcuts at the click of a button.


20th August, 2010 Leave a comment

Anyone who knows anything about presenting data will tell you that pie charts are rubbish.  Even their idiot cousins will tell you that 3D pie charts are even worse.  That got me thinking, “how bad can it get?”.  Sure you can choose garish colour schemes and glossy reflective coatings and transparency, but that has been done before.  By almost every two-bit data-viz charlatan in existence, judging by a Google Image search.  To break new ground in awfulness, we need to enter (cue atmospheric intro)… a new dimension!  In fact, we’re going to make the pie chart travel through time (woo!) with an animation.

The MathWorks, to their eternal shame, include a 3D pie chart function, pie3.  By combining it with their rotate function, we can enter into new levels of crappiness.

First, we draw the pie chart.  This code is tweaked from the example on the pie3 help page.

x = [1 0.25 3 2.5 2 1];
explode = [0 1 0 0 0 1];
h = pie3(x, explode, repmat({''}, size(x)));

3D pie chart

Now we animate it.

n = 180;
mov(n) = getframe;
for i = 1:n
   rotate(h, [20 20], 360 / n);
   mov(i) = getframe;
end
MATLAB ships with the functions movie, to play back your animation, and movie2avi, to export it to video. For more web-friendliness, we're going to create an animated gif using movie2gif from the FEX.

movie2gif(mov, 'pie4.gif', 'DelayTime', 0.5 / n);

And there we have it.  (Click to see it in all its animated glory.)

4D Pie Chart joy

Now, having set the bar suitably low, I promise all future posts will be more useful than this.