Archive for October, 2010

Presenting Immer’s barley data

31st October, 2010

Last time I talked about adapting graphs for presentations.  This time I’m putting some of the concepts I discussed there into action, with a presentation of Immer’s barley dataset.  This is a classic dataset, originally published in 1934; in 1993 Bill Cleveland discussed it in his book Visualizing Data, because it appears to contain an error.  Here’s the paper/screen version.

Immer's barley dataset, optimised for paper

Here’s the presentation.

For the record, the presentation was created with Impress and the audio recorded with Audacity. Using these tools, it’s pretty straightforward to make and share an audio presentation.


I’ve corrected the map slide. Some people also asked about the code to draw the plots. First up, I recoded the factors so that site and variety appear in order of increasing yield.

barley$variety <- with(barley, reorder(variety, yield)) 
barley$site <- with(barley, reorder(site, yield))

The code for the graph is straightforward.

ggplot(barley, aes(yield, variety, colour = year)) +
  geom_point() +
  facet_grid(site ~ .)

The presentation version uses facet_wrap(~ site) rather than facet_grid, and the text size is increased with theme_set(theme_grey(24)).
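Putting those pieces together, here’s a sketch of the presentation version (assuming the barley data from the lattice package, and the ggplot2 syntax current at the time of writing):

```r
library(lattice)    # provides Immer's barley dataset
library(ggplot2)

# Reorder site and variety by increasing yield, as above
barley$variety <- with(barley, reorder(variety, yield))
barley$site <- with(barley, reorder(site, yield))

theme_set(theme_grey(24))    # bigger text for the projector
ggplot(barley, aes(yield, variety, colour = year)) +
  geom_point() +
  facet_wrap(~ site)
```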


Adapting graphs for presentations

29th October, 2010

I’ve just finished reading slide:ology by Nancy Duarte. It contains lots of advice about how to convey meaning through aesthetics. The book has a general business-presentation focus, but it got me wondering how to apply the ideas in a scientific context.  Since graphs form a big part of most scientific talks, and since that’s the bit I know best, that’s what I’m going to discuss here.

We start with a basic example using the mtcars dataset that ships with R.

p_basic <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

A scatterplot, designed for the screen or page

Something I’ve been burned by recently is overestimating the size of the projector screen. Although my graphs looked great with my face next to my monitor, the axes were hard to read when projected. A good rule of thumb for text is that your font size should be half the age of the eldest member of your audience, or 30 points, whichever is bigger. For graph axes we can perhaps get away with something a little smaller. Don’t forget that by the time you’ve printed your graph to file and played with it in your presentation software, the font size you picked may bear no relation to its final value. That said, the point remains that you need to make the text bigger.
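As a sketch, that rule of thumb reduces to a one-liner (the function name is my own invention):

```r
# Half the age of the eldest audience member, or 30 points,
# whichever is bigger
recommended_font_size <- function(eldest_age)
{
  max(30, eldest_age / 2)
}

recommended_font_size(50)    # 30
recommended_font_size(80)    # 40
```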

old_theme <- theme_set(theme_grey(base_size = 24))

Increased text size

Reading a graph can be quite an in-depth process, which can overwhelm your audience. Once you’ve shown them the graph in its basic form, you should emphasise interesting features, one at a time. The one-at-a-time part is important – if you want your audience to concentrate, then you shouldn’t distract them with many things at once. Here I’ve increased the size of the outliers in the bottom right-hand corner.

mtcars <- within(mtcars, emph <- wt > 5)

p_emph <- ggplot(mtcars) +
  geom_point(aes(wt, mpg, size = emph)) +
  opts(legend.position = "none") +
  scale_size_manual(values = c(2, 5))

The outliers on the right are emphasised

As well as emphasising values, a useful tool can be to de-emphasise other regions of the graph. Overlaying a translucent white rectangle on such regions does the trick nicely.

x_lower <- floor(min(mtcars$wt))
x_upper <- ceiling(max(mtcars$wt))
y_lower <- floor(min(mtcars$mpg)) - 1
y_upper <- ceiling(max(mtcars$mpg)) + 1

# Two rectangles cover everything except the bottom-right outlier region
p_deemph <- p_emph +
  geom_rect(
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    data = data.frame(
      xmin = c(x_lower, 5),
      xmax = c(5, x_upper),
      ymin = c(y_lower, 16),
      ymax = c(y_upper, y_upper)
    ),
    fill = "white", alpha = 0.6
  ) +
  scale_x_continuous(limits = c(x_lower, x_upper), expand = c(0, 0)) +
  scale_y_continuous(limits = c(y_lower, y_upper), expand = c(0, 0))

The rest of the plot is de-emphasised by a translucent white covering

For de-emphasising more complicated regions, I find it easier to use dedicated image-manipulation software. Paint.NET is my personal zero-cost favourite, but any graphics software should suffice.

Finally, it can be useful to annotate your graph.

p_annotated <- p_deemph +
  geom_text(
    aes(x = x, y = y, label = label),
    data = data.frame(x = 5, y = 13, label = "outliers!"),
    size = 14
  )

Text annotation can assist the focus too

By now, if your audience hasn’t figured out which bit of the graph you want them to look at you’re in trouble. Let me know in the comments if you have any other ideas for how to draw graphs for presentations.


Trading secrets

20th October, 2010

Recently I had the opportunity to do a job swap with one of the guys in the laboratory here at HSL.  I helped out with the mass spectrometry, and James helped me with the data analysis.  Two very useful things came out of this.

Firstly, it’s been very informative to see how the data I get is created.  I tend to assume that the numbers given to me are either correct or mistakes.  The reality, though, is more subtle.  One thing that surprised me was the lengths that the chemists have to go to to make sure that their instruments give sensible answers.  As well as testing urine samples, you need to test blank samples (to clean out the spectrometer’s tubes), standard samples (to calibrate the machine) and quality control samples (to check that the calibration is correct).  Even then, it wasn’t entirely clear that you would get the same answer if you ran the samples twice.

The project was based around testing thallium levels in the general population.  To give an idea of how much we could trust the data, I re-analysed 50 of the samples that James had run.  The tricky bit was the pipetting; there’s a surprising art to avoiding air bubbles.

Comparison of thallium levels between Richie's and James's measurements

As you can see, my results were consistently lower than James’s.  Taking James as the gold standard in mass-spectrometry skill and myself as the worst-case scenario, you can see that we should only trust the results to the nearest order of magnitude.  This is not a trivial exercise – it demonstrates what would happen if James were replaced by an idiot.  (All too possible, depending on what George Osborne says later today.)
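The real measurements aren’t reproduced here, but a sketch with entirely made-up numbers shows the style of comparison (the log scale and the spread are my assumptions, not the actual data):

```r
# Hypothetical data: 50 paired measurements, mine consistently lower
set.seed(1)
james  <- rlnorm(50)                     # stand-in 'gold standard' values
richie <- james * runif(50, 0.2, 0.8)    # lower re-analysed values
plot(james, richie, log = "xy",
  xlab = "James's measurement", ylab = "Richie's measurement")
abline(0, 1)    # line of equality; all the points fall below it
```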

The second really good thing to come out of this was that I managed to drill into James the importance of manipulating data with code instead of manually editing spreadsheets.  He in turn passed on this message when we presented our findings to the lab.  (Main finding: no-one is about to die of thallium poisoning.)  After the presentation, one of our toxicologists came up to me and said

“I finally get it.  I understand why mathematicians keep saying that you shouldn’t use Excel.  It’s because in order for your work to be reproducible and auditable, you need the trail of code to see what you’ve done.”

Major win.

Two amigos: follow up

13th October, 2010

Brett and Jiro have announced the results of the competition to make a Bob-free image. There were five entries, two prizes and … I didn’t win either. Still, it was a fun challenge and a useful learning experience, so I’m consoling myself with clichés like “it’s not the winning that’s important but the taking part”. I’m certainly not using MATLAB to construct a voodoo-doll image of Brett and Jiro.
jiro and brett with pins in their heads

%% Read in image and display
theAmigos = imread('the amigos better blur.jpg');
image(theAmigos)

%% Add lines
pinColour = [.5 .5 .5];

xcoords = { ...
   [130 180] ...
   [132 182] ...
   [136 184] ...
   [140 186] ...
   [148 190] ...
   [165 195] ...
   [182 200] ...
   [200 205] ...
   [215 214] ...
   [230 223] ...
   [243 228] ...
   [255 234] ...
   [270 237] ...
   [283 244] ...
   [295 247] ...
   [300 246] ...
   [303 248] ...
   [465 515] ...
   [465 516] ...
   [466 517] ...
   [469 519] ...
   [475 522] ...
   [487 526] ...
   [505 534] ...
   [528 540] ...
   [548 546] ...
   [567 551] ...
   [588 554] ...
   [606 557] ...
   [621 560] ...
   [628 563] ...
   [633 566] ...
   [633 567] ...
   [634 568]};

ycoords = { ...
   [295 300] ...
   [275 290] ...
   [260 280] ...
   [240 275] ...
   [225 274] ...
   [220 274] ...
   [215 273] ...
   [212 273] ...
   [212 273] ...
   [214 273] ...
   [217 274] ...
   [221 274] ...
   [230 275] ...
   [240 277] ...
   [250 280] ...
   [275 285] ...
   [290 292] ...
   [320 322] ...
   [305 315] ...
   [288 310] ...
   [272 304] ...
   [253 300] ...
   [240 296] ...
   [233 292] ...
   [230 291] ...
   [230 291] ...
   [232 292] ...
   [236 294] ...
   [246 297] ...
   [262 300] ...
   [280 302] ...
   [296 307] ...
   [309 312] ...
   [320 320]};

xstart = cellfun(@(x) x(1), xcoords);
ystart = cellfun(@(x) x(1), ycoords);

hold on
cellfun(@(x, y) line(x, y, 'Color', pinColour), xcoords, ycoords);
arrayfun(@(x, y) plot(x, y, '.', 'Color', pinColour), xstart, ystart);
hold off

%% Remove the extra bits created by plot calls and write to file
set(gca, 'Visible', 'off')
set(gca, 'Position', [0 0 1 1])

print(gcf, '-djpeg', 'the amigos voodoo.jpg')

Creating GUIs in R with gWidgets

6th October, 2010

The gWidgets framework is a way of creating graphical user interfaces in a toolkit-independent way.  That means that you can choose between tcl/tk, Gtk, Java or Qt underneath the bonnet. There’s also a web version based upon RApache and ExtJS. Since the code is the same in each case, you can change your mind and swap toolkits at a later date, without having to rewrite everything.  Different versions of the toolkit are in different states of development; Gtk is the most complete, but the tcl/tk and Java versions are usable. The web version has had a recent rewrite, which I haven’t used, so I can’t vouch for its status.  Finally, the Qt version is still experimental (and not yet available on CRAN). Personally, I use the tcl/tk version, since all the necessary components ship with the Windows edition of R.

The framework is fairly high level, making it quick for prototyping user interfaces.  The drawback is that you don’t get quite as much control over the styling of your interface.  If you need finer control, you may prefer one of the lower-level packages: RGtk2, tcltk or rJava.  In those cases, you will lose the toolkit independence.

To learn how gWidgets works, we’ll build a dialog box with controls to upload a tab delimited file. To begin, we load the necessary packages.

library(gWidgetstcltk) #or gWidgetsGtk2 or gWidgetsrJava or gWidgetsWWW or gWidgetsQt

The textboxes and checkboxes and so forth that we need are known as widgets (hence “gWidgets”). They need to be contained inside a window, which we create using the function gwindow.

win <- gwindow("Tab delimited file upload example")

By default, the widgets will be stacked up vertically. We can create groups of widgets that are stacked horizontally with ggroup (which is a widget in itself). Notice that all widgets must specify their container; in this case it’s just the window.

grp_name <- ggroup(container = win)

A glabel is a widget that represents a text label. Notice that it is contained inside the group we just created.

lbl_data_frame_name <- glabel(
  "Variable to save data to: ",
  container = grp_name
)

A gedit is a single line textbox. (Not to be confused with a gtext, which is a multiline textbox.)

txt_data_frame_name <- gedit("dfr", container = grp_name)

Another horizontal group, for the upload button.

grp_upload <- ggroup(container = win)

For widgets that we want to respond to an action, we need to add a handler argument. This is always a function accepting a list as its first argument (named h by convention), and dots. The gbutton handler is called whenever the button is clicked. Don’t worry about the contents of the handler function for now; we’ll add them in a moment.

btn_upload <- gbutton(
  text      = "Upload tab delimited file",
  container = grp_upload,
  handler   = function(h, ...)
  {
    # TODO!
  }
)

Since tab delimited files can have decimal places represented as full-stops or commas (depending upon local conventions), we need a checkbox to choose between cases. We define a function to get the default value from the system locale settings. Conveniently, checkboxes have their own label built-in so we don’t need to create our own this time.

use_comma_for_decimal <- function()
  unname(Sys.localeconv()["decimal_point"] == ",")

chk_eurostyle <- gcheckbox(
  text      = "Use comma for decimal place",
  checked   = use_comma_for_decimal(),
  container = grp_upload
)

The last widget we’ll include is a status bar, so that users don’t have to refer back to the R main window for messages.

status_bar <- gstatusbar("", container = win)

Finally, here’s the content for the button handler. It creates a file open dialog box, which in turn has its own handler function. The action argument names the function to be applied to the file that is opened. The svalue function returns the “most useful thing” from a widget. For a checkbox, the svalue is whether or not it is checked. For a textbox or status bar, the svalue is its text. The filter argument populates the “Files of type” drop down list in the file open dialog.

function(h, ...)
{
  gfile(
    text    = "Upload tab delimited file",
    type    = "open",
    action  = ifelse(svalue(chk_eurostyle), "read.delim2", "read.delim"),
    handler = function(h, ...)
    {
      tryCatch(
        {
          data_frame_name <- make.names(svalue(txt_data_frame_name))
          the_data <- do.call(h$action, list(h$file))
          assign(data_frame_name, the_data, envir = globalenv())
          svalue(status_bar) <-
            paste(nrow(the_data), "records saved to variable", data_frame_name)
        },
        error = function(e) svalue(status_bar) <- "Could not upload data"
      )
    },
    filter = list(
      "Tab delimited" = list(patterns = c("*.txt", "*.dlm", "*.tab")),
      "All files" = list(patterns = c("*"))
    )
  )
}

And there we have it.

The file upload dialog box we created

If you’re feeling enthusiastic, see if you can adapt this to work with CSV files or, even better, a general delimited file.
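As a hint for the CSV case, only the reader function and the file-type filter need to change. Here’s a sketch (delimited_args is my own hypothetical helper; in the dialog you’d pass its action and filter elements to gfile):

```r
# Hypothetical helper: choose the reader and file filter for CSV files,
# respecting the comma-for-decimal convention as before
delimited_args <- function(eurostyle)
{
  list(
    action = if(eurostyle) "read.csv2" else "read.csv",
    filter = list(
      "Comma separated" = list(patterns = c("*.csv")),
      "All files"       = list(patterns = c("*"))
    )
  )
}

delimited_args(FALSE)$action    # "read.csv"
```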

One last trick to finish the post: you can create a GUI interface to any function using ggenericwidget. Try

lmwidget <- ggenericwidget(lm)

Two amigos MATLAB contest

6th October, 2010

Today I discovered a MATLAB mini-contest called The Two Amigos. The idea is to use MATLAB to remove Bob from a photo of the three Pick-of-the-Week bloggers. The contest officially closed last week, but they had no entries by submission day, so you’re still in with a chance if you’re quick.

I hadn’t done any image processing in MATLAB until earlier today, and I don’t have access to the Image Processing Toolbox, so my attempt is pretty basic. I’m posting my submission here to give you a head start. As with all the other code on this blog, it is licensed under the WTFPL, so you can literally do “what the f*ck” you want with it.  (If you submit something based upon my code, though, an attribution would be appreciated.)

First up: reading and displaying an image in MATLAB is easy.

theAmigos = imread('threeamigos-800w.jpg');

My first idea was to simply place a black rectangle over a region roughly corresponding to Bob.

% Take a copy of the basic image
theAmigosBlackout = theAmigos;

% Select 'Bob' region
bobRectX = 220:563;
bobRectY = 300:470;

% Make this region black
theAmigosBlackout(bobRectX, bobRectY, : ) = 0;


Bob is blacked out from the photo of the three amigos

Hmm. The black region looks a little severe. It would be easier on the eye to simply blur him out. My homemade blur technique uses the filter2 function with a moving-average filter (a very simple smoother) to blur the region.

theAmigosBlur = theAmigos;
blurRadius = 20;

% Calculate weights for the blur filter
w = [(blurRadius + 1):-1:1 2:(blurRadius + 1)];
w2 = repmat(w, length(w), 1);
weights = 1 ./ (w2 + w2');
weights = weights / sum(weights(:));

% Apply filter to each colour channel
for i = 1:3
   theAmigosBlur(bobRectX, bobRectY, i) = ...
      filter2(weights, theAmigosBlur(bobRectX, bobRectY, i));
end

Bob is blurred in the photo of the three amigos

The problem here is that the edges of the blurred region look darker. Presumably filter2 treats values outside the image as black. To solve this, we use only the central “valid” region of the filter. This involves some fiddling to extend the rectangle’s region, which in turn involves extending the base of the image.

theAmigosBetterBlur = theAmigos;

% Temporarily extend image bottom
bottom = repmat(theAmigosBetterBlur(end, :, : ), blurRadius, 1);
theAmigosBetterBlur = [theAmigosBetterBlur; bottom];

% Extend rectangle
extendedX = (min(bobRectX) - blurRadius):(max(bobRectX) + blurRadius);
extendedY = (min(bobRectY) - blurRadius):(max(bobRectY) + blurRadius);

% Apply filter to valid region
for i = 1:3
   theAmigosBetterBlur(bobRectX, bobRectY, i) = ...
      filter2(weights, theAmigosBetterBlur(extendedX, extendedY, i), 'valid');
end

% Remove the extension to the bottom of the image
theAmigosBetterBlur = theAmigosBetterBlur(1:(end - blurRadius), :, : );


The edges to the blur of Bob are better now

Can you do better than this? Maybe you can figure out how to do edge detection, or find a better way to locate the ‘Bob’ region? Can you think of a more appropriate substitute than blurring? A few MATLAB logos in there might swing the judges’ decision. Or you could recreate the special effect used when they transport people in Star Trek.

Let me know how you get on.


Fun, fun fun! (array, cell and struct)

3rd October, 2010

The other day I was asked what the point of MATLAB’s cellfun function was. “Surely I can do what it does with a for loop?”, they said. The quick answer is “yes, you can use a for loop, but cellfun is still very useful”. This post tells you why.

The three functions arrayfun, cellfun and structfun are for solving split-apply-combine problems. That is, you split your data up into chunks, apply some function to each chunk, then combine the results together. The difference between them is the type of input variable they accept: arrayfun takes arrays, cellfun takes cell arrays, and structfun takes structures. (It should be noted that having just three functions for this is very restrained. R has half a dozen functions in the apply family, plus aggregate and by, not to mention the plyr package.)

A simple example is to try and get the number of characters in each string of a cell array. We start by defining some data (in this case, the first four metasyntactic variables).

msv = {'foo' 'bar' 'baz' 'quux'};

Now compare using a for loop

n = zeros(size(msv));
for i = 1:numel(msv)
   n(i) = numel(msv{i});
end

with cellfun

n = cellfun(@numel, msv)      % The @ symbol denotes a handle to numel

Aside from the obvious benefit that we’ve hugely cut down on the amount of typing, I think the second method expresses the intent of the code much more clearly. The use of cellfun means that you must have a split-apply-combine problem, whereas for loops are more general concepts, so you need to study the code more closely to understand what is happening.

There is one little niggle with cellfun that I hope The MathWorks will correct one day. If the result of applying the function to each chunk can have a different size, then the result needs to be stored in a cell array rather than a vector. In this case, you need to explicitly set 'UniformOutput' to false. Ideally, cellfun would be able to work out automatically whether the output has uniform size and act appropriately. Keep your fingers crossed that this gets sorted eventually.
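Incidentally, R’s sapply (mentioned above) already behaves the way I’d like cellfun to: it simplifies the result to a vector when every piece is the same size, and quietly returns a list otherwise. A sketch:

```r
msv <- c("foo", "bar", "baz", "quux")

# Same-size pieces: sapply simplifies the result to a vector
sapply(msv, nchar)
# foo  bar  baz quux
#   3    3    3    4

# Ragged pieces (one character per element): sapply returns a list
sapply(msv, function(s) strsplit(s, "")[[1]])
```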


Update to shortcut tools

1st October, 2010

I’ve made a small update to the MATLAB shortcut tools collection on the File Exchange. (See the previous post, A Shortcut to Success, for advice on usage.) The main change is that I’ve been persuaded of the virtue of hiding code in private folders, after finding myself with an increasingly mangled search path. AddShortcutCategory has been replaced with AddShortcutCategories, since you’ll naturally want to add a few at a time, and I’ve added the IconDir function, which simply returns the name of the directory of icons that ship with MATLAB since R2009a. It should be available on the FEX shortly.


MATLAB Conference 2010

1st October, 2010

I went to the MATLAB conference at Wembley Stadium yesterday. There were two presentations and a lot of discussion about parallel computing (more on that later). One of the most interesting things to me was that The MathWorks have gathered command history files from a group of students learning MATLAB, which they are using to weed out some of the problems that users hit when they first learn the language. I can proudly report that I’ve been bitten by several of the examples that were presented.

round 1.234

ans =
    49    46    50    51    52

Here, command syntax passes the input as the character array '1.234', so round returns the ASCII code of each character.   (round is being altered to throw an error upon character input.)
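You can check those ASCII codes from R, where utf8ToInt converts a string to its character codes:

```r
utf8ToInt("1.234")
# [1] 49 46 50 51 52
```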

Likewise the unhelpful error message that occurs when you mess up a multiplication is being revamped.

eye(3) * magic(4)
??? Error using ==> mtimes
Inner matrix dimensions must agree.

mtimes is the underlying function that * calls. The error message will now mention *, as users expect.

I’m really glad to see that The MathWorks are taking an interest in cleaning up these niggles. Little improvements can save users hours of frustration. It reminds me of Canonical’s 100 paper cuts project for Ubuntu.

The parallel computing discussions focussed on two areas: GPU computing, and scaling from a single core to multiple cores on a single machine, through to clusters of machines. The GPU and multicore cases are dealt with via the parallel computing toolbox; scaling to multiple machines requires the rather-more-expensive distributed computing toolbox.

Both products seem to be in a teenage state: mature enough to get some useful things done with them, but missing a few features.   For example, it’s really easy to parallelise a for loop – you simply replace for with parfor – but arrayfun only works in parallel with GPUs, not CPUs.   (John Walley, an application engineer with The MathWorks, is discussing rectifying this.) In fairness to The MathWorks, parallel computing from your desktop is in its infancy everywhere.

If I successfully bat my eyelids at the people in charge of the software budget, I’m hoping to be able to play with the parallel computing toolbox in the near future;  I’ll give you some code examples once that happens.