## Presenting Immer’s barley data

Last time I talked about adapting graphs for presentations. This time I’m putting some of the concepts I discussed there into action, with a presentation of Immer’s barley dataset. This is a classic dataset, originally published in 1934; in 1993 Bill Cleveland mentioned it in his book Visualizing Data because it may contain an error. Here’s the paper/screen version.

Here’s the presentation.

[**Immer’s barley data**](http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=immersbarleydata-101031144011-phpapp01&stripped_title=immers-barley-data&userName=richierocks)

For the record, the presentation was created with Impress and the audio recorded with Audacity. Using these tools, it’s pretty straightforward to make and share an audio presentation.

EDIT:

I’ve corrected the map slide. Some people also asked about the code to draw the plots. First up, I recoded the factors so that site and variety appear in order of increasing yield.

```r
barley$variety <- with(barley, reorder(variety, yield))
barley$site <- with(barley, reorder(site, yield))
```

The code for the graph is straightforward.

```r
ggplot(barley, aes(yield, variety, colour = year)) +
  geom_point() +
  facet_grid(site ~ .)
```

The presentation version uses `facet_wrap(~ site)` rather than `facet_grid`, and the text size is increased with `theme_set(theme_grey(24))`.
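Putting those two tweaks together, the presentation version can be recreated along these lines (a sketch, assuming ggplot2 is loaded and the `barley` dataset is available from the lattice package):

```r
library(lattice)  # provides the barley dataset
library(ggplot2)

# Bigger base text for projection, and wrapped facets
theme_set(theme_grey(24))
ggplot(barley, aes(yield, variety, colour = year)) +
  geom_point() +
  facet_wrap(~ site)
```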

## Adapting graphs for presentations

I’ve just finished reading slide:ology by Nancy Duarte. It contains lots of advice about how to convey meaning through aesthetics. The book has a general/business presentation focus, but it got me wondering about how to apply the ideas in a scientific context. Since graphs form a big part of most scientific talks, and since that’s the bit I know best, that’s what I’m going to discuss here.

We start with a basic example using ggplot2 and the mtcars dataset.

```r
p_basic <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
```

Something I’ve been burned by recently is overestimating the size of the projector screen. Although my graphs looked great when my face was next to my monitor, the axes were hard to read when projected. A good rule of thumb for text is that your font size should be half the age of the eldest member of your audience or 30 points, whichever is bigger. For graph axes we can perhaps get away with something a little smaller. Don’t forget that by the time you’ve printed your graph to file and played with it in your presentation software, the font size you’ve picked may bear no relation to its final value. That said, the point remains that you need to make the text bigger.

```r
old_theme <- theme_set(theme_grey(base_size = 24))
```

Reading a graph can be quite an in-depth process which can overwhelm your audience. Once you’ve shown them the graph in its basic form, you should emphasise interesting features, one at a time. The one-at-a-time thing is important – if you want your audience to concentrate, then you shouldn’t distract them with many things at once. Here I’ve increased the size of the outliers in the bottom right-hand corner.

```r
mtcars <- within(mtcars, emph <- wt > 5)
p_emph <- ggplot(mtcars) +
  geom_point(aes(wt, mpg, size = emph)) +
  opts(legend.position = "none") +
  scale_size_manual(values = c(2, 5))
```

As well as emphasising values, it can be useful to de-emphasise other regions of the graph. Overlaying a translucent white rectangle on such regions does the trick nicely.

```r
x_lower <- floor(min(mtcars$wt))
x_upper <- ceiling(max(mtcars$wt))
y_lower <- floor(min(mtcars$mpg)) - 1
y_upper <- ceiling(max(mtcars$mpg)) + 1
p_deemph <- p_emph + geom_rect(
  aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
  data = data.frame(
    xmin = c(x_lower, 5),
    xmax = c(5, x_upper),
    ymin = c(y_lower, 16),
    ymax = rep.int(y_upper, 2)
  ),
  fill = "white",
  alpha = 0.6
) +
  scale_x_continuous(limits = c(x_lower, x_upper), expand = c(0, 0)) +
  scale_y_continuous(limits = c(y_lower, y_upper), expand = c(0, 0))
```

For de-emphasising more complicated regions, I find it easier to use dedicated image manipulation software. Paint.NET is my personal zero-cost favourite, but any graphics software should suffice.

Finally, it can be useful to annotate your graph.

```r
p_annotated <- p_deemph + geom_text(
  aes(x = x, y = y, label = label),
  data = data.frame(x = 5, y = 13, label = "outliers!"),
  size = 14
)
```

By now, if your audience hasn’t figured out which bit of the graph you want them to look at you’re in trouble. Let me know in the comments if you have any other ideas for how to draw graphs for presentations.

## Trading secrets

Recently I had the opportunity to do a job swap with one of the guys in the laboratory here at HSL. I helped out with the mass-spectrometry and James helped me with the data analysis. Two very useful things came out of this.

Firstly, it’s been very informative to see how the data I get is created. I tend to assume that the numbers given to me are either correct or mistakes. The reality, though, is more subtle. One thing that surprised me was the lengths that the chemists have to go to to make sure that their instruments give sensible answers. As well as testing urine samples, you need to test blank samples (to clean out the spectrometer’s tubes), standard samples (to calibrate the machine) and quality control samples (to check that the calibration is correct). Even then, it wasn’t entirely clear that you would get the same answer if you ran the samples twice.

The project was based around testing thallium levels in the general population. To give an idea of how much we could trust the data, I re-analysed 50 of the samples that James had run. The tricky bit was the pipetting; there’s a surprising art to avoiding air bubbles.

As you can see, my results were consistently lower than James’s. Taking James as the gold standard in mass-spectrometry skill and myself as the worst-case scenario, you can see that we should only trust the results to the nearest order of magnitude. This is not a trivial exercise – it demonstrates what would happen if James were replaced by an idiot. (All too possible, depending on what George Osborne says later today.)

The second really good thing to come out of this was that I managed to drill into James the importance of manipulating data with code instead of manually editing spreadsheets. He in turn passed on this message when we presented our findings to the lab. (Main finding: no-one is about to die of thallium poisoning.) After the presentation, one of our toxicologists came up to me and said

“I finally get it. I understand why mathematicians keep saying that you shouldn’t use Excel. It’s because in order for your work to be reproducible and auditable, you need the trail of code to see what you’ve done.”

Major win.

## Two amigos: follow up

Brett and Jiro have announced the results of the competition to make a Bob-free image. There were five entries, two prizes and … I didn’t win either. Still, it was a fun challenge and a useful learning experience, so I’m consoling myself with clichés like “it’s not the winning that’s important but the taking part”. I’m certainly not using MATLAB to construct a voodoo-doll image of Brett and Jiro.

```matlab
%% Read in image and display
theAmigos = imread('the amigos better blur.jpg');
image(theAmigos)

%% Add lines
pinColour = [.5 .5 .5];
xcoords = { ...
    [130 180] [132 182] [136 184] [140 186] ...
    [148 190] [165 195] [182 200] [200 205] ...
    [215 214] [230 223] [243 228] [255 234] ...
    [270 237] [283 244] [295 247] [300 246] ...
    [303 248] ...
    [465 515] [465 516] [466 517] [469 519] ...
    [475 522] [487 526] [505 534] [528 540] ...
    [548 546] [567 551] [588 554] [606 557] ...
    [621 560] [628 563] [633 566] [633 567] ...
    [634 568] ...
    };
ycoords = { ...
    [295 300] [275 290] [260 280] [240 275] ...
    [225 274] [220 274] [215 273] [212 273] ...
    [212 273] [214 273] [217 274] [221 274] ...
    [230 275] [240 277] [250 280] [275 285] ...
    [290 292] ...
    [320 322] [305 315] [288 310] [272 304] ...
    [253 300] [240 296] [233 292] [230 291] ...
    [230 291] [232 292] [236 294] [246 297] ...
    [262 300] [280 302] [296 307] [309 312] ...
    [320 320] ...
    };
xstart = cellfun(@(x) x(1), xcoords);
ystart = cellfun(@(x) x(1), ycoords);
hold on
cellfun(@(x, y) line(x, y, 'Color', pinColour), xcoords, ycoords);
arrayfun(@(x, y) plot(x, y, '.', 'Color', pinColour), xstart, ystart);
hold off

%% Remove the extra bits created by plot calls and write to file
set(gca, 'Visible', 'off')
set(gca, 'Position', [0 0 1 1])
print(gcf, '-djpeg', 'the amigos voodoo.jpg')
```

## Two amigos MATLAB contest

Today I discovered a MATLAB mini-contest called The Two Amigos. The idea is to use MATLAB to remove Bob from a photo of the three Pick-of-the-Week bloggers. The contest officially closed last week, but they had no entries by submission day, so you’re still in with a chance if you’re quick.

I hadn’t done any image processing in MATLAB until earlier today, and I don’t have access to the image processing toolbox, so my attempt is pretty basic. I’m posting my submission here to give you a head start. As with all the other code on this blog, it is licensed under the WTFPL, so you can literally do “what the f*ck” you want with it. (If you submit something based upon my code though, an attribution would be appreciated.)

First up: reading and displaying an image in MATLAB is easy.

```matlab
theAmigos = imread('threeamigos-800w.jpg');
image(theAmigos)
```

My first idea was to simply place a black rectangle over a region roughly corresponding to Bob.

```matlab
% Take a copy of the basic image
theAmigosBlackout = theAmigos;
% Select 'Bob' region
bobRectX = 220:563;
bobRectY = 300:470;
% Make this region black
theAmigosBlackout(bobRectX, bobRectY, :) = 0;
image(theAmigosBlackout)
```

Hmm. The black region looks a little severe. It would be easier on the eye to simply blur him out. My homemade blur technique uses the `filter2` function to create a moving average filter (a very simple smoother) to blur the region.

```matlab
theAmigosBlur = theAmigos;
blurRadius = 20;
% Calculate weights for the blur filter
w = [(blurRadius + 1):-1:1 2:(blurRadius + 1)];
w2 = repmat(w, length(w), 1);
weights = 1 ./ (w2 + w2');
weights = weights / sum(weights(:));
% Apply filter to each colour channel
for i = 1:3
   theAmigosBlur(bobRectX, bobRectY, i) = ...
      filter2(weights, theAmigosBlur(bobRectX, bobRectY, i));
end
image(theAmigosBlur)
```

The problem here is that the edges of the blurred region look darker. Presumably missing values beyond the edge of the region are treated as black. To solve this we only use the central “valid” region of the filter. This involves some fiddling to extend the rectangle’s region, which in turn involves some fiddling to extend the base of the image.

```matlab
theAmigosBetterBlur = theAmigos;
% Temporarily extend image bottom
bottom = repmat(theAmigosBetterBlur(end, :, :), blurRadius, 1);
theAmigosBetterBlur = [theAmigosBetterBlur; bottom];
% Extend rectangle
extendedX = (min(bobRectX) - blurRadius):(max(bobRectX) + blurRadius);
extendedY = (min(bobRectY) - blurRadius):(max(bobRectY) + blurRadius);
% Apply filter to valid region
for i = 1:3
   theAmigosBetterBlur(bobRectX, bobRectY, i) = ...
      filter2(weights, theAmigosBetterBlur(extendedX, extendedY, i), 'valid');
end
% Remove the extension to the bottom of the image
theAmigosBetterBlur = theAmigosBetterBlur(1:(end - blurRadius), :, :);
image(theAmigosBetterBlur)
```

Can you do better than this? Maybe you can figure out how to do edge detection, or find a better way to find the ‘Bob’ region? Can you think of a more appropriate substitute than blurring? Maybe a few MATLAB logos in there might swing the judges’ decision. Or you could recreate the special effect used when they transport people on Star Trek.

Let me know how you get on.

## Fun, fun fun! (array, cell and struct)

The other day I was asked what the point of MATLAB’s `cellfun` function was. “Surely I can do what it does with a for loop?”, they said. The quick answer is, “yes you can use a for loop, but it’s still very useful”. This post tells you why.

The three functions `arrayfun`, `cellfun` and `structfun` are for solving split-apply-combine problems. That is, you split your data up into chunks, you apply some function to each chunk, then you combine the results together. The difference between them is the type of input variable that they accept: `arrayfun` takes arrays, and so forth. (It should be noted that having just three functions for this is very restrained. R has half a dozen functions in the `apply` family, plus `aggregate` and `by`, not to mention the `plyr` package.)

A simple example is to try and get the number of characters in each string of a cell array. We start by defining some data (in this case, the first four metasyntactic variables).

```matlab
msv = {'foo' 'bar' 'baz' 'quux'};
```

Now compare using a for loop

```matlab
n = zeros(size(msv));
for i = 1:numel(msv)
   n(i) = numel(msv{i});
end
```

with cellfun

```matlab
n = cellfun(@numel, msv) % The @ symbol denotes a handle to numel
```

Aside from the obvious benefit that we’ve hugely cut down on the amount of typing, I think the second method expresses the *intent* of the code much more clearly. The use of `cellfun` means that you must have a split-apply-combine problem, whereas for loops are more general concepts, so you need to study the code more closely to understand what is happening.

There is one little niggle with `cellfun` that I hope The MathWorks will correct one day. If the result of applying the function to each chunk can have a different size, then the result needs to be stored in a cell array rather than a vector. In this case, you need to explicitly set `'UniformOutput'` to `false`. Ideally, `cellfun` would be able to automatically know when the output doesn’t have uniform size and act appropriately. Keep your fingers crossed that this gets sorted eventually.