Posts Tagged ‘r’

GUI building in R: gWidgets vs Deducer

20th February, 2012 4 comments

I’ve been a user (and fan) of gWidgets for a couple of years now for GUI building in R. (See my introduction to it here.) However, it’s always good to check out the competition so I’ve been playing around with Deducer to see how they compare.

R can access a number of GUI building frameworks including tcltk, GTK, qt, and Java, not to mention HTML. gWidgets’ big selling point is that it provides a high-level wrapper to all the R wrappers for each framework, so you can write code in a toolkit-independent way. Switching between tcltk and GTK and qt won’t often be that useful, but if you think you might want to move from a desktop-based GUI to a web app, it makes the transition easier. By contrast, Deducer is based upon rJava, and provides access to the Java Swing framework. It’s a slightly lower-level library (which means you have to write more lines of code to achieve the same thing), but since you get full access to Swing, it’s a little more flexible. Deducer also has some features to integrate your GUIs with JGR, so if you use that for running R, it’s perhaps the most natural choice.

To test the two frameworks, I wrote a small GUI for running the Kolmogorov-Smirnov test (that’s one of the ones for checking whether or not a variable seems to have been sampled from a particular distribution). Take a look at the code below to see the comparison. (Regular readers may notice I’ve switched from my usual under_casing to camelCasing. Both the frameworks use this style, so I thought I’d follow suit for cleanliness.)

First, here are some common variables (labels and the like).

#Some sample data to test against
x1 <- rnorm(100)
x2 <- runif(100)

#Widget labels
labelX <- "Variable name for data: "
labelY <- "Distribution to compare to: "
labelAlternative <- "One or two sided test?: "
labelP <- "The p-value is: "

#Choices for comboboxes
choicesAlternative <- c("two.sided", "less", "greater") #the values ks.test accepts for alternative
distributions <- c(
    normal = pnorm, 
    exponential = pexp,
    F = pf,
    "log-normal" = plnorm,
    "Student's t" = pt,
    uniform = punif
)
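
Before wiring up any widgets, it can help to run the underlying calculation at the console. The following is a minimal sketch using only base R; the seed and the result variable names are assumptions added for illustration, not part of either GUI.

```r
set.seed(42)  # assumption: a seed, added only so the sketch is repeatable
x1 <- rnorm(100)
x2 <- runif(100)

# ks.test accepts a cumulative distribution function as its second argument
normal_vs_normal  <- ks.test(x1, pnorm, alternative = "two.sided")
uniform_vs_normal <- ks.test(x2, pnorm, alternative = "two.sided")

# The same formatting that the GUIs apply to the p-value
format(normal_vs_normal$p.value, digits = 3)
format(uniform_vs_normal$p.value, digits = 3)
```

The uniform sample should be firmly rejected as standard normal, which makes it a handy sanity check once the GUI is running.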

This is the gWidgets GUI

createKsTestGwidgets <- function()
{
  library(gWidgetstcltk)
  options(guiToolkit = "tcltk")
  win <- gwindow("KS Test, gWidgets edition", visible = FALSE)
  
  frmX <- gframe("x", container = win)
  lblX <- glabel(labelX, container = frmX)
  txtX <- gedit(container = frmX)
  
  frmY <- gframe("y", container = win)
  lblY <- glabel(labelY, container = frmY)
  cmbY <- gcombobox(names(distributions), container = frmY)
  
  frmAlternative <- gframe("alternative", container = win)
  lblAlternative <- glabel(labelAlternative, container = frmAlternative)
  cmbAlternative <- gcombobox(choicesAlternative, container = frmAlternative)
  
  btnCalc <- gbutton("Calculate", container = win,
      handler = function(h, ...)
      {
        x <- get(svalue(txtX), mode = "numeric")
        y <- distributions[[svalue(cmbY)]]
        alternative <- svalue(cmbAlternative)
        ans <- ks.test(x, y, alternative = alternative)
        svalue(txtP) <- format(ans$p.value, digits = 3)
      }
  )
  frmResults <- gframe("results", container = win)
  lblP <- glabel(labelP, container = frmResults)
  txtP <- gedit(container = frmResults)
  visible(win) <- TRUE
  
}

createKsTestGwidgets()

…and here’s the Deducer equivalent.

createKsTestDeducer <- function()
{
  library(Deducer)
  win <- new(RDialog)
  win$setSize(300L, 500L)
  win$setTitle("KS TEST, Deducer edition")
  
  JLabel <- J("javax.swing.JLabel")
  lblX <- new(JLabel, labelX)
  addComponent(win, lblX, 1, 1000, 50, 1, rightType = "REL")
  txtX <- new(TextAreaWidget, "x")
  addComponent(win, txtX, 51, 1000, 150, 1, rightType = "REL")
  
  lblY <- new(JLabel, labelY)
  addComponent(win, lblY, 151, 1000, 200, 1, rightType = "REL")
  
  cmbY <- new(ComboBoxWidget, names(distributions))
  cmbY$setDefaultModel(names(distributions)[1])
  addComponent(win, cmbY, 201, 1000, 300, 1, rightType = "REL")
  
  lblAlternative <- new(JLabel, labelAlternative)
  addComponent(win, lblAlternative, 301, 1000, 400, 1, rightType = "REL")
  
  cmbAlternative <- new(ComboBoxWidget, choicesAlternative)
  cmbAlternative$setDefaultModel(choicesAlternative[1])
  addComponent(win, cmbAlternative, 401, 1000, 500, 1, rightType = "REL")
  
  JButton <- J("javax.swing.JButton")
  btnCalc <- new(JButton, "Calculate")
  addComponent(win, btnCalc, 501, 1000, 601, 1, rightType = "REL")
  ActionListener <- J("org.rosuda.deducer.widgets.event.RActionListener")
  listener <- new(ActionListener)
  calculationHandler <- function(cmd, ActionEvent)
  {
    x <- get(txtX$getText())
    y <- distributions[[cmbY$getModel()]]
    alternative <- cmbAlternative$getModel()
    ans <- ks.test(x, y, alternative = alternative)
    print(ans)
    txtP$setText(format(ans$p.value, digits = 3))
  }
  listener$setFunction(toJava(calculationHandler))
  btnCalc$addActionListener(listener)
  
  lblP <- new(JLabel, labelP)
  addComponent(win, lblP, 601, 1000, 650, 1, rightType = "REL")
  
  txtP <- new(TextAreaWidget, "results")
  addComponent(win, txtP, 651, 1000, 750, 1, rightType = "REL")
    
  win$run()
}

createKsTestDeducer()

Note that the Deducer example works perfectly under JGR, though I couldn’t get the button handler to fire when running it from Eclipse. This is likely due to my inexperience with the toolkit rather than a fundamental problem with the framework. Many of the lines are more or less a one-to-one comparison, but Deducer requires you to explicitly specify positions of widgets, and is a little more verbose when you come to add event handling logic.

Either of these frameworks is suitable for the obvious use cases of GUI building in R (rapid prototyping of front-ends for non-technical users, and teaching demos), so don’t sweat your decision too much.

Edit: Fixed variable name casing issues.

R hits 10000 questions on stackoverflow

17th February, 2012 Leave a comment

R's 10000th question on stackoverflow

A milestone, though not that exciting as questions go. Still, if you haven’t yet joined the cult of Stack Exchange, take a look here.


Exploring the functions in a package

26th January, 2012 4 comments

Sometimes it can be useful to list all the functions inside a package. This is done in the same way that you would list variables in your workspace. That is, using ls. The syntax is ls(pos = "package:packagename"), which is easy enough if you can remember it. Unfortunately, I never can, and have to type search() first to see what the format of that string is.

Today, that problem is solved with a tiny utility function to save remembering things, and to save typing.

lsp <- function(package, all.names = FALSE, pattern) 
{
  package <- deparse(substitute(package))
  ls(
      pos = paste("package", package, sep = ":"), 
      all.names = all.names, 
      pattern = pattern
  )
}

all.names and pattern behave in the same way as they do in regular ls. You use it like this:

lsp(base)
lsp(base, TRUE)
lsp(base, pattern = "^is")
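
One caveat is that the "package:packagename" form only works for packages that are attached to the search path. For a package that is installed but not attached, a rough alternative is getNamespaceExports. This sketch uses the tools package because it ships with R; the variable name is just for illustration.

```r
# tools ships with R but is not attached by default in a fresh session
"package:tools" %in% search()

# getNamespaceExports loads the namespace (without attaching it) and
# returns the names of the objects that the package exports
tools_exports <- sort(getNamespaceExports("tools"))
head(tools_exports)
grep("^file_", tools_exports, value = TRUE)
```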


EDIT: I’ve had a couple of questions about the use case, and there are some interesting comments on alternatives. My thinking behind this function was that I sometimes know I’ve seen a function in a package but can’t remember what it’s called. If you can hazard a guess at the name, then apropos is probably better, though it looks everywhere on the search path rather than in a particular package. Autocompletion is also useful for this, but you need to know the first few characters of what you are looking for. (Activate autocompletion by pressing TAB in R GUI or RStudio, or CTRL+space in Eclipse. I can’t remember what the shortcut is in Emacs, but you probably just mash CTRL+META until you have RSI.) Finally, the unknownR package is useful for finding new functions that you hadn’t heard of yet.
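
To make the difference between the two searches concrete, here is a small sketch comparing apropos with a package-restricted ls. The regular expression and variable names are only examples.

```r
# apropos searches everything on the search path for a regular expression
is_functions_anywhere <- apropos("^is\\.")

# ls with pos restricts the search to a single attached package
is_functions_in_base <- ls(pos = "package:base", pattern = "^is\\.")

# Matches that apropos found somewhere other than base
setdiff(is_functions_anywhere, is_functions_in_base)
```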

Adding metadata to variables

6th January, 2012 Leave a comment

There are only really two ways to preserve your statistical analyses. You either save the variables that you create, or you save the code that you used to create them. In general the latter is much preferred because at some point you’ll realise that your model was wrong, or your dataset has changed, and you need to re-run your analysis. If you only stored your variables then you are now stuck rewriting your code in order to create new versions, which is really not fun. On the other hand, if you saved your code, all you have to do is tweak it and run it.

Occasionally though, just keeping the code and rerunning an analysis isn’t practical, most obviously when the analysis takes a long time. If your model takes more than ten minutes to run, it can be really useful to save its variables as well as the source code.

The problem with saving variables is that when you come back and load them six months later, it isn’t always obvious what they are or where they came from. With code, we solve this by using comments to jog our memory, so it would be nice to have an equivalent for variables. In fact, in R, such a facility exists with the – you guessed it – comment function.

library(lattice)
comment(barley) <- "Immer's barley data, 1934.  The data from the Morris site may have the wrong years."
comment(barley)

The comment function simply stores the string as an attribute of the variable, with some special rules on printing. Other common attributes that you may be familiar with are names for vectors and lists, and dim and dimnames for matrices.
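
As a quick illustration of those printing rules, the sketch below attaches a comment to a throwaway vector (the vector and its comment text are made up) and shows that the comment stays out of the way when the value is printed.

```r
x <- 1:3
comment(x) <- "Three small integers, for illustration only."

# Printing the variable does not show the comment
printed <- capture.output(print(x))
printed

# The comment is still stored, and can be read back in two ways
comment(x)
attr(x, "comment")
```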

You can find the names of all the attributes of a variable with the attributes function, and get and set individual attributes with attr.

x <- c(apple = 1, banana = 2)
attr(x, "type") <- "fruit"
attributes(x)
attr(x, "names") #same as names(x)

Attributes are really great for storing contextual metadata about a variable. For starters, when you come back to your saved workspace after those six months you might want to know who created the variable and when. To get this facility, we need an enhanced version of assign.

get_user <- function()
{
  env <- if(.Platform$OS.type == "windows") "USERNAME" else "USER"
  unname(Sys.getenv(env))    
}  
  
assign_with_metadata <- function(x, value, ..., pos = parent.frame(), inherits = FALSE)
{
  attr(value, "creator") <- get_user()
  attr(value, "time_created") <- Sys.time()
  more_attr <- list(...)
  attr_names <- names(more_attr)
  for(i in seq_along(more_attr))
  {
    attr(value, attr_names[i]) <- more_attr[[i]]
  }
  assign(x, value, pos = pos, inherits = inherits)
}

assign_with_metadata("x", 1:3, monkey = "chimp")

Notice the ... that allows you to add arbitrary attributes to the variable.

While this is great, and solves the problem, typing assign_with_metadata is way too clunky. It would be much easier if we could just use <- to assign variables and get the metadata for free.

Actually, overriding <- itself is going to lead to slowness and likely errors. Since we don’t want to store metadata for every variable (just the important ones), it is better to define our own operators to do so.

`%<-%` <- function(x, value)
{
  xname <- deparse(substitute(x))
  pos <- parent.frame()
  assign_with_metadata(xname, value, pos = pos)
}

`%<<-%` <- function(x, value) 
{
  xname <- deparse(substitute(x))
  pos <- globalenv()
  assign_with_metadata(xname, value, pos = pos)
}

m %<-% "foo"    #local assignment with metadata
f <- function()
{
  n %<<-% "bar" #global assignment with metadata
}
f()

With these functions, if you want to save your variables for later, simply swap <- for %<-%.
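
The metadata is only useful if it survives being saved and reloaded. The following minimal sketch checks that with saveRDS and readRDS; the attribute values and the temporary file are assumptions for illustration, and the attributes are set by hand here so the block runs on its own.

```r
# Attach some metadata by hand (values are made up for illustration)
model_output <- c(intercept = 3, slope = 0.5)
attr(model_output, "creator") <- "a.user"
attr(model_output, "time_created") <- Sys.time()

# Save to a temporary file, then load it back as if six months had passed
rds_file <- tempfile(fileext = ".rds")
saveRDS(model_output, rds_file)
reloaded <- readRDS(rds_file)

# The attributes come back along with the values
attributes(reloaded)
attr(reloaded, "creator")
```

The same holds for save and load on a whole workspace, since attributes are stored as part of each object.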

A quick primer on split-apply-combine problems

16th December, 2011 5 comments

I’ve just answered my hundred billionth question on Stack Overflow that goes something like

I want to calculate some statistic for lots of different groups.

Although these questions provide a steady stream of easy points, it’s such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.

First off, you need to get your data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for and another column containing the group to which that value belongs. A good example is the InsectSprays dataset, built into R.

head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A

These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.

First, we split the count column by the spray column.

(count_by_spray <- with(InsectSprays, split(count, spray)))

Secondly, we apply the statistic to each element of the list. Let’s use the mean here.

(mean_by_spray <- lapply(count_by_spray, mean))

Finally, (if possible) we recombine the list as a vector.

unlist(mean_by_spray)

This procedure is such a common thing that there are many functions to speed up the process. sapply and vapply do the last two steps together.

sapply(count_by_spray, mean)
vapply(count_by_spray, mean, numeric(1))
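
The two functions differ in how they treat surprises. As a small sketch (using range, which returns two values per group, purely as an example), sapply adapts its output shape while vapply insists on the shape you promised.

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# range returns two values per group, so sapply quietly builds a matrix
range_by_spray <- sapply(count_by_spray, range)
dim(range_by_spray)

# vapply was promised one number per group, so it stops with an error instead
vapply_result <- try(
  vapply(count_by_spray, range, numeric(1)),
  silent = TRUE
)
inherits(vapply_result, "try-error")
```

That strictness makes vapply the safer choice inside functions, where an unexpected output shape is usually a bug.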

We can do even better than that however. tapply, aggregate and by all provide a one-function solution to these S-A-C problems.

with(InsectSprays, tapply(count, spray, mean))
with(InsectSprays, by(count, spray, mean))
aggregate(count ~ spray, InsectSprays, mean)
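
The three functions compute the same numbers but hand them back in different containers, which often decides which one is most convenient. A short sketch, with variable names chosen only for illustration:

```r
mean_tapply    <- with(InsectSprays, tapply(count, spray, mean))
mean_by        <- with(InsectSprays, by(count, spray, mean))
mean_aggregate <- aggregate(count ~ spray, InsectSprays, mean)

is.array(mean_tapply)          # a named one-dimensional array
inherits(mean_by, "by")        # a "by" object, printed group by group
is.data.frame(mean_aggregate)  # a data frame with one row per group
```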

The plyr package also provides several solutions, with a choice of output format. ddply takes a data frame and returns another data frame, which is what you’ll want most of the time. dlply takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.

library(plyr)
ddply(InsectSprays, .(spray), summarise, mean.count = mean(count))
dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))

You can read much more on this type of problem and the plyr solution in The Split-Apply-Combine Strategy for Data Analysis, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.

One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the ave function (which provides mean as the default function).

with(InsectSprays, ave(count, spray))
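
ave also accepts other statistics through its FUN argument, which must be passed by name. As a rough sketch of two common uses (the variable names are illustrative):

```r
# Group medians, repeated for every row in the group
median_by_row <- with(InsectSprays, ave(count, spray, FUN = median))

# Centre each count on the mean of its own group
centred_count <- with(InsectSprays, count - ave(count, spray))

head(data.frame(InsectSprays, median_by_row, centred_count))
```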

Interactive graphics for data analysis

1st September, 2011 2 comments

Rocking out, reading Theus & Urbanek

I got a copy of Martin Theus and Simon Urbanek’s Interactive Graphics for Data Analysis a couple of years ago, and it’s been sat on my bookshelf ever since. Since I’ve recently become a self-proclaimed expert on interactive graphics I thought it was about time I read the thing. Which is exactly what I did last weekend at the Leeds Festival (in between rocking out).

It’s a book of two halves, and despite the title the interactivity isn’t really the focus. The book is actually a guide on how to do exploratory data analysis. The first half of the book works like an advanced chart chooser, explaining which plots are useful for which types of data, and what types of interactivity they can benefit from. For me, it was worth it for the many rare plots, like spineplots and interaction plots and mosaic plots and fluctuation diagrams. If you’re bored of barcharts, this is a great way to expand your graphical vocabulary. The second half of the book consists entirely of case studies, where you can practice a workflow for exploring data, which is something that’s always worthwhile doing.

The really big takeaway that I got is that exploratory graphics have different priorities to publication graphics. When you are in the courting stage with a dataset, just getting to know each other, you don’t really care so much about whether the greek letters in your axis label are formatted correctly or whether the shade of pink in your dots is quite right. All you really need is to be able to generate lots and lots of plots quickly, and to be able to see the relationships between them.

It is this last point that the authors claim interactivity is most useful for. Perhaps the canonical example of this is clicking a bar on a histogram or barchart, and having corresponding points on a scatterplot highlighted. To demonstrate this, here’s an example using Simon’s Acinonyx package (shortly to be renamed ix for “iplots Extreme”). Acinonyx isn’t yet available on CRAN; see its home page for installation details.

library(Acinonyx)        
library(MASS)
data(Cars93)
interactive_scatter <- with(Cars93, iplot(Horsepower, MPG.city))  
interactive_histo <- with(Cars93, ihist(EngineSize))

Click a bar in the histogram and the corresponding points in the scatterplot are highlighted. Likewise, drag to select points in the scatterplot and fractions of the histogram are highlighted.

The equivalent static version would be to use trellising and draw each possible graph combination. Splitting a scatterplot into different groups depending upon bars of a histogram works something like this:

library(ggplot2)
Cars93$EngineSizeGroup <- cut(Cars93$EngineSize, 11)
(static_trellis_scatter <- ggplot(Cars93, aes(Horsepower, MPG.city)) +
  geom_point() +
  facet_wrap(~ EngineSizeGroup)
)

(We don’t actually need to bother with the histograms, since they are a little boring.) The reverse operation – going from a selected region of scatterplot to a highlighted region of bar chart – is also possible, but trickier. In this case, we do need both graphs.

Cars93 <- within(Cars93, 
{
  selected <- ifelse(
    Horsepower < 200 & MPG.city > 20 & MPG.city < 30, 
    "selected", 
    "unselected"
  )
})
(static_scatter_with_highlight <-
  ggplot(Cars93, aes(Horsepower, MPG.city, colour = selected)) +
  geom_point()
)
(static_histo_with_highlight <- 
  ggplot(Cars93, aes(EngineSizeGroup, fill = selected)) +
  geom_bar() +  #EngineSizeGroup is a factor, so count the rows in each level
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
)

My conclusion from reading the book, and from my initial experimentation with Acinonyx is that anything you can do interactively is also possible by drawing many static graphs, but the interaction can let you see things quicker.

Anonymising data

23rd August, 2011 7 comments

There are only three known jokes about statistics in the whole universe, so to complete the trilogy (see here and here for the other two), listen up:

Three statisticians are on a train journey to a conference, and they get chatting to three epidemiologists who are also going to the same place. The epidemiologists are complaining about the ridiculous cost of train tickets these days. At this, one of the statisticians pipes up “it’s actually quite reasonable if you use our method – we’ve just got one ticket between the three of us”.

The epidemiologists are amazed. “But how do you get away with that?”, they cried in unison.

“Watch and learn” replied a statistician.

A few minutes later, the inspector’s voice was heard down the carriage. At that, the statisticians bundled themselves into the toilet. The inspector knocked on the door. “Tickets please”, she said, and the statisticians passed their single ticket under the door. The inspector stamped it and returned it, and the statisticians made it to the conference.

On the way back, the statisticians again met the epidemiologists. This time, the epidemiologists proudly displayed their single ticket. “Aha”, said a statistician. “This time we have no tickets.” Again the epidemiologists were amazed, but they had little time to ponder it because the inspector was coming down the carriage. The epidemiologists dashed off into the toilet, and soon enough there was a knock on the door. “Tickets please”, they heard, and passed their ticket under the door. The statisticians took the ticket and went off to their own toilet!

The moral of the story being “never use a statistical technique that you don’t understand”.

All this preamble goes by way of saying: data anonymisation isn’t something that I know a great deal about, but I had some ideas and wanted to get feedback from you.

Any personal data of any importance needs to respect the privacy of the people it represents. Data containing financial or medical details in particular should not be exposed for public consumption (at least if you want people to continue providing you with their data). Anonymising data is an important concept in achieving this privacy.

While this is something you need to think about through the whole data lifecycle (from creating it, to storing it – probably in a database – through analysing it, and possibly publishing it) this post focuses on the analysis phase. At this stage, your data is probably in data frame form, with some identifying columns that need to be anonymised, and some useful values that need to be preserved. Here’s some made-up data, in this case pacman scores of the Avengers.

pacman <- data.frame(
  id                = LETTERS[c(1, 2, 2, 2, 3, 4, 5, 6)],
  first_name        = c("Steve", rep.int("Tony", 3), "Natasha", "Clint", "Bruce", "Thor"),
  last_name         = c("Rogers", rep.int("Stark", 3), "Romanoff", "Barton", "Banner", NA),
  alias             = c("Captain America", rep.int("Iron Man", 3), "Black Widow", 
                        "Hawkeye", "The Hulk", "Thor"),
  gender            = rep(c("Male", "Female", "Male"), times = c(4, 1, 3)),
  pacman_score      = c(round(rlnorm(7, 9, 3), -1), 3333360),
  stringsAsFactors  = FALSE
)
cols_to_anon <- c("first_name", "last_name", "alias") 

(Naturally, Thor has godlike pacman abilities and achieves a perfect score.) There are two main ways of making data anonymous: removing or obfuscating the personal information, or aggregating it so you only provide summary data.

R has endless ways of aggregating data; tapply and the plyr package should be enough to get you started. This aggregation should be done as late in the day as possible, since summary data is in general less useful than raw data. The rest of the post focuses on removing or obfuscating personal info.
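
For completeness, here is a minimal sketch of the aggregation route using base R’s aggregate. The small data frame is invented for this example so that the block runs on its own.

```r
# Made-up scores, for illustration only
scores <- data.frame(
  first_name   = c("Steve", "Tony", "Natasha", "Clint", "Bruce"),
  gender       = c("Male", "Male", "Female", "Male", "Male"),
  pacman_score = c(1200, 56000, 8800, 430, 97000)
)

# Only the grouping column and the summary statistic survive
(median_by_gender <- aggregate(pacman_score ~ gender, scores, median))
```

Note that a group with a single member (the one female player here) still reveals that person’s exact score, so very small groups need merging or suppressing before the summary is shared.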

Method 1: Strip personal info columns

If you have an ID column, then the first obvious solution is to simply strip out the columns that reveal identifying information.

within(pacman, 
{
  first_name <- NULL
  last_name <- NULL
  alias <- NULL
})

Method 2: Create an ID column

If there is no ID column, or you don’t want to reveal it (since it gives information about your database), you need an alternative. You can create such an ID column by combining the identifying data into a single factor, then using the underlying integer code as an ID.

simple_id <- function(data, cols_to_anon)
{
  to_anon <- subset(data, select = cols_to_anon)
  ids <- unname(apply(to_anon, 1, paste, collapse = ""))
  as.integer(factor(ids))
}
pacman$method2_id <- simple_id(pacman, cols_to_anon)  

This is easy, but has the disadvantage that when your dataset is inevitably updated (by adding or removing rows), regenerating the ids will assign different numbers to your rows. It would be useful if you got the same answer for a row regardless of the state of the rest of your dataset.
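
The instability is easy to see with a toy example. In this sketch (the helper name and the keys are made up), adding a single new person changes the IDs of everyone who was already in the dataset.

```r
# Hypothetical helper: the same factor-code trick, applied to pasted keys
make_ids <- function(keys) as.integer(factor(keys))

keys_v1 <- c("SteveRogers", "TonyStark", "ThorNA")
keys_v2 <- c(keys_v1, "BruceBanner")  # one new row arrives

ids_v1 <- make_ids(keys_v1)
ids_v2 <- make_ids(keys_v2)

# Compare the IDs of the three original rows before and after the update
data.frame(key = keys_v1, before = ids_v1, after = ids_v2[seq_along(keys_v1)])
```

The codes depend on the sorted set of all keys, so any new key that sorts before an existing one shifts the rest.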

Method 3: Use digest package to create the ids

The digest package creates hashes of values, which does exactly this.

anonymise <- function(data, cols_to_anon, algo = "sha256")
{
  if(!require(digest)) stop("digest package is required") 
  to_anon <- subset(data, select = cols_to_anon)
  unname(apply(to_anon, 1, digest, algo = algo))
}

pacman$method3_id <- anonymise(pacman, cols_to_anon)

(Try adding, deleting or reordering rows to check that you get the same IDs.) This is good enough for most purposes, but for high security cases it’s important to note two caveats. The description of the digest package notes that

this package is not meant to be deployed for cryptographic purposes for which more comprehensive (and widely tested) libraries such as OpenSSL should be used.

Secondly, applying a cryptographic hash to the actual values leaves them vulnerable to a rainbow table attack. A rainbow table is a precomputed table of strings and their hashes. The attack means that (as long as the string is in the table) recovering the original value just means looking up the hash in the table. The defense against this is to add some random junk, called “salt”, to the strings that you are hashing. If you add enough junk, the salted strings will be longer than the values in the rainbow table, so you’ve escaped.

generate_salt <- function(data, cols_to_anon, n_chars = 20)
{                                                                
  index <- simple_id(data, cols_to_anon)
  n_indicies <- length(unique(index))   
  chars <- rawToChar(as.raw(32:126), multiple = TRUE)
  x <- replicate(n_indicies, paste(sample(chars, n_chars, replace = TRUE), collapse = ""))
  x[index]
}

pacman$salt <- generate_salt(pacman, cols_to_anon)
pacman$method4_id <- anonymise(pacman, c(cols_to_anon, "salt")) 

Of course, there’s a problem with this that you may have spotted. Salt is randomly generated, so if you update your dataset, as we discussed above, then you’ll get different salt. (Setting the random seed doesn’t help if you are generating different amounts of salt.) At this point, you might as well just use method 1 or 2, since they are easier.
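
One possible workaround, offered only as a sketch and not as a vetted security design, is to keep the salt in a lookup table and generate new salt only for identities that haven’t been seen before. The function update_salt and its arguments are hypothetical names invented for this example, and the table itself would need to be stored at least as securely as the raw data.

```r
# Hypothetical helper: reuse stored salt, and only generate salt for new keys
update_salt <- function(keys, salt_table = character(), n_chars = 20)
{
  chars <- rawToChar(as.raw(32:126), multiple = TRUE)
  new_keys <- setdiff(unique(keys), names(salt_table))
  new_salt <- vapply(
    new_keys,
    function(key) paste(sample(chars, n_chars, replace = TRUE), collapse = ""),
    character(1)
  )
  c(salt_table, new_salt)  # a character vector of salt, named by key
}

# First version of the dataset
salt_v1 <- update_salt(c("SteveRogers", "TonyStark"))

# Later, a row is added and the rows are reordered
salt_v2 <- update_salt(c("TonyStark", "ThorNA", "SteveRogers"), salt_v1)

# Salt for the people who were already there is unchanged
salt_v2[names(salt_v1)] == salt_v1
```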

So the problem of how to create truly secure anonymous data in R isn’t completely solved, for me at least. Let me know in the comments if you have any better ideas.

More useless statistics

22nd August, 2011 7 comments

Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you will see this example soon) is the Anscombe quartet.

The data set is available in R as anscombe, but it requires a little reshaping to be useful.

anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

Note the use of gl to autogenerate factor levels.
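
In case gl is unfamiliar, here is a tiny sketch of what it produces. The first argument is the number of levels and the second is how many times each level is repeated; the labels in the second call are made up.

```r
# Four levels, each repeated eleven times in a block
four_groups <- gl(4, 11)
table(four_groups)

# Custom labels can be supplied too
two_groups <- gl(2, 3, labels = c("low", "high"))
two_groups
```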

So we have four sets of x-y data, which we can easily calculate summary statistics from using ddply from the plyr package. In this case we calculate the mean and standard deviation of y, the correlation between x and y, and run a linear regression.

library(plyr)
(stats <- ddply(anscombe2, .(group), summarize, 
  mean = mean(y), 
  std_dev = sd(y), 
  correlation = cor(x, y), 
  lm_intercept = lm(y ~ x)$coefficients[1], 
  lm_x_effect = lm(y ~ x)$coefficients[2]
))

  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091
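
If plyr isn’t installed, roughly the same table can be built with base R. This is a sketch of one way to do it, using split and sapply; it repeats the reshaping step so that it runs on its own, and it fits each regression once rather than twice.

```r
anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

# Split by group, compute the statistics for each piece, then combine
stats_base <- t(sapply(
  split(anscombe2, anscombe2$group),
  function(d)
  {
    fit <- lm(y ~ x, data = d)
    c(
      mean         = mean(d$y),
      std_dev      = sd(d$y),
      correlation  = cor(d$x, d$y),
      lm_intercept = unname(coef(fit)[1]),
      lm_x_effect  = unname(coef(fit)[2])
    )
  }
))
round(stats_base, 3)
```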

Each of the statistics is almost identical between the groups, so the data must be almost identical in each case, right? Wrong. Take a look at the visualisation. (I won’t reproduce the plot here and spoil the surprise; but please run the code yourself.)

library(ggplot2)
(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
)

Each dataset is really different – the statistics we routinely calculate don’t fully describe the data. Which brings me to the second statistics joke.

A physicist, an engineer and a statistician go hunting. 50m away from them they spot a deer. The physicist calculates the trajectory of the bullet in a vacuum, raises his rifle and shoots. The bullet lands 5m short. The engineer adds a term to account for air resistance, lifts his rifle a little higher and shoots. The bullet lands 5m long. The statistician yells “we got him!”.

useR2011 highlights

18th August, 2011 11 comments

useR has been exhilarating and exhausting. Now it’s finished, I wanted to share my highlights.

10. My inner twelve year old schoolgirl swooning and fainting with excitement every time I chatted with a member of R-core.

9. Patrick Burns declaring that his company consists of himself and his two cats. And that one of the cats keeps changing the settings on his mail reader to spite him.

8. Søren Højsgaard and Robert Goudie both patiently answering my bazillion stupid questions on Bayesian networks.

7. Audience members high-fiving each other during my talk.

6. Peter Baker coming up with ideas for a don’t-repeat-yourself workflow, so I can spend more time doing analysis that matters.

5. Peter Baker coming up with ideas for a don’t-repeat-yourself workflow. Oh, wait.

4. Jason and Tobias from OpenAnalytics talking about lab automation, so I can get those darned chemists off my back. (Just kidding, chemists; I love your data.)

3. Blogist Tal Galili volunteering as my speaking coach. (The big secret: talking really slowly to yourself before a presentation stops you overloading on adrenalin before you get up.)

2. Jonathan Rougier getting the audience to go “ah” every time he mentioned donkeys. And getting me to understand what the point of nomograms is.

1. Ben French teaching me that there is a third joke about stats. (I’ll tell you the other two another time.) It goes like this:

An engineer, a chemist and a statistician are staying in a hotel. (It’s a triple room, very cosy.) The first night they are there, a fire breaks out and wakes them up. The engineer gets up, grabs the fire extinguisher and puts out the fire. Later, they’re all woken by another fire (very dodgy health and safety). The chemist thinks “the fire reaction requires oxygen”, grabs a fire blanket and smothers the flames until they go out. Even later, the engineer and the chemist are woken by the statistician lighting a series of fires in the corner. “What are you doing?”, they cry in unison. “Increasing the sample size”, replies the statistician.

useR2011 Easy interactive ggplots talk

17th August, 2011 10 comments

I’m talking tomorrow at useR! on making ggplots interactive with the gWidgets GUI framework. For those of you at useR, here is the code and data, so you can play along on your laptops. For everyone else, I’ll make the slides available in the next few days so you can see what you missed. Note that for confidentiality reasons, I’ve added some random junk to the sample values, so please don’t use this data for actual research. (If you are interested in chemical exposure data, contact the lab.)

Chromium data. Nickel data. Code for examples 1 to 4. Once we introduce example 5, the rest of the code needs a little reworking.

Update: The permissions on the code files were blocking downloads from people who aren’t me. My bad. It should be fixed now.
