Archive for September, 2010

The truth is stranger than fiction

27th September, 2010

It turns out that when you search Google for the phrase "4D pie chart", this blog is only the second link.  The number one spot is taken by a tutorial on how to create animated pie charts with Maxon Computer's Cinema 4D package.  I can't detect any hint of sarcasm in the instructions.  Hang your head in shame, Maxon Computer employee.


Dotplot on xkcd

27th September, 2010

There's a nice example of a dotplot on xkcd today.  Hand-drawn lines aside, it's well drawn.  The only obvious thing wrong with it is the scale.  If you're going to log-transform your data, then the axis labels should show the untransformed amounts.  So the numbers -17 to -5 along the top x-axis should really be e^-17 to e^-5, or better still, powers of 10, because those are much easier to understand.  For example, compare:

“Stochastic is used e to the power 14 times as often as f*cking stochastic.”

“Stochastic is used 10 to the power 6 times as often as f*cking stochastic.”
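
As an aside, if you draw this kind of plot in R, ggplot2's scale_x_log10 gives you exactly this behaviour: the axis is spaced on a log scale but labelled with the untransformed values.  A minimal sketch, with made-up data:

library(ggplot2)

# Made-up word frequencies spanning several orders of magnitude
word_freq <- data.frame(frequency = 10 ^ runif(100, 0, 6))

# The axis is log-spaced, but labelled in raw counts
ggplot(word_freq, aes(frequency)) +
   geom_histogram(bins = 30) +
   scale_x_log10()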


Visualising questionnaires

25th September, 2010

Last week I was shown the results of a workplace happiness questionnaire.  There's a cut-down version of the dataset here, with numbers and wording changed to protect the not-so-innocent.

The plots were ripe for a makeover.  The ones I saw were second-hand photocopies, but I've tried to recreate their full glory as closely as possible.

[Figure: a badly styled stacked bar chart, ripe for a makeover]

To the creator's credit, they have at least picked the correct plot type: a stacked bar chart is infinitely preferable to a pie chart.  That said, there's a lot of work to be done.  Most obviously, the pointless 3D effect needs removing, and the colour scheme is badly chosen.  Rainbow-style colour schemes that change hue are best suited to unordered categorical variables.  If you have some sense of ordering to the variable, then a sequential scale is more appropriate.  That means keeping the hue fixed and either scaling from light to dark, or from grey to a saturated colour.  In this case, we have an ordering and also a midpoint – the "neutral" response.  That means that we should use a diverging scale (where saturation or brightness increases as you move farther from the midpoint).
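
If you want to browse suitable diverging schemes, the RColorBrewer package (the same palettes that ggplot2's scale_fill_brewer draws on) can display them.  A minimal sketch; the choice of "RdYlBu" below is just one illustration:

library(RColorBrewer)

# Show all the diverging palettes...
display.brewer.all(type = "div")

# ...then a seven-colour one, to match the seven response levels
display.brewer.pal(7, "RdYlBu")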

More problematic than these style issues is that it isn’t easy to answer any useful question about the dataset.  To me, the obvious questions are

On balance, are people happy?

Which questions indicate the biggest problems?

Which sections indicate the biggest problems?

All these questions require us to condense the seven points of data for each question down to a single score, so that we can order the questions from most negative to most positive.  The simplest, most obvious scoring system is a linear one.  We give -3 points for "strongly disagree", -2 for "disagree", through to +3 for "strongly agree".  (In this case, all the questions are phrased so that agreeing is a good thing.  A well-designed questionnaire should contain a balance of positively and negatively phrased questions to avoid yes-ladder (link is NSFW) type effects.  If you have negatively phrased questions, you'll need to reverse the scores.  Also notice that each question uses the same multiple choice scale.  If your questions have different numbers of responses, or more than one answer is allowed, then it may be inappropriate to compare the questions.)

Since the scoring system is slightly arbitrary, it is best practice to check your results under a different scoring system.  Perhaps you think that the "strongly" responses should be more heavily weighted, in which case a quadratic scoring system would be appropriate.  (Replace the weights 1:3 with (1:3)^2/2.)  Assume the data is stored in the data frame dfr.

dfr$score.linear <- with(dfr,
   -3 * strongly.disagree - 2 * disagree - slightly.disagree +
   slightly.agree + 2 * agree + 3 * strongly.agree)
dfr$score.quad <- with(dfr,
   -4.5 * strongly.disagree - 2 * disagree - 0.5 * slightly.disagree +
   0.5 * slightly.agree + 2 * agree + 4.5 * strongly.agree)

For the rest of this post, I'll just present results and code for the linear scoring system.  Swap the word "linear" for "quad" to see the alternative results.  To get an ordering from "worst" to "best", we order by -score.linear.

dfr_linear <- within(dfr,
{
   question <- reorder(question, -score.linear)
   section <- reorder(section, -score.linear)
})

To make the data frame suitable for plotting with ggplot, we reshape it from wide to long format.

library(reshape)
# Columns 4 to 10 hold the counts for each of the seven responses
w2l <- function(dfr) melt(dfr, measure.vars = colnames(dfr)[4:10])
mdfr_linear <- w2l(dfr_linear)

To answer the first question, we simply draw a histogram of the scores and see whether they are mostly above or below zero.

library(ggplot2)
hist_scores_linear <- ggplot(dfr, aes(score.linear)) + geom_histogram(binwidth = 10)

[Figure: histogram of linear scores]

Hmm, not good. Most of the questions had a negative score, implying that the workforce seems unhappy. Now we want to know why they are unhappy. Here are the cleaned-up versions of those stacked bar charts again. As well as the style improvements mentioned above, we plot all the questions together, and in order of increasing score (so the problem questions are the first things you read).

bar_all_q_linear <- ggplot(mdfr_linear, aes(question, value, fill = variable)) +
   geom_bar(position = "stack") +
   coord_flip() +
   xlab("") +
   ylab("Number of responses") +
   scale_fill_brewer(type = "div")

[Figure: bar chart of all questions, linear score]

So deaf managers are the biggest issue.  Finally, it can be useful to know which of the sections scored badly, to find more general problem areas. First we find the mean score by section.

mean_by_section <- with(dfr_linear, tapply(score.linear, section, mean))
dfr_mean_by_section <- data.frame(
   value = mean_by_section,
   section = names(mean_by_section)
)

Now we visualise these scores as a dotplot.

plot_by_section <- function(p)
{
   # Individual question scores in grey, with the section means overlaid
   p + geom_point(colour = "grey20") +
      geom_point(aes(value, section), data = dfr_mean_by_section) +
      xlab("Score") + ylab("")
}

pt_by_section_linear <- plot_by_section(
   ggplot(mdfr_linear, aes(score.linear, section))
)

[Figure: dot plot of scores by section, linear score]

Here you can see that communication is the biggest problem area.


Which functions in R base call internal code?

14th September, 2010

In a recent question on Stack Overflow about speeding up group-by operations, Marek wondered which functions called .Internal code (and consequently were fast). I was surprised to see that there appears to be no built-in way to check whether or not this is the case (though is.primitive is available for primitive functions).

Writing such a function is quite straightforward. We simply examine the contents of the function body and search for the string ".Internal" within it.

callsInternalCode <- function(fn)
{
   if(!require(stringr)) stop("stringr package required")
   any(str_detect(capture.output(body(fn)), "\\.Internal"))
}
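
As a quick sanity check: paste, for instance, is implemented via .Internal, whereas a plain user-defined function is not.

> callsInternalCode(paste)
[1] TRUE
> callsInternalCode(function(x) x + 1)
[1] FALSE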

You can retrieve all the functions in the base package with this one-liner taken from an example on the Filter help page.

funs <- Filter(is.function, sapply(ls(baseenv()), get, baseenv()))

Now getting the list of functions that call internal code is easy.

names(funs)[sapply(funs, callsInternalCode)]
[1] "abbreviate" "agrep"
[3] "all.names" "all.vars"
[5] "anyDuplicated.default" "aperm"
...

EDIT: Tweaked the regex (the '.' should have been escaped).


Preventing argument use in R

12th September, 2010

It sounds silly, but sometimes you don't want to let people use some arguments of a function. The canonical example is write.csv. The function is effectively a wrapper to write.table, but using "," as the separator and "." as the decimal point.

We could implement the function in a really simple way, as

simple.write.csv <- function(...) write.table(sep = ",", dec = ".", ...)

Under most circumstances, this implementation works as expected. The problem occurs when we pass the sep or dec arguments into the function; in this case we get a rather unhelpful error message.
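
For example, a call like this (where dfr is any data frame) triggers it:

simple.write.csv(dfr, file = "test.csv", sep = "!")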

Error in write.table(sep = ",", dec = ".", ...) :
   formal argument "sep" matched by multiple actual arguments

To get round this, the real implementation of write.csv is a little more complex. The basic idea behind it is this:

“Take a look at the call to the function write.csv. Warn the user if any forbidden arguments were passed in, then replace them with specified values. Then pass these on to write.table.”

Let’s take a look at the code.


write.csv <- function (...)
{
   Call <- match.call(expand.dots = TRUE)
   for (argname in c("append", "col.names", "sep", "dec", "qmethod"))
      if (!is.null(Call[[argname]]))
         warning(gettextf("attempt to set '%s' ignored", argname), domain = NA)
   rn <- eval.parent(Call$row.names)
   Call$append <- NULL
   Call$col.names <- if (is.logical(rn) && !rn) TRUE else NA
   Call$sep <- ","
   Call$dec <- "."
   Call$qmethod <- "double"
   Call[[1L]] <- as.name("write.table")
   eval.parent(Call)
}

Understanding this requires some technical knowledge of the R language. When you type things at the R command prompt and hit Enter, two things happen. Those characters are turned into a call, and then they are evaluated to give you your answer. Effectively, what happens is

eval(quote(the stuff you typed))
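
For instance, these two lines are equivalent:

> 2 + 3
[1] 5
> eval(quote(2 + 3))
[1] 5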

You can examine the contents of a call by converting it to a list, for example


> as.list(quote(2+3))
[[1]]
`+`

[[2]]
[1] 2

[[3]]
[1] 3

The first element of the list gives the function being called, and the rest of the elements are the arguments to pass in to that function. Notice that even '+' is a function (and so is '<-').
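
You can even call them with the usual function syntax, by wrapping the name in backquotes:

> `+`(2, 3)
[1] 5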

If you want to know more about this, then take a look at section 6.1 of the R Language Definition. Expressions, which are more or less lists of calls, are discussed in section 6.4.

match.call is like calling quote(the stuff you typed when you called this function).
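
A quick illustration, with f as a throwaway example function.  Note how match.call also fills in the argument names:

> f <- function(x, y) match.call()
> f(1, y = 2)
f(x = 1, y = 2)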

Returning to the example, let’s have a sample call to write.csv.


dfr <- data.frame(x = 1:5, y = runif(5))
write.csv(dfr, file = "test.csv", sep = "!")

The crux of understanding write.csv is in the variable Call, assigned on the first line of the function.


> as.list(Call)
[[1]]
write.csv

[[2]]
dfr

$file
[1] "test.csv"

$sep
[1] "!"
The rest of the function involves manipulating this object. The element sep is replaced with a comma (with a warning) and some additional arguments are set. The function that is called is then swapped for write.table, and finally this call is evaluated.

Deep understanding of this can get a bit mind-melting, but if you want to use this principle in your own functions, you can more or less copy and paste from write.csv.
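
As a template, here's a minimal sketch of the same trick applied to a hypothetical write.tsv wrapper (the function name and the choice of forbidden argument are just for illustration):

write.tsv <- function(...)
{
   Call <- match.call(expand.dots = TRUE)
   # Warn about, then override, any attempt to set the separator
   if(!is.null(Call$sep))
      warning("attempt to set 'sep' ignored")
   Call$sep <- "\t"
   Call[[1L]] <- as.name("write.table")
   eval.parent(Call)
}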


Variability of biomarkers in volunteer studies

9th September, 2010

Variability of biomarkers in volunteer studies: The biological component, of which I am a co-author, is now published in Toxicology Letters.  It's a simple, straightforward look at calculating variability in half-lives of chemicals.


Oh (de)bugger! Part II

4th September, 2010

It’s Friday night, and I recently discovered that adding a dash of Cointreau somewhat enhances a gin and tonic.  Consequently, I’m drunk-blogging.  Will try not to make too many tpyos.

In the first part of this series, I discussed some standard debugging techniques, like the use of the browser function.  One of the limitations of browser is that it isn’t much help if the problem is somewhere higher up the call-stack.  That is, the case when you passed a dodgy argument to another function, which maybe passed it on to something else before your code fell over.

If you have an error, but you don’t know where it is, then a helpful thing to see is your call stack, and the variables that you’ve put in it.  The function I’m going to show you uses sys.calls to retrieve the call stack, and sys.frames to retrieve the environment for each function you called.  Next it calls ls.str in each of those environments to tell you what variables you’ve created.  Finally, there are some methods to print your output more clearly.  After all, the whole point of this function is to give you quick insight into what you’ve just done.

When I was trying to think of a suitable name for this function, I toyed with variations on the word “stack”.  Eventually, I figured that it should be named for the word most commonly used when things go wrong.  And so, I present to you, the bugger function.


bugger <- function(...)
{
   # ... contains arguments passed to ls.str.  See that function for a description.
   stack_length <- sys.nframe()
   # -stack_length is so we don't include this function itself
   call_stack <- sys.calls()[-stack_length]
   frames <- sys.frames()[-stack_length]
   ls_stack <- lapply(frames, function(e) ls.str(envir = e, ...))
   call_names <- make.names(lapply(call_stack,
      function(cl) as.character(cl)[1]), unique = TRUE)
   names(call_stack) <- call_names
   names(ls_stack) <- call_names
   class(call_stack) <- "call_stack"
   class(ls_stack) <- "ls_stack"
   list(call_stack = call_stack, ls_stack = ls_stack)
}

The call to make.names is there to ensure that each item in call_stack and ls_stack has a unique name, and can thus be referenced more easily.   To finish, these two components get S3-style overloads of the print function.
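
A quick illustration of what make.names does with duplicates:

> make.names(c("foo", "foo", "foo"), unique = TRUE)
[1] "foo"   "foo.1" "foo.2"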


print.call_stack <- function(x, ...)
{
   n <- length(x)
   if(n < 1) cat("Call stack is empty\n")
   calls <- sapply(x, function(x) as.character(as.expression(x)))
   cat(paste(seq_len(n), calls, sep = ": "), sep = "\n")
   invisible(x)
}


print.ls_stack <- function(x, ...)
{
   n <- length(x)
   if(n < 1) cat("Call stack is empty\n")
   titles <- paste(seq_len(n), names(x), sep = ": ")
   for(i in seq_len(n))
   {
      cat(titles[i], "\n")
      print(x[[i]], ...)
      cat("\n")
   }
   invisible(x)
}

And finally, here’s an example to put the code to use


bar <- function(xx, ...)
{
   bugger()
}

foo <- function(x, y, z)
{
   another_variable <- 999
   bar(x, y, letters[1:5])
}

foo(1:10, "monkey", list(runif(3), zzz = "zzz"))

This yields the result

$call_stack
1: foo(1:10, "monkey", list(runif(3), zzz = "zzz"))
2: bar(x, y, letters[1:5])

$ls_stack
1: foo
another_variable : num 999
x : int [1:10] 1 2 3 4 5 6 7 8 9 10
y : chr "monkey"
z : List of 2
 $    : num [1:3] 0.154 0.974 0.189
 $ zzz: chr "zzz"

2: bar
xx : int [1:10] 1 2 3 4 5 6 7 8 9 10

One thing to note is that ls.str doesn't pick up the ... arguments.  It is possible to retrieve the call to those dots from each environment, but it isn't always possible to evaluate them.  Playing with R's evaluation model in this way is deep magic, so accessing them involves excessive use of substitute and eval.  I'm going to postpone talking about them until an occasion that doesn't involve gin and Cointreau.
