### Archive

Archive for January, 2011

The cat function bugs me a little. There are two quirks in particular that I find irritating on occasions that I use it.

Firstly, almost everything that I want displayed onscreen, I want on its own line.

> cat("cat messes up my command prompt position")
cat messes up my command prompt position>


So it would be really nice if the function appended a newline to the things I input. Sure, I can manually add the newline, but it looks ugly, and I shouldn’t have to mix formatting with content. Fortunately, the fix is simple.

catn <- function(...) cat(..., "\n")
catn("Yes, I would like this content on its own line.")


Feel free to name the improved cat something more exciting, like tiger.

My second bugbear is the inability for cat to take sprintf style arguments. Again, this is easily fixed.

cats <- function(..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
{
cat(sprintf(...), file = file, sep = sep, fill = fill, labels = labels, append = append)
}
#Or, combining the two ideas
catsn <- function(...) catn(cats(...))
catsn("The temperature is %g Celcius in %s", -4, "Buxton")


Another tiny problem solved! Before you go using these lovely cat variations, it’s important to discuss when they should be used. Some functions use cat to display progress messages or similar information to the user. This is bad form: if you want to send the user a message then use the message function instead. (This gives you a newline automatically.) The important thing about the message function is that it can be turned off. Compare, e.g.,

count <- function(n) for(i in seq_len(n)) message(i)
count(4)
suppressMessages(count(4))


cat on the other hand should only be used inside print functions (where the user has explicitly requested content onscreen) or in cases where you need to tell the user something.

Tags: ,

## When 1 * x != x

Trying to dimly recall things from my maths degree, it seems that in most contexts the whole point of the number one is that it is a multiplicative identity. That is, for any x in your set, 1 * x is equal to x. It turns out that when you move to floating point numbers, in some programming lanugages, this isn’t always true.

In R, try the following

x <- Inf + 0i
1 * x
[1] Inf + NaNi


Something odd is happening, and my first reaction was that this is a bug. In fact, I reported it as such (in a slightly different form) but the terrifyingly wise Brian Ripley came up with the cause.

When you multiply two complex numbers, this is what happens

$(a + bi)(c + di) = (ac - bd) + (ad + bc)i$

In this instance

$(Inf + 0i)(1 + 0i) = (Inf * 1 - 0 * 0) + (Inf * 0 + 0 * 1)i = Inf + NaNi$

So, there is a reasonable argument that R is doing the right thing.

MATLAB works a little differently since all it’s numbers are implicitly complex. The inner workings of MATLAB are somewhat opaque for commercial reasons, but in order for compiled MATLAB code to work, I think it’s reasonable to assume that MATLAB stores its variables in a class that is similar to the MWArray class used by compiled code. As far as I can tell, each double matrix contains two matrices for storing real and imaginary components, and has an IsComplex property that is true whenever the complex part has any nonzero values. If IsComplex is false, only the real matrix is used for arithmetic.

Thus in MATLAB, when you define x = Inf + 0i it immediately resolves itself to simply Inf, and we don’t get the same weirdness.

x = Inf + 0i
1 * x
Inf


Other languages are divided on the subject: python and ruby agree with R but Mathematica (or at least Wolfram Alpha) agrees with MATLAB.

$\lim_{n \to \infty} (n + 0i)(1 + 0i) = \lim_{n \to \infty} (n * 1 - 0 * 0) + (n * 0 + 0 * 1)i = \lim_{n \to \infty} n = Inf$

This concurs with the MATLAB answer and I’m inclined to agree that it makes more sense, but the issue isn’t entirely clear cut. Changing the way complex arithmetic works for a single edge case is likely to introduce more confusion than clarity. It is perhaps sufficient for you to remember that complex infinity is a bit pathological with arithemtic in some languages, and add checks in your code if you think that that will cause a problem.

## Introducing the Lowry Plot

Here at the Health and Safety Laboratory* we’re big fans of physiologically-based pharmacokinetic (PBPK) models (say that 10 times fast) for predicting concentrations of chemicals around your body based upon an exposure. These models take the form of a big system of ODEs.

Because they contain many equations and consequently many parameters (masses of organs and blood flow rates and partition coefficients), it is important to do a sensitivity analysis on the model to determine which parameters are important and which are not. After all, it is tricky to accurately measure many of these parameters, so you need to know whether or not an error in a parameter will throw off the whole model.

How to do sensitivity analysis of an ODE-based model is a worthy topic of discussion, but not one I want to cover today. For now, it is sufficient to know the basic idea: you input different parameter combinations and see which parameters cause the model output to change the most.

For each parameter, the sensitivity analysis returns a main effect, which is the fraction of variance in model output accounted for by that parameter. Each parameter also has an interaction effect, which is the fraction of variance accounted for by that parameter’s interaction with all the other parameters. The main effect + interactions give the total effect for that parameter. Got it?

When I was working on this problem recently, I searched the literature for ways to display the results of the sensitivity analysis. “Draw a pie chart” it said. Not with less than 4 dimensions, thank you! “How about a bar chart?”, the literature suggested. How boring. It was clear that something better needed to be done.

Note that the plotting technique I’m about to describe is suitable for any sensitivity analysis, not just those of PBPK models.

The first step from the basic bar chart was to switch to a series of 3 Pareto plots for main effects, interactions and total effects. This was a big improvement, but having three graphs to look at was a pain.

I also tried a stacked bar chart for the total effects, but this still didn’t answer the most important question:

How many parameters should I include in my model?

This is a rather fuzzy question, but it is possible to restate it in a way that becomes solvable

What is the least number of parameters that I need to include to account for a given percentage of variance in my model?

It turns out that you don’t know an exact value for the amount of variance accounted for – you just know a range. The lower bound is the sum of the main effects of the parameters you include in the model. The upper bound is the sum of that total effects that you included. Actually we can do better than that. The effect of interactions are double counted (since each interaction is the interaction with all other parameters). So our second upper bound is that the included vairance can never exceed 100% minus the sum of all the main effects that we didn’t include.

I realise that there have been far too many long sentences in this post already, so I’ll skip to my solution. After some pondering, I realised that I could combine the Pareto plot with the stacked bar chart, which I call the Lowry Plot, named for the Lancashire artist – the resultant graph looks a lot like a smoke stack, of which he painted many.

If you include the first 4 parameters then, reading up from the fourth bar, you account for between 85% and 90% of the variance. Conversely, to include at least 90% of the variance, you find the bar where the whole ribbon is above 90%. In this case you need the first six parameters.

To recreate this, here’s some sample data taken from a model of m-xylene. The important thing is that you need 3 columns: parameter name, main effect and interaction.

library(ggplot2)

m_xylene_data <- data.frame(
Parameter = c(
"BW", "CRE", "DS", "KM", "MPY", "Pba", "Pfaa",
"Plia", "Prpda", "Pspda", "QCC", "QfaC", "QliC",
"QPC", "QspdC", "Rurine", "Vfac", "VliC", "Vmax"),
"Main Effect" = c(
1.03E-01, 9.91E-02, 9.18E-07, 3.42E-02, 9.27E-3, 2.82E-2, 2.58E-05,
1.37E-05, 5.73E-4, 2.76E-3, 6.77E-3, 8.67E-05, 1.30E-02,
1.19E-01, 4.75E-04, 5.25E-01, 2.07E-04, 1.73E-03, 1.08E-03),
Interaction = c(
1.49E-02, 1.43E-02, 1.25E-04, 6.84E-03, 3.25E-03, 7.67E-03, 8.34E-05,
1.17E-04, 2.04E-04, 7.64E-04, 2.84E-03, 8.72E-05, 2.37E-03,
2.61E-02, 6.68E-04, 4.57E-02, 1.32E-04, 6.96E-04, 6.55E-04
)
)


Sometimes it is easier to use the data in wide format, other times in long format. This data fortification process returns both, plus some extra columns.

fortify_lowry_data <- function(data,
param_var = "Parameter",
main_var = "Main.Effect",
inter_var = "Interaction")
{
#Convert wide to long format
mdata <- melt(data, id.vars = param_var)

#Order columns by main effect and reorder parameter levels
o <- order(data[, main_var], decreasing = TRUE)
data <- data[o, ]
data[, param_var] <- factor(
data[, param_var], levels = data[, param_var]
)

#Force main effect, interaction to be numeric
data[, main_var] <- as.numeric(data[, main_var])
data[, inter_var] <- as.numeric(data[, inter_var])

#total effect is main effect + interaction
data$.total.effect <- rowSums(data[, c(main_var, inter_var)]) #Get cumulative totals for the ribbon data$.cumulative.main.effect <- cumsum(data[, main_var])
data$.cumulative.total.effect <- cumsum(data$.total.effect)

#A quirk of ggplot2 means we need x coords of bars
data$.numeric.param <- as.numeric(data[, param_var]) #The other upper bound #.maximum = 1 - main effects not included data$.maximum <- c(1 - rev(cumsum(rev(data[, main_var])))[-1], 1)

data\$.valid.ymax <- with(data,
pmin(.maximum, .cumulative.total.effect)
)

mdata[, param_var] <- factor(
mdata[, param_var], levels = data[, param_var]
)
list(data = data, mdata = mdata)
}


The plot is essentially a bar chart plus a cumulative frequency ribbon, with some tweaks to make it prettier.

lowry_plot <- function(data,
param_var = "Parameter",
main_var = "Main.Effect",
inter_var = "Interaction",
x_lab = "Parameters",
y_lab = "Total Effects (= Main Effects + Interactions)",
ribbon_alpha = 0.5,
x_text_angle = 25)
{
#Fortify data and dump contents into plot function environment
data_list <- fortify_lowry_data(data, param_var, main_var, inter_var)
list2env(data_list, envir = sys.frame(sys.nframe()))

p <- ggplot(data) +
geom_bar(aes_string(x = param_var, y = "value", fill = "variable"),
data = mdata) +
geom_ribbon(
aes(x = .numeric.param, ymin = .cumulative.main.effect, ymax = .valid.ymax),
data = data,
alpha = ribbon_alpha) +
xlab(x_lab) +
ylab(y_lab) +
scale_y_continuous(formatter = "percent") +
opts(axis.text.x = theme_text(angle = x_text_angle, hjust = 1)) +
scale_fill_grey(end = 0.5) +
opts(legend.position = "top",
legend.title = theme_blank(),
legend.direction = "horizontal"
)
p
}

m_xylene_lowry <- lowry_plot(m_xylene_data)
m_xylene_lowry


EDIT: I’ve tweaked to code to make it a littler simpler; the resampling is no longer needed.

*Dear legal folk, this is a personal blog and I’m not officially speaking on behalf of my employer. That said, a substantial chunk of the people who care about these things love PBPK.

Tags: ,

## Really useful bits of code that are missing from R

There are some pieces of code that are so simple and obvious that they really ought to be included in base R somewhere.

Geometric mean and standard deviation – a staple for anyone who deals with lognormally distributed data.

geomean <- function(x, na.rm = FALSE, trim = 0, ...)
{
exp(mean(log(x, ...), na.rm = na.rm, trim = trim, ...))
}

geosd <- function(x, na.rm = FALSE, ...)
{
exp(sd(log(x, ...), na.rm = na.rm, ...))
}


A drop option for nlevels. Sure your factor has 99 levels, but how many of them actually crop up in your dataset?

nlevels <- function(x, drop = FALSE) base::nlevels(x[, drop = drop])


A way of converting factors to numbers that is quicker than as.numeric(as.character(my_factor)) and easier to remember than the method suggested in the FAQ on R.

factor2numeric <- function(f)
{
if(!is.factor(f)) stop("the input must be a factor")
as.numeric(levels(f))[as.integer(f)]
}


A “not in” operator. Not many people know the precedence rules well enough to know that !x %in% y means !(x %in% y) rather than (!x) %in% y, but x %!in% y should be clear to all.

"%!in%" <- function(x, y) !(x %in% y)


EDIT:
Thanks for all your suggestions. I had another idea while drifting off to sleep last night. The error message thrown by stopifnot is a little clunky, which means that I end up with lots of instances of if(!some_condition) stop("A nicer error message"). The factor2numeric function above is typical. If stopifnot allowed for custom error messages, I’d be much more inclined to use it.

stopifnot <- function (..., errmsg = NULL)
{
n <- length(ll <- list(...))
if (n == 0L)
return(invisible())
mc <- match.call()
for (i in 1L:n) if (!(is.logical(r <- ll[[i]]) && !any(is.na(r)) &&
all(r))) {
ch <- deparse(mc[[i + 1]], width.cutoff = 60L)
if (length(ch) > 1L)
ch <- paste(ch[1L], "....")
if(is.null(errmsg)) errmsg <- paste(ch, " is not ", if (length(r) > 1L)
"all ", "TRUE", sep = "")
stop(errmsg, call. = FALSE)
}
invisible()
}


ANOTHER EDIT:
Once you start thinking about this, it’s really easy to keep coming up with ideas. Checking to see if an object is scalar is easy – it just has to have length 1.

is.scalar <- function(x) length(x) == 1

Tags:

## My New Year’s Resolution: Be lazier

I wrote this on New Year’s Eve but given the contents it seemed more appropriate to post a few days late. In many walks of life, laziness is a terrible vice but for programmers and statisticians it can be an unsung virtue. You don’t believe me? Then read on to hear my ideas for virtuous laziness.

Idea 1: Write less code
Writing less code means less code to maintain which means even less work – a virtuous circle of laziness. Of course, you can’t just stop doing your job, but you can use existing packages instead of reinventing wheels and you can write code that is reusable (write functions instead of scripts and packages instead of loose functions).

Idea 2: … but code instead of clicking
There are loads of little tasks that computers (and other machines) do better than humans. You just need to tell them to do it! If you find yourself typing the same thing over and over again, write a function, script or macro so you don’t need to bother next time. Jobs that complete themsleves automatically are the best kinds of jobs!

Idea 3: If your can’t automate, then simplify
Of course, many tasks are tricky to automate. In that case, can you simplify your problem? If you’re doing bleeding edge research, can you make your task straightforward enough that other researchers can do it? If you’re doing something more routine, can you simplify it enough that the intern could take over? If you’re the intern, can you simplify your job so that a kid could do it? Now we’re nearly at the point of automation.

Idea 4: Give the gift of laziness
I know that a lot of you readers are coders and will probably have heard something like this idea before. Out there in the wider world, I don’t think the concept of automating tasks is as common. This doesn’t always mean programming things either. Over the coming days, teach your grandma about keyboard shortcuts or take the time to introduce a less technical colleague to the idea of network shortcuts or the button that minimizes all your windows. (It’s amazing how few people seem to know about that!)