Home > R > Stop! (In the name of a sensible interface)

Stop! (In the name of a sensible interface)

In my last post I talked about using the number of lines in a function as a guide to whether you need to break it down into smaller pieces. There are many other useful metrics for the complexity of a function, most notably cyclomatic complexity, which tracks the number of different routes that code can take. It’s non-trivial to calculate such a measure, and it seems that there is nothing currently available to calculate it for R functions. (The internet is curently on the case.) For now, we’ll use an easier, simpler measure of the complexity of a function: how many times if, ifelse or switch is called.

Let’s take a look at how complex the contents of base R are. First, as in the previous post, we need to retrieve all the functions. Since I seem to be trying to do this regularly, I’m wrapping the code into a function.

get_all_fns <- function(pattern = ".+")
{
  fn_names <- apropos(pattern)
  fns <- lapply(fn_names, get)
  names(fns) <- fn_names
  Filter(is.function, fns)
}
fns <- get_all_fns()

As before, we use deparse to turn the function’s body into an array of strings to examine. This time, we are looking for calls to if, ifelse or switch.

   
get_complexity <- function(fn) 
{
  body_lines <- deparse(body(fn))  
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  body_lines <- body_lines[grepl(rx, body_lines)]
  length(body_lines)
}
complexity <- sapply(fns, get_complexity)

Let’s take a look at the distribution of this complexity measure.

     
library(ggplot2)
hist_complexity <- ggplot(data.frame(complexity = complexity), aes(complexity)) + 
  geom_histogram(binwidth = 3) 
hist_complexity 

Histogram of complexity, by number of calls to if, ifelse or switch

Zero cases is the most common, which is nice to see, but we have some serial offenders over on the right hand side of the plot. Let’s see who the culprits are.

head(sort(complexity, decreasing=TRUE))
       library          arima    help.search       read.DIF         coplot [<-.data.frame 
            84             81             71             66             65             63

Hmm, it’s the same set of functions from the monster-function list before. This is to be expected in some ways, though it would be nicer if we had another measure to pick out dubious functions. One such measure that springs to mind is the number of exceptions that can be thrown. This is quite a subtle measure to read, since in general, code should “fail early and fail often”. That is, you want lots of exceptions to catch any problems, and you want them to be thrown as soon as possible, so you don’t waste time calculating things that were going to fail anyway. Thus more possible exceptions is better, except that too many means that if so many things can go wrong, then your function is too complicated.

Finding the number of possible exceptions works exactly the same as our previous example, only this time we look for calls to stop and stopifnot.

   
get_n_exceptions <- function(fn) 
{
  body_lines <- deparse(body(fn))  
  flow <- c("stop", "stopifnot")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  body_lines <- body_lines[grepl(rx, body_lines)]
  length(body_lines)
}
n_exceptions <- sapply(fns, get_n_exceptions)

Once again we examine the distribution …

  
hist_exceptions <- ggplot(data.frame(n_exceptions = n_exceptions), aes(n_exceptions)) + 
  geom_histogram(binwidth = 1) 
hist_exceptions 

Histogram of number of exceptions

and it seems that most code contains no exception throwing code. This is acceptable for non-user facing functions, since user input is the biggest cause of problems.

head(sort(n_exceptions, decreasing=TRUE)) 
      read.DIF        library [<-.data.frame          arima         arima0        glm.fit 
            17             16             15             14             13             13           

The function with the most potential exceptions to throw is read.DIF. File handling is notoriously problematic, so that’s fair enough. Load the survival package for a better example. The Surv function lets you define a censored vector, and it has an interface that’s either really clever or stupidly complicated. You can specify the censoring in many different ways, so the error checking gets rather complicated, and then it requires 20 calls to stop to prevent disaster.

So when you are writing a function and you see the 20th call to stop, that’s a hint that you may need to stop (if you want a sensible interface).

About these ads
  1. No comments yet.
  1. 14th August, 2011 at 16:43 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 217 other followers

%d bloggers like this: