Archive for July, 2013

The Secrets of Inverse Brogramming, reprise

27th July, 2013

Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.

At useR2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here’s a very quick rundown of the ideas.

With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.

The thing is, though, that software developers have spent decades figuring out how to make writing code easier, so there are plenty of tips and tricks that can make your life easier.

My software dev bible is Steve McConnell’s Code Complete. Read it! It will change your life.

The minor downside is that, although very readable, it’s about 850 pages, so it takes some getting through.

The good news is that there are a couple of simple things you can do that, I think, have the highest productivity-to-effort ratio.

Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.
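As a purely hypothetical illustration of the kind of rules a style guide might cover (naming, spacing and assignment conventions), compare:

# without a consistent style
myFun<-function(x,na.rm=FALSE){mean(x,na.rm=na.rm)+1}

# with one: lower_under_case names, spaces around operators, <- for assignment
my_fun <- function(x, na.rm = FALSE)
{
  mean(x, na.rm = na.rm) + 1
}

Both versions do the same thing; only the second is easy to read a year later.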

There’s a style guide over the page, here.

Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)

The sig package helps you determine whether your code lives up to this ideal.

The sig function prints a function signature.

library(sig)
sig(read.csv)
## read.csv <- function(file, header = TRUE, sep = ",", quote = "\"", dec
##         = ".", fill = TRUE, comment.char = "", ...)

So far, so unexciting. The args and formals functions do much the same thing. (In fact the sig function is just a wrapper to formals with a pretty print method.)
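For comparison, the base equivalents look like this (output omitted, since its formatting varies a little between R versions):

args(read.csv)            # prints the arguments as a function header
formals(read.csv)         # returns the arguments as a named pairlist
str(formals(read.csv))    # a compact view of the same thing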

It gets more interesting, when you look at the signatures of lots of functions together. list_sigs prints all the sigs from a file or environment.

list_sigs(pkg2env(tools))
## add_datalist <- function(pkgpath, force = FALSE)
## 
## bibstyle <- function(style, envir, ..., .init = FALSE, .default =
##         TRUE)
## 
## buildVignettes <- function(package, dir, lib.loc = NULL, quiet =
##               TRUE, clean = TRUE, tangle = FALSE)
## 
## check_packages_in_dir <- function(dir, check_args = character(),
##                      check_args_db = list(), reverse = NULL,
##                      check_env = character(), xvfb = FALSE, Ncpus =
##                      getOption("Ncpus", 1), clean = TRUE, ...)

Even from just the first four signatures, you can see that the tools package has a style problem. add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.

I don’t want to pick on the tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts, but you really don’t want to write your own code like this.

write_sigs is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.
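A minimal sketch of that game, where mypackage is a placeholder for one of your own packages (check ?write_sigs for the exact argument names):

# dump the signatures of everything in mypackage to a file,
# then hand the file to a colleague and ask them to guess
write_sigs(pkg2env(mypackage), "mypackage_sigs.R")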

There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)

The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.

sig_report(
  pkg2env(Hmisc),
  too_many_args = 25,
  too_many_lines = 200
)
## The environment contains 509 variables of which 504 are functions. 
## Distribution of the number of input arguments to the functions:
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
##   4  62 117  90  41  41  22  17  14  17  15  10   8   4   6   3   1 
##  17  18  19  20  21  22  23  24  27  28  30  33  35  48  66 
##   4   5   3   2   4   2   2   2   2   1   1   1   1   1   1 
## These functions have more than 25 input args:
## [1] dotchart2                     event.chart                  
## [3] labcurve                      latex.default                
## [5] latex.summary.formula.reverse panel.xYplot                 
## [7] rlegend                       transcan                     
## Distribution of the number of lines of the functions:
##          1          2      [3,4]      [5,8]     [9,16]    [17,32] 
##          1         47         15         57         98        108 
##    [33,64]   [65,128]  [129,256]  [257,512] [513,1024] 
##         81         58         30          8          1 
## These functions have more than 200 lines:
##  [1] areg                         aregImpute                  
##  [3] event.chart                  event.history               
##  [5] format.df                    labcurve                    
##  [7] latex.default                panel.xYplot                
##  [9] plot.curveRep                plot.summary.formula.reverse
## [11] print.char.list              rcspline.plot               
## [13] redun                        rlegend                     
## [15] rm.boot                      sas.get                     
## [17] summary.formula              transcan

How R will turn into SQL

16th July, 2013

Up until very recently the only way of running R code was through the standard R distribution. Of course you could use another IDE, but somewhere underneath it all you would be running the same, standard R engine from the R-core team.

This is no longer your only option. A couple of weeks ago Radford Neal released pqR, a research project that reimplements pieces of the R engine to make it “pretty quick”. Much of pqR is likely to be folded back into the main R engine, but that isn’t the only new player in town.

A team at Tibco, including software architect Michael Sannella, have rewritten the R engine from the ground up for TERR, a commercial R distribution. This leads to some very interesting possibilities. Maybe in the next few years we could see many more R engines appearing. Oracle have recently invested heavily in R, and it’s certainly imaginable that they could create a high performance R that is tightly coupled to their database products. Google are also big R investors, and I could easily see them creating an R engine that has parallelism built in.

After that, perhaps even Microsoft might notice R and fix the fact that its integration with .NET is rubbish. IronR, anyone?

This situation isn’t as far-fetched as you may think: there is a very obvious precedent in the data analysis world. There are dozens of different database vendors that use a common interface language: SQL. In the world of databases, the engine is separate from the programming language.

This is a good thing – multiple vendors provide competition, differentiation and innovation. You have a full spectrum of products, from big corporate databases (Oracle again) down to miniature data stores like sqlite.

The fact that SQL has led the way means that some of the potential traps are visible before we embark down this road. My biggest bugbear with SQL is that it isn’t quite universal: different vendors have their own non-standard extensions, which means that not all SQL code is portable. A situation like HTML or C++ or Fortran, where a standards committee defines an official version of the language, would be preferable. Whether R-core or another body would set this standard is a matter to be decided. (I suspect that R-core would not welcome the additional administrative burden, and commercial vendors may want more control in defining the spec, so a separate R-standards group is more likely.)

These are interesting times for R, and I look forward to seeing how the separation of language and engine progresses.

Bootnotes:

  1. A sloppy sentence in a previous version of this post made it sound like Michael Sannella ran Tibco. He’s actually a software architect.
  2. The future is happening faster than we think. Renjin and Riposte are two other R engines.

user2013: The caret tutorial

9th July, 2013

This afternoon I went to Max Kuhn’s tutorial on his caret package. caret stands for Classification And REgression Training. It provides a consistent interface to nearly 150 different models in R, in much the same way as the plyr package provides a consistent interface to the apply functions.

The basic usage of caret is to split your data into training and test sets.

my_data <- split(my_data, runif(nrow(my_data)) > p) # p is the proportion of rows used for training
names(my_data) <- c("training", "testing")
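For what it’s worth, caret also has a helper for this step. A minimal sketch, assuming the original (unsplit) data frame has a column called response and you want an 80/20 split:

library(caret)
# createDataPartition samples within the levels of the outcome,
# so the class balance is roughly preserved in both sets
in_train <- createDataPartition(my_data$response, p = 0.8, list = FALSE)
training <- my_data[in_train, ]
testing  <- my_data[-in_train, ]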

Then call train on your training set.

training_model <- train(
  response ~ ., 
  data   = my_data$training,
  method = "a type of model!")

Then predict it using predict.

predictions <- predict(training_model, my_data$testing)

So the basic usage is very simple. The devil is of course in the statistical details. You still have to choose (at least one) type of model, and there are many options for how those models should be fit.
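Those fitting options are mostly controlled through trainControl. A minimal sketch, where the resampling settings and the "rf" method (which needs the randomForest package installed) are just illustrative choices:

fit_control <- trainControl(
  method = "cv",  # 10-fold cross-validation
  number = 10
)

training_model <- train(
  response ~ .,
  data      = my_data$training,
  method    = "rf",
  trControl = fit_control
)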

Max suggested that a good strategy for modelling is to begin with a powerful black box method (boosting, random forests or support vector machines) since they can usually provide excellent fits. The next step is to use a simple, understandable model (some form of regression perhaps) and see how much predictive power you are losing.

I suspect that in order to get the full benefit of caret, I’ll need to read Max’s book: Applied Predictive Modeling.

user2013: The Rcpp tutorial

9th July, 2013

I’m at user 2013, and this morning I attended Hadley Wickham and Romain Francois’s tutorial on the Rcpp package for calling C++ code from R. I’ve spent the last eight years avoiding C++ after having nightmares about obscure pointer bugs, so I went into the room slightly skeptical about this package.

I think the most important takeaway from the tutorial was a clear sense of when and why you might want to use C++.

The main selling point for using C++ with R is, in Hadley’s words, that R is optimised for making the programmer efficient whereas C++ is optimised for making the machine efficient, so the two languages are complementary. That is, most of the time the slow part of doing statistics is you. Occasionally, however, the slow part will be running your code, and in those instances C++ is better than R.

In order to write fast R code, it needs to be vectorised, and that often means using different functions from the ones you would use on scalar values. A classic example is using ifelse instead of separate if and else blocks, or using pmax instead of max.
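As a toy illustration, here are the scalar-style and the vectorised ways of capping a vector at zero:

x <- c(-2, 5, -1, 3)

# scalar thinking: an explicit loop with if and else
capped <- numeric(length(x))
for(i in seq_along(x))
{
  if(x[i] > 0) capped[i] <- x[i] else capped[i] <- 0
}

# vectorised thinking: one call handles the whole vector
ifelse(x > 0, x, 0)
pmax(x, 0)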

Knowing how to vectorise R code thus requires quite a large vocabulary of functions. In C++ there is no vectorisation – you just write a for loop.

There are three things in particular that C++ does much faster than R: the above-mentioned looping, resizing vectors and calling functions. (For the last point, Hadley quoted an overhead of 2ns to call a function in C++ versus 200ns in R.)

This means that C++ is useful for the following restricted use cases:

  1. When vectorisation is difficult or impossible. This is common when one element of a vector depends upon previous elements. MCMC is a classic example.
  2. When you are changing the size of a vector in a loop. Run length encoding was the example given.
  3. When you need to make millions of function calls. Recursive functions and some optimisation and simulation problems fit this category.

Typically C++ can give you an order of magnitude or two speed up over an R equivalent, but this is wildly problem-dependent and many of the built-in functions call C code which will run at the same speed (more or less) as a C++ version. It’s also important to consider how often the code will be run. Even if you have a thousand-fold speedup, if the running time of the R function is 0.01s, then you need to run it 60000 times just to get back the 10 minutes it took you to rewrite it in C++.

Anyway, using Rcpp makes it surprisingly simple to call C++ code. Under Windows you need to install Rtools, and of course you need the Rcpp package itself.

install.packages(c("installr", "Rcpp"))
library(installr)
install.Rtools()

Check that Rcpp is working by seeing if the following expression returns 2.

library(Rcpp)
evalCpp("1 + 1")

Then you can create a C++ function using the cppFunction function. Here’s a reimplementation of the any function. (It doesn’t deal with missing values; a sketch that does follows the usage example below.)

cppFunction('
  bool Any(LogicalVector x)
  {
    for(int i = 0; i < x.size(); ++i)
    {
      if(x[i])
      {
        return true;
      }
    }
    return false;
  }
')

Notice that in C++ you must be explicit about the types of the variables that are passed into and returned from a function. How to write C++ is beyond the scope of this post, so I’ll say no more.

You can now call the Any function like this.

Any(runif(10) > 0.5) #returns TRUE
Any(runif(10) > 1.5) #returns FALSE
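
As promised above, here’s a sketch of how missing values could be handled. AnyWithNA is just an illustrative name; it returns a length-one logical vector rather than a bool so that it can return NA, mimicking base any.

cppFunction('
  LogicalVector AnyWithNA(LogicalVector x)
  {
    bool seen_na = false;
    for(int i = 0; i < x.size(); ++i)
    {
      if(x[i] == NA_LOGICAL)
      {
        seen_na = true;            // remember the NA, keep looking for a TRUE
      }
      else if(x[i])
      {
        return LogicalVector::create(true);
      }
    }
    if(seen_na)
    {
      return LogicalVector::create(NA_LOGICAL);
    }
    return LogicalVector::create(false);
  }
')

AnyWithNA(c(FALSE, NA, TRUE))  # TRUE
AnyWithNA(c(FALSE, NA))        # NA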