Webcast on Writing Great R Code

19th September, 2013

While I’m promoting things, you might also want to know that I’m doing a webcast on how to write great R code next Wednesday. It’s at 6pm British Summer Time or 10am Pacific Daylight Time.

The big problem with being a data scientist is that you have to be both a statistician and a programmer, which is really two full-time jobs, and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.

You’ll get to hear about:

  • Writing stylish code.
  • Finding bad functions with the sig package.
  • Writing robust code with the assertive package.
  • Testing your code with the testthat package.
  • Documenting your code with the roxygen2 package.

It’s going to be pitched at a beginner’s-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.
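
To give a flavour of the robustness and testing themes, here's a minimal sketch (not taken from the webcast itself) using the assertive and testthat packages; the geometric_mean function is just an illustrative example.

library(assertive)
library(testthat)

# Robust code: fail fast with a clear message if the input isn't numeric.
geometric_mean <- function(x) {
  assert_is_numeric(x)
  exp(mean(log(x)))
}

# Tested code: check the behaviour we expect in a unit test.
test_that("geometric_mean handles a simple case", {
  expect_equal(geometric_mean(c(1, 4, 16)), 4)
})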

O’Reilly R ebooks half price – today only

19th September, 2013

Including my hot-off-the-press Learning R. Buy two copies!

http://oreil.ly/1eRrhP1

Fearsome Engines, Part 1

7th September, 2013

Back in June I discovered pqR, Radford Neal’s fork of R designed to improve performance. Then in July, I heard about Tibco’s TERR, a C++ rewrite of the R engine suitable for the enterprise. At this point it dawned on me that R might end up like SQL, with many different implementations of a common language suitable for different purposes.

As it turned out, the future is nearer than I thought. As well as pqR and TERR, there are four other projects: Renjin, a Java-based rewrite that makes it easy to integrate with Java software and has some performance benefits; FastR, another Java-based engine focused on performance; Riposte, a C++ rewrite that also focuses on performance; and CXXR, a set of C++ modifications to GNU R that focus on maintainability and extensibility.

I think that having a choice of R engine is a good thing. The development model of one implementation with a team acting as gatekeepers as to what goes into the codebase is, well, a bit communist. Having alternatives should introduce competition and (go capitalism!) bring improvements at a greater rate.

The fact that we now have a choice of seven different engines for running R code is amazing news, but it takes a traditional R problem to a new level. Rather than just worrying about which implementation of an algorithm to use in which package, you now have to worry about which version of R to use.

In order to try and figure out which projects might be suitable for which purposes, and what stage each one is at, I spoke to members of each project. In alphabetical order of project, they were:

Andrew Runnalls from the University of Kent, talking about CXXR.

Jan Vitek and Tomas Kalibera from Purdue University, talking about FastR.

Radford Neal of the University of Toronto, talking about pqR.

Alex Bertram from BeDataDriven, talking about Renjin.

Justin Talbot from Tableau Software, talking about Riposte.

Lou Bajuk and Michael Sannella from Tibco, talking about TERR.

My interview with Andrew Runnalls was via phone, so his responses are paraphrased or imputed from my notes; likewise my recording of the conversation with Lou Bajuk and Michael Sannella failed, so their responses are also taken from notes. Other responses are taken from emails and Skype recordings, and as such are accurate.

I started by asking about the motivation for each project.

Before pqR, Radford had explored some possibilities for speed improvement in R.

Radford Neal:

Speed issues were what first prompted me to actually look at and modify the R interpreter. This came about when I happened to notice that (in R-2.11.1) curly brackets take less time than parentheses, and a*a is faster than a^2 when a is a long vector. This indicated to me that there must be considerable scope for improving R’s implementation. Previously, I hadn’t thought much about this, just assuming that R was close to some local optimum, so that large speed gains could be achieved only by a major rewrite.

I’ve also commented on various design flaws in R, however. In the longer term, I’m interested in ways of fixing these, while retaining backwards compatibility.

Riposte started life as a PhD project.

Justin Talbot:

Riposte was started during my Ph.D. at Stanford. I started off creating tools for visualizing and manipulating statistical models. Most of these tools were built using R as the back end to perform data manipulation and to create statistical models. I found that I was unable to get the interactive performance I wanted out of R for my visualization tools. This led me to start exploring R’s performance. Since there were other individuals in my research group working on other programming languages, it was a semi-natural transition to start working on improving R’s performance.

The main goal of Riposte is to see if it is possible to execute a dynamically-typed vector language at near the speed of well-written optimized C.

CXXR began as a simple quest for a feature.

Andrew Runnalls:

I started CXXR in 2007 when I was researching flight trials data. One of the features I missed when moving from S-Plus was an audit feature that searches your command history to find the code that created a particular variable. This hadn’t been ported to R, so I wanted to see if I could alter R’s internals to recreate the feature. However, being a “dyed-in-the-wool” OO programmer, when I started looking at the interpreter’s C code, its structure was so foreign that I felt ‘I’d rather not start from here!’ Thus it was that the project metamorphosed into a more general enterprise to refactor the interpreter into C++, with the provenance-tracking objective then becoming secondary.

The Renjin project was born of frustration, and a ridiculous amount of optimism (or maybe naivety).

Alex Bertram:

We were with a mobile operator. We had just designed a great model to predict customer churn, and we were trying to get R to run on their servers. They had a weird version of Unix, we couldn’t get [GNU] R to build. We couldn’t get the libraries to build. We spent such a lot of time trying to get the libraries to talk to each other. Then it couldn’t handle the load from the sales team. There’s got to be a better way, and I thought ‘man, how hard can it be?’

TERR has been waiting to happen for a long time.

Lou Bajuk:

I joined Mathsoft [later Insightful Software, the previous owners of S-Plus] in 1996. Even back then, we wanted to rebuild the S-Plus engine to improve it, but Insightful was too small and we didn’t have the manpower.

Tibco’s main priority is selling enterprise software, including Spotfire, and once Tibco bought Insightful, we were better positioned to embrace R. It made sense to integrate the software so that open source R could be used as a backend for Spotfire, and then to implement TERR as an enterprise-grade platform for the R language.

My favourite reason of all for starting a project was the one given by Jan Vitek.

Jan Vitek:

My wife is a statistician, and she was complaining about something [with GNU R] and I claimed that we computer scientists could do better. “Show me”, she said.

In the next part, I tell you about the technical achievements of each project.

The tenure of Doctor Who incarnations

3rd August, 2013

With a new actor being announced tomorrow, I got to pondering the good Doctor. Specifically, who is the longest-serving Doctor? IMDB has the data:

whos <- data.frame(
  doctor = c(
    "William Hartnell",
    "Patrick Troughton",
    "Jon Pertwee",
    "Tom Baker",
    "Peter Davison",
    "Colin Baker",
    "Sylvester McCoy",
    "Paul McGann",
    "Christopher Ecclestone",
    "David Tennant",
    "Matt Smith"
  ),
  n_episodes = c(
    136,
    127,
    129,
    173,
    70,
    35,
    42,
    1,
    13,
    49,
    44
  ),
  stringsAsFactors = FALSE
)
whos$doctor <- factor(whos$doctor, levels = whos$doctor)

Let’s plot it to see how it changes over time.

library(ggplot2)
(bars <- ggplot(whos, aes(doctor, n_episodes)) +
  geom_bar(stat = "identity") +
  coord_flip()
) 

The plot shows that earlier Doctors typically appeared in more episodes.

There was a definite shift after Tom Baker towards a shorter term as the Doctor. In terms of screen time, the shift is a little less pronounced due to the move from 25 minutes to 45 minutes per episode from Christopher Eccleston onwards (and for one of the Colin Baker series).
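
To put a rough number on that screen-time point, here's a back-of-the-envelope sketch. The 25 and 45 minute figures are approximations, ignoring specials and the one 45-minute Colin Baker series.

# Rough approximation: the first eight Doctors' episodes ran about 25
# minutes each, the 2005 revival (the last three rows) about 45 minutes.
minutes_per_episode <- c(rep(25, 8), rep(45, 3))
whos$screen_time_hours <- whos$n_episodes * minutes_per_episode / 60
whos[, c("doctor", "n_episodes", "screen_time_hours")]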


The Secrets of Inverse Brogramming, reprise

27th July, 2013

Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.

At useR2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here’s a very quick rundown of the ideas.

With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.

The thing is though, software developers have spent decades figuring out how to make writing code easier, so there are lots of tips and tricks that can make your life easier.

My software dev bible is Steve McConnell’s Code Complete. Read it! It will change your life.

The minor downside is that, although very readable, it’s about 850 pages, so it takes some getting through.

The good news is that there are a couple of simple things that you can do that I think have the highest productivity-to-effort ratio.

Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.

There’s a style guide over the page, here.
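
As a tiny, purely illustrative example of the sort of thing a style guide covers:

# Hard to read: inconsistent spacing, naming and assignment operators
x=c(1,2,3);My.Total<-sum( x )

# Easier to read: one statement per line, consistent names and spacing
x <- c(1, 2, 3)
my_total <- sum(x)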

Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)

The sig package helps you determine whether your code lives up to this ideal.

The sig function prints a function signature.

library(sig)
sig(read.csv)
## read.csv <- function(file, header = TRUE, sep = ",", quote = "\"",
##         dec = ".", fill = TRUE, comment.char = "", ...)

So far, so unexciting. The args and formals functions do much the same thing. (In fact the sig function is just a wrapper to formals with a pretty print method.)
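
For comparison, here's what those base R equivalents give you:

# formals returns a pairlist of argument names and default values;
# args returns a function skeleton whose printout shows the signature.
formals(read.csv)
args(read.csv)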

It gets more interesting when you look at the signatures of lots of functions together. list_sigs prints all the sigs from a file or environment.

list_sigs(pkg2env(tools))
## add_datalist <- function(pkgpath, force = FALSE)
## 
## bibstyle <- function(style, envir, ..., .init = FALSE, .default =
##         TRUE)
## 
## buildVignettes <- function(package, dir, lib.loc = NULL, quiet =
##               TRUE, clean = TRUE, tangle = FALSE)
## 
## check_packages_in_dir <- function(dir, check_args = character(),
##                      check_args_db = list(), reverse = NULL,
##                      check_env = character(), xvfb = FALSE, Ncpus =
##                      getOption("Ncpus", 1), clean = TRUE, ...)

Even from just the first four signatures, you can see that the tools package has a style problem. add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.

I don’t want to pick on the tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts, but you really don’t want to write your own code like this.

write_sigs is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.
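
A minimal sketch of that game, assuming write_sigs takes the environment and then the file to write to (the stats package is just an example):

library(sig)
# Dump the signatures of a package to a file you can hand to a colleague.
write_sigs(pkg2env(stats), "stats_signatures.R")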

There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)

The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.

sig_report(
  pkg2env(Hmisc),
  too_many_args = 25,
  too_many_lines = 200
)
## The environment contains 509 variables of which 504 are functions. 
## Distribution of the number of input arguments to the functions:
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
##   4  62 117  90  41  41  22  17  14  17  15  10   8   4   6   3   1 
##  17  18  19  20  21  22  23  24  27  28  30  33  35  48  66 
##   4   5   3   2   4   2   2   2   2   1   1   1   1   1   1 
## These functions have more than 25 input args:
## [1] dotchart2                     event.chart                  
## [3] labcurve                      latex.default                
## [5] latex.summary.formula.reverse panel.xYplot                 
## [7] rlegend                       transcan                     
## Distribution of the number of lines of the functions:
##          1          2      [3,4]      [5,8]     [9,16]    [17,32] 
##          1         47         15         57         98        108 
##    [33,64]   [65,128]  [129,256]  [257,512] [513,1024] 
##         81         58         30          8          1 
## These functions have more than 200 lines:
##  [1] areg                         aregImpute                  
##  [3] event.chart                  event.history               
##  [5] format.df                    labcurve                    
##  [7] latex.default                panel.xYplot                
##  [9] plot.curveRep                plot.summary.formula.reverse
## [11] print.char.list              rcspline.plot               
## [13] redun                        rlegend                     
## [15] rm.boot                      sas.get                     
## [17] summary.formula              transcan

How R will turn into SQL

16th July, 2013

Up until very recently the only way of running R code was through the standard R distribution. Of course you could use another IDE, but somewhere underneath it all you would be running the same, standard R engine from the R-core team.

This is no longer your only option. A couple of weeks ago Radford Neal released pqR, a research project that reimplements pieces of the R engine to make it “pretty quick”. Much of pqR is likely to be folded back into the main R engine, but that isn’t the only new player in town.

A team at Tibco, including software architect Michael Sannella, have rewritten the R engine from the ground up for TERR, a commercial R distribution. This leads to some very interesting possibilities. Maybe in the next few years we could see many more R engines appearing. Oracle have recently invested heavily in R, and it’s certainly imaginable that they could create a high-performance R that is tightly coupled to their database products. Google are also big R investors, and I could easily see them creating an R engine that has parallelism built in.

After that, perhaps even Microsoft might notice R and fix the fact that its integration with .NET is rubbish. IronR, anyone?

This situation isn’t as far-fetched as you may think: there is a very obvious precedent in the data analysis world. There are dozens of different database vendors that use a common interface language: SQL. In the world of databases, the engine is separate from the programming language.

This is a good thing – multiple vendors provide competition, differentiation and innovation. You have a full spectrum of products from big corporate databases (Oracle again) down to miniature data stores like SQLite.

The fact that SQL has led the way means that some of the potential traps are visible before we embark down this road. My biggest bugbear with SQL is that it isn’t quite universal: different vendors have their own non-standard extensions, which means that not all SQL code is portable. A situation like HTML, C++ or Fortran, where a standards committee defines an official version of the language, would be preferable. Whether R-core or another body would set this standard is a matter to be decided. (I suspect that R-core would not welcome the additional administrative burden, and commercial vendors may want more control in defining the spec, so a separate R-standards group is more likely.)

These are interesting times for R, and I look forward to seeing how the separation of language and engine progresses.

Bootnotes:

  1. A sloppy sentence in a previous version of this post made it sound like Michael Sannella ran Tibco. He’s actually a software architect.
  2. The future is happening faster than we think. Renjin and Riposte are two other R engines.

useR2013: The caret tutorial

9th July, 2013

This afternoon I went to Max Kuhn’s tutorial on his caret package. caret stands for Classification And REgression Training. It provides a consistent interface to nearly 150 different models in R, in much the same way as the plyr package provides a consistent interface to the apply functions.

The basic usage of caret is to split your data into training and test sets.

my_data <- split(my_data, runif(nrow(my_data)) > p) # for some value of p (the training proportion)
names(my_data) <- c("training", "testing")
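
As an aside, caret has its own helper for this job. Here's a minimal sketch using createDataPartition, which samples within quantiles of the response so both sets get a similar distribution; the column name response and p = 0.8 are illustrative choices, and my_data here is the original, unsplit data frame.

library(caret)
# Indices of rows to use for training; the remaining rows form the test set.
in_training <- createDataPartition(my_data$response, p = 0.8, list = FALSE)
training <- my_data[in_training, ]
testing  <- my_data[-in_training, ]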

Then call train on your training set.

training_model <- train(
  response ~ ., 
  data   = my_data$training,
  method = "a type of model!")

Then predict it using predict.

predictions <- predict(training_model, my_data$testing)

So the basic usage is very simple. The devil is of course in the statistical details. You still have to choose (at least one) type of model, and there are many options for how those models should be fit.

Max suggested that a good strategy for modelling is to begin with a powerful black box method (boosting, random forests or support vector machines) since they can usually provide excellent fits. The next step is to use a simple, understandable model (some form of regression perhaps) and see how much predictive power you are losing.
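
A hypothetical sketch of that strategy, sticking with the objects defined above (the method names are illustrative; any caret-supported regression methods would do):

# Fit a black-box model and a simple one on the training set...
rf_model <- train(response ~ ., data = my_data$training, method = "rf")
lm_model <- train(response ~ ., data = my_data$training, method = "lm")

# ...then compare test-set accuracy to see how much the simple model loses.
postResample(predict(rf_model, my_data$testing), my_data$testing$response)
postResample(predict(lm_model, my_data$testing), my_data$testing$response)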

I suspect that in order to get the full benefit of caret, I’ll need to read Max’s book: Applied Predictive Modeling.
