The big problem with being a data scientist is that you have to be both a statistician and a programmer, which is really two full-time jobs, and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.
You’ll get to hear about:
- Writing stylish code.
- Finding bad functions with the sig package.
- Writing robust code with the assertive package.
- Testing your code with the testthat package.
- Documenting your code with the roxygen2 package.
It’s going to be pitched at a beginner’s-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.
Including my hot-off-the-press Learning R. Buy two copies!
Back in June I discovered pqR, Radford Neal’s fork of R designed to improve performance. Then in July, I heard about Tibco’s TERR, a C++ rewrite of the R engine suitable for the enterprise. At this point it dawned on me that R might end up like SQL, with many different implementations of a common language suitable for different purposes.
As it turned out, the future is nearer than I thought. As well as pqR and TERR, there are four other projects: Renjin, a Java-based rewrite that makes it easy to integrate with Java software and has some performance benefits; fastR, another Java-based engine focused on performance; Riposte, a C++ rewrite that also focuses on performance; and CXXR, a set of C++ modifications to GNU R that focus on maintainability and extensibility.
I think that having a choice of R engine is a good thing. The development model of one implementation with a team acting as gatekeepers as to what goes into the codebase is, well, a bit communist. Having alternatives should introduce competition and (go capitalism!) bring improvements at a greater rate.
The fact that we now have a choice of seven different engines for running R code is amazing news, but it takes a traditional R problem to a new level. Rather than just worrying about which implementation of an algorithm to use in which package, you now have to worry about which version of R to use.
To figure out which projects might be suitable for which purposes, and what stage each one is at, I spoke to members of each project, in alphabetical order of project.
My interview with Andrew Runnalls was via phone, so his responses are paraphrased or imputed from my notes; likewise my recording of the conversation with Lou Bajuk and Michael Sannella failed, so their responses are also taken from notes. Other responses are taken from emails and Skype recordings, and as such are accurate.
I started by asking about the motivation for each project.
Before pqR, Radford had explored some possibilities for speed improvement in R.
Speed issues were what first prompted me to actually look at and modify the R interpreter. This came about when I happened to notice that (in R-2.11.1) curly brackets take less time than parentheses, and a*a is faster than a^2 when a is a long vector. This indicated to me that there must be considerable scope for improving R’s implementation. Previously, I hadn’t thought much about this, just assuming that R was close to some local optimum, so that large speed gains could be achieved only by a major rewrite.
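Radford’s observation is easy to try for yourself. Here’s a minimal sketch; exact timings will vary by R version and hardware, and later R versions have narrowed or removed these gaps, so treat the numbers as illustrative only.

```r
# Compare a^2 with a * a on a long vector.
# Timings depend on your R version and machine.
a <- runif(1e7)

t_pow  <- system.time(for (i in 1:20) b1 <- a^2)["elapsed"]
t_mult <- system.time(for (i in 1:20) b2 <- a * a)["elapsed"]

c(power = t_pow, multiply = t_mult)

# Whatever the timings say, the results are numerically identical.
stopifnot(isTRUE(all.equal(b1, b2)))
```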
I’ve also commented on various design flaws in R, however. In the longer term, I’m interested in ways of fixing these, while retaining backwards compatibility.
Riposte started life as a PhD project.
Riposte was started during my Ph.D. at Stanford. I started off creating tools for visualizing and manipulating statistical models. Most of these tools were built using R as the back end to perform data manipulation and to create statistical models. I found that I was unable to get the interactive performance I wanted out of R for my visualization tools. This led me to start exploring R’s performance. Since there were other individuals in my research group working on other programming languages, it was a semi-natural transition to start working on improving R’s performance.
The main goal of Riposte is to see if it is possible to execute a dynamically-typed vector language at near the speed of well-written optimized C.
CXXR began as a simple quest for a feature.
I started CXXR in 2007 when I was researching flight trials data. One of the features I missed when moving from S-Plus was an audit feature that searches your command history to find the code that created a particular variable. This hadn’t been ported to R, so I wanted to see if I could alter R’s internals to recreate the feature. However, being a “dyed-in-the-wool” OO programmer, when I started looking at the interpreter’s C code, its structure was so foreign that I felt ‘I’d rather not start from here!’ Thus it was that the project metamorphosed into a more general enterprise to refactor the interpreter into C++, with the provenance-tracking objective then becoming secondary.
The Renjin project was born of frustration, and a ridiculous amount of optimism (or maybe naivety).
We were with a mobile operator. We had just designed a great model to predict customer churn, and we were trying to get R to run on their servers. They had a weird version of Unix, we couldn’t get [GNU] R to build. We couldn’t get the libraries to build. We spent such a lot of time trying to get the libraries to talk to each other. Then it couldn’t handle the load from the sales team. There’s got to be a better way, and I thought ‘man, how hard can it be?’
TERR has been waiting to happen for a long time.
I joined Mathsoft [later Insightful Software, the previous owners of S-Plus] in 1996. Even back then, we wanted to rebuild the S-Plus engine to improve it, but Insightful was too small and we didn’t have the manpower.
Tibco’s main priority is selling enterprise software, including Spotfire, and once Tibco bought Insightful, we were better positioned to embrace R. It made sense to integrate the software so that open source R could be used as a backend for Spotfire, and then to implement TERR as an enterprise-grade platform for the R language.
My favourite reason of all for starting a project was the one given by Jan Vitek.
My wife is a statistician, and she was complaining about something [with GNU R] and I claimed that we computer scientists could do better. “Show me”, she said.
In the next part, I tell you about the technical achievements of each project.
whos <- data.frame(
  doctor = c(
    "William Hartnell", "Patrick Troughton", "Jon Pertwee", "Tom Baker",
    "Peter Davison", "Colin Baker", "Sylvester McCoy", "Paul McGann",
    "Christopher Eccleston", "David Tennant", "Matt Smith"
  ),
  n_episodes = c(136, 127, 129, 173, 70, 35, 42, 1, 13, 49, 44),
  stringsAsFactors = FALSE
)
whos$doctor <- factor(whos$doctor, levels = whos$doctor)
Let’s plot it to see how it changes over time.
library(ggplot2)
(bars <- ggplot(whos, aes(doctor, n_episodes)) +
  geom_bar(stat = "identity") +
  coord_flip()
)
There was a definite shift after Tom Baker towards a shorter term as the Doctor. In terms of screen time, the shift is a little less pronounced due to the move from 25 minutes to 45 minutes per episode from Christopher Eccleston onwards (and for one of the Colin Baker series).
Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.
At useR 2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here’s a very quick rundown of the ideas.
With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.
The thing is, though, software developers have spent decades figuring out how to make writing code easier, and many of their tips and tricks can make your life simpler too.
My software dev bible is Steve McConnell’s Code Complete. Read it! It will change your life.
The minor downside is that, although very readable, it’s about 850 pages, so it takes some getting through.
The good news is that there are a couple of simple things you can do that I think have the highest productivity-to-effort ratio.
Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.
There’s a style guide over the page, here.
Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)
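As a hypothetical illustration (these function and argument names are mine, not from any package), compare a signature that explains itself with one that hides the same behaviour:

```r
# Good: the name and argument names describe what the function does.
standardize <- function(x, center = TRUE, scale = TRUE) {
  if (center) x <- x - mean(x)
  if (scale)  x <- x / sd(x)
  x
}

# Bad: identical code, but the signature tells the reader nothing.
f <- function(x, a = TRUE, b = TRUE) {
  if (a) x <- x - mean(x)
  if (b) x <- x / sd(x)
  x
}
```

Reading only the first signature, you can guess the function’s purpose and how to switch off each step; the second forces you to open the source.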
The sig package helps you determine whether your code lives up to this ideal. The sig function prints a function signature.
library(sig)
sig(read.csv)
## read.csv <- function(file, header = TRUE, sep = ",", quote = "\"",
##   dec = ".", fill = TRUE, comment.char = "", ...)
So far, so unexciting. The formals function does much the same thing. (In fact, the sig function is just a wrapper to formals with a prettier print method.)
It gets more interesting when you look at the signatures of lots of functions together. The list_sigs function prints all the sigs from a file or environment.
list_sigs(pkg2env(tools))
## add_datalist <- function(pkgpath, force = FALSE)
##
## bibstyle <- function(style, envir, ..., .init = FALSE, .default = TRUE)
##
## buildVignettes <- function(package, dir, lib.loc = NULL, quiet = TRUE,
##   clean = TRUE, tangle = FALSE)
##
## check_packages_in_dir <- function(dir, check_args = character(),
##   check_args_db = list(), reverse = NULL, check_env = character(),
##   xvfb = FALSE, Ncpus = getOption("Ncpus", 1), clean = TRUE, ...)
Even from just the first four signatures, you can see that the tools package has a style problem: add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.
I don’t want to pick on the tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts – but you really don’t want to write your own code like this.
write_sigs is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.
There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)
The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.
sig_report(
  pkg2env(Hmisc),
  too_many_args = 25,
  too_many_lines = 200
)
## The environment contains 509 variables of which 504 are functions.
## Distribution of the number of input arguments to the functions:
##  0  1   2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
##  4 62 117 90 41 41 22 17 14 17 15 10  8  4  6  3  1
## 17 18 19 20 21 22 23 24 27 28 30 33 35 48 66
##  4  5  3  2  4  2  2  2  2  1  1  1  1  1  1
## These functions have more than 25 input args:
## dotchart2, event.chart, labcurve, latex.default,
## latex.summary.formula.reverse, panel.xYplot, rlegend, transcan
## Distribution of the number of lines of the functions:
##       1       2    [3,4]     [5,8]    [9,16]    [17,32]
##       1      47       15        57        98        108
## [33,64] [65,128] [129,256] [257,512] [513,1024]
##      81       58        30         8          1
## These functions have more than 200 lines:
## areg, aregImpute, event.chart, event.history, format.df, labcurve,
## latex.default, panel.xYplot, plot.curveRep, plot.summary.formula.reverse,
## print.char.list, rcspline.plot, redun, rlegend, rm.boot, sas.get,
## summary.formula, transcan
Up until very recently the only way of running R code was through the standard R distribution. Of course you could use another IDE, but somewhere underneath it all you would be running the same, standard R engine from the R-core team.
This is no longer your only option. A couple of weeks ago Radford Neal released pqR, a research project that reimplements pieces of the R engine to make it “pretty quick”. Much of pqR is likely to be folded back into the main R engine, but that isn’t the only new player in town.
A team at Tibco, including software architect Michael Sannella, have rewritten the R engine from the ground up for TERR, a commercial R distribution. This leads to some very interesting possibilities. Maybe in the next few years we could see many more R engines appearing. Oracle have recently invested heavily in R, and it’s certainly imaginable that they could create a high-performance R that is tightly coupled to their database products. Google are also big R investors, and I could easily see them creating an R engine that has parallelism built in.
After that, perhaps even Microsoft might notice R and fix the fact that its integration with .NET is rubbish. IronR, anyone?
This situation isn’t as far-fetched as you may think: there is a very obvious precedent in the data analysis world. There are dozens of different database vendors that use a common interface language: SQL. In the world of databases, the engine is separate from the programming language.
This is a good thing – multiple vendors provides competition, differentiation and innovation. You have a full spectrum of products from big corporate databases (Oracle again) down to miniature data stores like sqlite.
The fact that SQL has led the way means that some of the potential traps are visible before we embark down this road. My biggest bugbear with SQL is that it isn’t quite universal: different vendors have their own non-standard extensions, which means that not all SQL code is portable. A situation like HTML or C++ or Fortran, where a standards committee defines an official version of the language would be preferable. Whether R-core or another body would set this standard is a matter to be decided. (I suspect that R-core would not welcome the additional administrative burden, and commercial vendors may want more control in defining the spec, so a separate R-standards group is more likely.)
These are interesting times for R, and I look forward to seeing how the separation of language and engine progresses.
This afternoon I went to Max Kuhn’s tutorial on his caret package. caret stands for Classification And REgression Training. It provides a consistent interface to nearly 150 different models in R, in much the same way as the plyr package provides a consistent interface to the apply functions.
The basic usage of caret is to split your data into training and test sets.
my_data <- split(my_data, runif(nrow(my_data)) > p)  # for some value of p
names(my_data) <- c("training", "testing")
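caret also ships its own helper for splitting data. Here’s a sketch using createDataPartition, which samples within the levels of the outcome so that class proportions stay balanced; I’m using the built-in iris data purely for illustration.

```r
library(caret)

set.seed(42)  # make the split reproducible

# Indices for an 80% training set, stratified by the outcome.
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

training <- iris[in_train, ]
testing  <- iris[-in_train, ]
```

Stratified splitting matters most when some classes are rare: a plain random split can leave a class underrepresented in (or entirely absent from) the training set.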
Then you call train on your training set.
training_model <- train(
  response ~ .,
  data = my_data$training,
  method = "a type of model!"
)
Then you make predictions on the test set using the predict function.
predictions <- predict(training_model, my_data$testing)
So the basic usage is very simple. The devil is of course in the statistical details. You still have to choose (at least one) type of model, and there are many options for how those models should be fit.
Max suggested that a good strategy for modelling is to begin with a powerful black box method (boosting, random forests or support vector machines) since they can usually provide excellent fits. The next step is to use a simple, understandable model (some form of regression perhaps) and see how much predictive power you are losing.
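Max’s strategy can be sketched with train itself. This is a minimal illustration on the built-in iris data, assuming the randomForest and rpart packages are installed; it fits a black-box random forest, then a single interpretable classification tree, and compares their resampled accuracy.

```r
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Black box first: a random forest, which usually fits well.
rf_fit <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

# Then a simple, understandable model: a single classification tree.
tree_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# How much predictive power do you lose by simplifying?
resamps <- resamples(list(forest = rf_fit, tree = tree_fit))
summary(resamps)
```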
I suspect that in order to get the full benefit of caret, I’ll need to read Max’s book: Applied Predictive Modeling.