the big problem with being a data scientist is that you have to be a statistician and a programmer, which is really two full time jobs, and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.
You’ll get to hear about:
- Writing stylish code.
- Finding bad functions with the sig package.
- Writing robust code with the assertive package.
- Testing your code with the testthat package.
- Documenting your code with the roxygen2 package.
It’s going to be pitched at a beginner’s-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.
Including my hot-off-the-press Learning R. Buy two copies!
Back in June I discovered pqR, Radford Neal’s fork of R designed to improve performance. Then in July, I heard about Tibco’s TERR, a C++ rewrite of the R engine suitable for the enterprise. At this point it dawned on me that R might end up like SQL, with many different implementations of a common language suitable for different purposes.
As it turned out, the future is nearer than I thought. As well as pqR and TERR, there are four other projects: Renjin, a Java-based rewrite that makes it easy to integrate with Java software and has some performance benefits; fastR, another Java-based engine focused on performance; Riposte, a C++ rewrite that also focuses on performance; and CXXR, a set of C++ modifications to GNU R that focus on maintainability and extensibility.
I think that having a choice of R engine is a good thing. The development model of one implementation with a team acting as gatekeepers as to what goes into the codebase is, well, a bit communist. Having alternatives should introduce competition and (go capitalism!) bring improvements at a greater rate.
The fact that we now have a choice of seven different engines for running R code is amazing news, but it takes a traditional R problem to a new level. Rather than just worrying about which implementation of an algorithm to use in which package, you now have to worry about which version of R to use.
In order to try and figure out which projects might be suitable for which purposes, and what stage each one is at, I spoke to members of each project. In alphabetical order of project, they were:
My interview with Andrew Runnalls was via phone, so his responses are paraphrased or imputed from my notes; likewise my recording of the conversation with Lou Bajuk and Michael Sannella failed, so their responses are also taken from notes. Other responses are taken from emails and Skype recordings, and as such are accurate.
I started by asking about the motivation for each project.
Before pqR, Radford had explored some possibilities for speed improvement in R.
Speed issues were what first prompted me to actually look at and modify the R interpreter. This came about when I happened to notice that (in R-2.11.1) curly brackets take less time than parentheses, and a*a is faster than a^2 when a is a long vector. This indicated to me that there must be considerable scope for improving R’s implementation. Previously, I hadn’t thought much about this, just assuming that R was close to some local optimum, so that large speed gains could be achieved
only by a major rewrite.
I’ve also commented on various design flaws in R, however. In the longer term, I’m interested in ways of fixing these, while retaining backwards compatibility.
Riposte started life as a PhD project.
Riposte was started during my Ph.D. at Stanford. I started off creating tools for visualizing and manipulating statistical models. Most of these tools were built using R as the back end to perform data manipulation and to create statistical models. I found that I was unable to get the interactive performance I wanted out of R for my visualization tools. This led me to start exploring R’s performance. Since there were other individuals in my research group working on other programming languages, it was a semi-natural transition to start working on improving R’s performance.
The main goal of Riposte is to see if it is possible to execute a dynamically-typed vector language at near the speed of well-written optimized C.
CXXR began as a simple quest for a feature.
I started CXXR in 2007 when I was researching flight trials data. One of the features I missed when moving from S-Plus was an audit feature that searches your command history to find the code that created a particular variable. This hadn’t been ported to R, so I wanted to see if I could alter R’s internals to recreate the feature. However, being a “dyed-in-the-wool” OO programmer, when I started looking at the interpreter’s C code, its structure was so foreign that I felt ‘I’d rather not start from here!’ Thus it was that the project metamorphosed into a more general enterprise to refactor the interpreter into C++, with the provenance-tracking objective then becoming secondary.
The Renjin project was born of frustration, and a ridiculous amount of optimism (or maybe naivety).
We were with a mobile operator. We had just designed a great model to predict customer churn, and we were trying to get R to run on their servers. They had a weird version of Unix, we couldn’t get [GNU] R to build. We couldn’t get the libraries to build. We spent such a lot of time trying to get the libraries to talk to each other. Then it couldn’t handle the load from the sales team. There’s got to be a better way, and I thought ‘man, how hard can it be?’
TERR has been waiting to happen for a long time.
I joined Mathsoft [later Insightful Software, the previous owners of S-Plus] in 1996. Even back then, we wanted to rebuild the S_Plus engine to improve it, but Insightful was too small and we didn’t have the manpower.
Tibco’s main priority is selling enterprise software, including Spotfire, and once Tibco bought Insightful, we were better positioned to embrace R. It made sense to integrate the software so that open source R could be used as a backend for Spotfire, and then to implement TERR as an enterprise-grade platform for the R language.
My favourite reason of all for starting a project was the one given by Jan Vitek.
My wife is a statistician, and she was complaining about something [with GNU R] and I claimed that we computer scientists could do better. “Show me”, she said.
In the next part, I tell you about the technical achievments of each project.