Fearsome Engines, Part 1

Back in June I discovered pqR, Radford Neal’s fork of R designed to improve performance. Then in July, I heard about Tibco’s TERR, a C++ rewrite of the R engine suitable for the enterprise. At this point it dawned on me that R might end up like SQL, with many different implementations of a common language suitable for different purposes.

As it turned out, the future is nearer than I thought. As well as pqR and TERR, there are four other projects: Renjin, a Java-based rewrite that makes it easy to integrate with Java software and has some performance benefits; FastR, another Java-based engine focused on performance; Riposte, a C++ rewrite that also focuses on performance; and CXXR, a set of C++ modifications to GNU R that focus on maintainability and extensibility.

I think that having a choice of R engine is a good thing. The development model of one implementation with a team acting as gatekeepers as to what goes into the codebase is, well, a bit communist. Having alternatives should introduce competition and (go capitalism!) bring improvements at a greater rate.

The fact that we now have a choice of seven different engines for running R code is amazing news, but it takes a traditional R problem to a new level. Rather than just worrying about which implementation of an algorithm to use in which package, you now have to worry about which version of R to use.

In order to try and figure out which projects might be suitable for which purposes, and what stage each one is at, I spoke to members of each project. In alphabetical order of project, they were:

Andrew Runnalls from the University of Kent, talking about CXXR.

Jan Vitek and Tomas Kalibera from Purdue University, talking about FastR.

Radford Neal of the University of Toronto, talking about pqR.

Alex Bertram from BeDataDriven, talking about Renjin.

Justin Talbot from Tableau Software, talking about Riposte.

Lou Bajuk and Michael Sannella from Tibco, talking about TERR.

My interview with Andrew Runnalls was via phone, so his responses are paraphrased or imputed from my notes; likewise my recording of the conversation with Lou Bajuk and Michael Sannella failed, so their responses are also taken from notes. Other responses are taken from emails and Skype recordings, and as such are accurate.

I started by asking about the motivation for each project.

Before pqR, Radford had explored some possibilities for speed improvement in R.

Radford Neal:

Speed issues were what first prompted me to actually look at and modify the R interpreter. This came about when I happened to notice that (in R-2.11.1) curly brackets take less time than parentheses, and a*a is faster than a^2 when a is a long vector. This indicated to me that there must be considerable scope for improving R’s implementation. Previously, I hadn’t thought much about this, just assuming that R was close to some local optimum, so that large speed gains could be achieved only by a major rewrite.

I’ve also commented on various design flaws in R, however. In the longer term, I’m interested in ways of fixing these, while retaining backwards compatibility.

Riposte started life as a PhD project.

Justin Talbot:

Riposte was started during my Ph.D. at Stanford. I started off creating tools for visualizing and manipulating statistical models. Most of these tools were built using R as the back end to perform data manipulation and to create statistical models. I found that I was unable to get the interactive performance I wanted out of R for my visualization tools. This led me to start exploring R’s performance. Since there were other individuals in my research group working on other programming languages, it was a semi-natural transition to start working on improving R’s performance.

The main goal of Riposte is to see if it is possible to execute a dynamically-typed vector language at near the speed of well-written optimized C.

CXXR began as a simple quest for a feature.

Andrew Runnalls:

I started CXXR in 2007 when I was researching flight trials data. One of the features I missed when moving from S-Plus was an audit feature that searches your command history to find the code that created a particular variable. This hadn’t been ported to R, so I wanted to see if I could alter R’s internals to recreate the feature. However, being a “dyed-in-the-wool” OO programmer, when I started looking at the interpreter’s C code, its structure was so foreign that I felt ‘I’d rather not start from here!’ Thus it was that the project metamorphosed into a more general enterprise to refactor the interpreter into C++, with the provenance-tracking objective then becoming secondary.

The Renjin project was born of frustration, and a ridiculous amount of optimism (or maybe naivety).

Alex Bertram:

We were with a mobile operator. We had just designed a great model to predict customer churn, and we were trying to get R to run on their servers. They had a weird version of Unix, we couldn’t get [GNU] R to build. We couldn’t get the libraries to build. We spent such a lot of time trying to get the libraries to talk to each other. Then it couldn’t handle the load from the sales team. There’s got to be a better way, and I thought ‘man, how hard can it be?’

TERR has been waiting to happen for a long time.

Lou Bajuk:

I joined Mathsoft [later Insightful Software, the previous owners of S-Plus] in 1996. Even back then, we wanted to rebuild the S-Plus engine to improve it, but Insightful was too small and we didn’t have the manpower.

Tibco’s main priority is selling enterprise software, including Spotfire, and once Tibco bought Insightful, we were better positioned to embrace R. It made sense to integrate the software so that open source R could be used as a backend for Spotfire, and then to implement TERR as an enterprise-grade platform for the R language.

My favourite reason of all for starting a project was the one given by Jan Vitek.

Jan Vitek:

My wife is a statistician, and she was complaining about something [with GNU R] and I claimed that we computer scientists could do better. “Show me”, she said.

In the next part, I tell you about the technical achievements of each project.

  1. 7th September, 2013 at 16:40 pm

    Reblogged this on DECISION STATS.

  2. 7th September, 2013 at 17:31 pm

    Also include Revolution Analytics in the list.

  3. 7th September, 2013 at 18:24 pm

    I remember when Ruby went through this “phase”. It wasn’t pretty – it was fricking chaos! The only good thing that came out of it was a huge test suite that Ruby engine makers could test their implementations against. I gave up on Ruby as a result – I found I needed at least two versions of Ruby to get my jobs done and either ‘rbenv’ or ‘rvm’ to “manage” them.

    R will lose the battle for “hearts and minds” if it goes down this path. Python 3 and JavaScript will wipe the floor with us. We’re a niche language as it is, but when people discover they can do serious data mining in the browser and in Node.js backed by NoSQL databases, R’s history – all seven or ten engines notwithstanding.

    • 9th September, 2013 at 16:26 pm

      Yes Python is compelling but I do not see Node.js being commonly used for statistics anytime soon. Uhh how does NoSQL supplant data frames?

      This is not to say node.js isn’t great for real-time web services, but most data mining projects do not have to finish in the time it takes someone to click a button.

  4. 7th September, 2013 at 20:25 pm

    Really great blog Richie!

    Andrew Runnalls gave talks on CXXR at the useR!’s in Washington and Coventry (2010-2011 if I remember well), which gave me the idea that he was way earlier than the other engine builders. At the Coventry conference he showed a very compact example of how an infinite precision integer library could be hooked into CXXR with a few lines of templated C++. Impressive.

    Someone in the audience asked why GNU R is not written in C++. Brian Ripley, who attended the talk, replied that for a long time no (standards compliant) C++ compiler was available on every system that the core teams wants to support.

    I remember that around that time there was some online discussion between proponents of rebuilding R completely, or replacing it with something else altogether (see also the buzz that Julia is generating), and people like Radford Neal who see lots of potential in optimizing the current engine. It seems that the ‘battle’ has begun for real now, which I think is very exciting. To quote Frank Zappa: without deviation from the norm, progress is not possible.

    These are good days for statistical computeers.

  5. 8th September, 2013 at 1:05 am

    Hi Richie,
    I am happy you picked up on this topic, I am curious to read all of your series :)

    One small idea: consider contacting Dirk or Romain to give some input on the role of Rcpp in this “debate”, particularly since Dirk is the maintainer of the HPC task view (which means he is VERY likely to have interesting things to add).

    Yours,
    Tal

  6. abe
    23rd September, 2013 at 7:34 am

    Another interesting attribute to discuss for each of these projects is the license. With a complete independent rewrite of the interpreter comes the chance to rid the language of the vile GPL. It appears that Riposte is the only nice player here, although I couldn’t find the license on FastR (Renjin = GPL3???? – thanks, now all my packages have to be GPL). However, I suppose this discussion is moot since any interpreter will need base R to be useful, and no one wants to or can rewrite all of that, and then a good 95+% of R packages are GPL, spreading on the viral nastiness of GPL to everything that touches them.

    • 23rd September, 2013 at 10:45 am

      Not sure that “vile” is the right adjective for the GPL. Its purpose is to ensure liberation of source code, and it is very effective at doing that.

      I’ll talk about licenses when I get around to writing the next part.

    • 7th October, 2013 at 9:18 am

      I would prefer a more liberal license for Renjin as well, but I feel compelled to honor GNU R’s license given how much we’ve relied on it, particularly the R-language base packages which we use verbatim.

      I for one would like to understand the GPL a bit better – for example, does using Renjin through the javax.script APIs permit you to license your project as you choose? Do we need to grant something called a “classpath” exception? Should we be licensing some parts of the project with LGPL to enable “linking,” the meaning of which is unclear to me in the JVM world?

      And what do you mean by GPL3 => all your packages have to be GPL?

      If anyone can help with these sorts of questions, DM at alex@bedatadriven.com !
