
Fearsome Engines Part 3: Which one should you use?

13th October, 2013

There are lots of R engines emerging! I’ve interviewed members of each of the teams involved in these projects. In part 1 of this series, we covered the motivation of each project. Part 2 looked at the technical achievements and new features. This part tries to determine which projects are suitable for which users.

Compatibility

CXXR and pqR are forks of GNU R, and have retained full package compatibility. This means that they should work out of the box on your existing R code (though note that both are Linux only and require you to build from source).

The other four engines are complete rebuilds, and consequently have had to recreate compatibility.

TERR currently has over 1200 completely compatible packages, and many more partially compatible ones.

Renjin also has, I think, over one thousand compatible packages. Helpfully, there’s a list of which CRAN packages work, and which don’t. (I haven’t bothered counting the blue dots, but it looks like almost half the packages are compatible.)

Riposte and FastR are at an earlier stage of development: FastR has no package support at all, and Riposte is just getting around to it.

Justin Talbot:

A couple of months back I started a big push to transition Riposte from an academic project to a full-featured drop-in replacement for R. Right now, Riposte can load the base and recommended packages with just a few errors, which is a substantial improvement over a couple months back when it couldn’t load even one.

Interestingly, one of the big headaches reported by several engine developers is that some packages make assumptions about the internals of R. For example, they assume that a numeric vector is a pointer to an array of doubles, which makes it harder for other engines to extend the functionality. It seems that some work is needed on the R Language Definition to clarify exactly what a package should and should not be allowed to assume.
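To make the internals problem concrete, here is a minimal sketch. It is written in Python rather than R or C, and every class and function name in it is hypothetical; it does not reproduce the API of GNU R or of any of the engines discussed here. It assumes one engine stores a numeric vector as a contiguous block of doubles, while another represents a sequence like 1:n lazily, and shows that "package" code reaching into the storage only works on the first.

```python
from array import array


class DenseVector:
    """Hypothetical engine A: a numeric vector is a contiguous array of doubles."""

    def __init__(self, values):
        self.storage = array("d", values)  # internal detail, not part of the interface

    def length(self):
        return len(self.storage)

    def get(self, i):
        return self.storage[i]


class LazySequence:
    """Hypothetical engine B: a sequence like 1:n is kept as (start, n) and computed on demand."""

    def __init__(self, start, n):
        self.start = start
        self.n = n

    def length(self):
        return self.n

    def get(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return float(self.start + i)


def sum_via_internals(vec):
    # "Package" code that assumes the vector is backed by an array of doubles.
    return sum(vec.storage)


def sum_via_interface(vec):
    # "Package" code that only uses the documented accessors.
    return sum(vec.get(i) for i in range(vec.length()))
```

In this sketch, sum_via_interface works with both representations, whereas sum_via_internals fails with an AttributeError on LazySequence because there is no underlying array to reach into. A clearer language definition would, in effect, spell out which of these two styles packages are allowed to rely on.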

Licensing

pqR, CXXR, Renjin and FastR are all licensed under the GPL. Riposte is also open source, under the more permissive 2-clause BSD licence.

TERR, by contrast, is a closed-source product. It comes in both free and paid-for versions.

Ecosystem

Having lots of different engines is mostly great, but fragmentation is a big worry. Web development suffered for a long time (and still does, a little bit) from having to write different code for different browsers. SQL is subtly different for different databases, with vendor-specific extensions rampant.

All the engine developers have stated their intention to work towards (or retain) full compatibility with GNU R, and use that as the reference implementation. Certainly fragmentation is a problem that no-one wants.

Good intentions are all very well, but I was curious to know how much interaction there has been between the different teams, and with R-Core.

Between teams:

Tomas Kalibera previously worked at the University of Kent, where Andrew Runnalls is based. Surprisingly, they didn’t have much contact, as they were working in different areas at the time.

Jan Vitek was on Justin Talbot’s dissertation committee, so there has been some contact between the FastR and Riposte teams.

Justin has also spoken with Alex Bertram of Renjin.

Overall, there hasn’t been that much contact. I think I’d be less worried about fragmentation if the teams talked with each other more. (TERR is an exception in this; as a commercial product, a certain level of secrecy is required.)

A few R-Core names crop up in relation to several projects:

Doug Bates has been involved with CXXR.

Duncan Murdoch helped get some of pqR’s bug fixes merged into GNU R, and Radford Neal has had some contact with Luke Tierney and Robert Gentleman.

Luke Tierney has helped with the bytecode compilation in Renjin.

John Chambers, Ross Ihaka and Simon Urbanek have provided feedback on the TERR project.

Luke Tierney, Duncan Temple-Lang and Robert Gentleman have provided feedback to FastR.

Luke Tierney has helped on Riposte.

So at least some members of R-Core are aware of these projects, and there is some collaboration going on. Personally, I’d quite like to see a more formal arrangement with a language committee trying to refine the R Language Definition to aid compatibility between projects. This is probably just a legacy of my time as a civil servant.

The FastR team organized a number of workshops that featured members of the R-Core team (one on virtual execution environments for scientific computing and one on big data). After the SPLASH conference in October 2013, key developers from the Renjin and TERR engines will meet up with the FastR team, along with some R-Core members, at an NSF workshop on scalable data analytics organized by Vitek.

The future

As a commercial project, TERR lives or dies by its ability to be sold (it does have a real customer base already, predominantly in the oil & gas, life sciences and consumer goods sectors).

Renjin is supported by BeDataDriven’s consultancy work, so it also has ongoing financial support.

The other four projects are all academic, so their long term support is trickier.

Justin works for Tableau Software, and they let him develop Riposte in his 20% time, but additional developers would help the project.

Justin Talbot:

Right now the team is just me. Zach has moved on to his dissertation work, the very cool Terra project (http://terralang.org). I’d be more than happy to expand the team if anyone is interested in helping out. Right now Riposte is at a place where there is a lot of easily factorable work in supporting popular packages or R’s internal functions that people could pick up if they wanted to. As I said at the beginning, Riposte’s overriding goal is to figure out how to execute dynamically-typed vector code like R at near the performance of highly optimized C. I think this sets it apart from a number of the other R projects like CXXR, Renjin, and pqR, where better performance is nice, but not the driving design goal. If Riposte’s mission interests anyone, please get in contact!

CXXR and pqR are also solo projects and would benefit from developers.

Radford Neal:

I’m treating pqR as a fork, without any expectation that it will necessarily be merged with the version maintained by the R Core Team.

One necessary thread of work on pqR consists of fixing bugs, getting it to work in more environments (Windows, Mac, with GUIs), and adding features from later R Core releases. I’d like to have it compatible soon with R-2.15.1, and later with at least R-2.15.3.

There are still many areas in which performance can be improved without major design changes.

At some point I’d like to get back to doing statistics, so I hope that other people will also get involved with pqR.

FastR has a long term goal of being maintained by statisticians and not computer scientists!

Jan Vitek:

The hope, in an ideal world, is eventually that we could turn FastR over to R-Core.

As well as developers, of course, the biggest thing that these projects need is a userbase. Feedback and bug reports are hugely important, so a great way for you to contribute is simply to try these projects out. Failing that, telling other R users that these projects exist is an excellent start. Get tweeting!

Fearsome Engines, Part 1

7th September, 2013

Back in June I discovered pqR, Radford Neal’s fork of R designed to improve performance. Then in July, I heard about Tibco’s TERR, a C++ rewrite of the R engine suitable for the enterprise. At this point it dawned on me that R might end up like SQL, with many different implementations of a common language suitable for different purposes.

As it turned out, the future is nearer than I thought. As well as pqR and TERR, there are four other projects: Renjin, a Java-based rewrite that makes it easy to integrate with Java software and has some performance benefits; FastR, another Java-based engine focused on performance; Riposte, a C++ rewrite that also focuses on performance; and CXXR, a set of C++ modifications to GNU R that focus on maintainability and extensibility.

I think that having a choice of R engine is a good thing. The development model of one implementation with a team acting as gatekeepers as to what goes into the codebase is, well, a bit communist. Having alternatives should introduce competition and (go capitalism!) bring improvements at a greater rate.

The fact that we now have a choice of seven different engines for running R code is amazing news, but it takes a traditional R problem to a new level. Rather than just worrying about which implementation of an algorithm to use in which package, you now have to worry about which version of R to use.

In order to try and figure out which projects might be suitable for which purposes, and what stage each one is at, I spoke to members of each project. In alphabetical order of project, they were:

Andrew Runnalls from the University of Kent, talking about CXXR.

Jan Vitek and Tomas Kalibera from Purdue University, talking about FastR.

Radford Neal of the University of Toronto, talking about pqR.

Alex Bertram from BeDataDriven, talking about Renjin.

Justin Talbot from Tableau Software, talking about Riposte.

Lou Bajuk and Michael Sannella from Tibco, talking about TERR.

My interview with Andrew Runnalls was via phone, so his responses are paraphrased or imputed from my notes; likewise my recording of the conversation with Lou Bajuk and Michael Sannella failed, so their responses are also taken from notes. Other responses are taken from emails and Skype recordings, and as such are accurate.

I started by asking about the motivation for each project.

Before pqR, Radford had explored some possibilities for speed improvement in R.

Radford Neal:

Speed issues were what first prompted me to actually look at and modify the R interpreter. This came about when I happened to notice that (in R-2.11.1) curly brackets take less time than parentheses, and a*a is faster than a^2 when a is a long vector. This indicated to me that there must be considerable scope for improving R’s implementation. Previously, I hadn’t thought much about this, just assuming that R was close to some local optimum, so that large speed gains could be achieved only by a major rewrite.

I’ve also commented on various design flaws in R, however. In the longer term, I’m interested in ways of fixing these, while retaining backwards compatibility.

Riposte started life as a PhD project.

Justin Talbot:

Riposte was started during my Ph.D. at Stanford. I started off creating tools for visualizing and manipulating statistical models. Most of these tools were built using R as the back end to perform data manipulation and to create statistical models. I found that I was unable to get the interactive performance I wanted out of R for my visualization tools. This led me to start exploring R’s performance. Since there were other individuals in my research group working on other programming languages, it was a semi-natural transition to start working on improving R’s performance.

The main goal of Riposte is to see if it is possible to execute a dynamically-typed vector language at near the speed of well-written optimized C.

CXXR began as a simple quest for a feature.

Andrew Runnalls:

I started CXXR in 2007 when I was researching flight trials data. One of the features I missed when moving from S-Plus was an audit feature that searches your command history to find the code that created a particular variable. This hadn’t been ported to R, so I wanted to see if I could alter R’s internals to recreate the feature. However, being a “dyed-in-the-wool” OO programmer, when I started looking at the interpreter’s C code, its structure was so foreign that I felt ‘I’d rather not start from here!’ Thus it was that the project metamorphosed into a more general enterprise to refactor the interpreter into C++, with the provenance-tracking objective then becoming secondary.
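The audit idea Andrew describes, searching a command history for the code that created a variable, can be sketched in a few lines. This is only an illustration in Python, with a hypothetical function name and a made-up history; it is not how S-Plus or CXXR implement provenance tracking. It also assumes simple top-level assignments with `<-` or `=`, and ignores things like `<<-`, `assign()` and assignments inside functions.

```python
import re


def find_provenance(history, name):
    """Return the most recent command in `history` that assigns to `name`, or None."""
    # Match "name <- ..." or "name = ..." at the start of a command, but not "name == ...".
    pattern = re.compile(r"^\s*" + re.escape(name) + r"\s*(<-|=)(?!=)")
    for command in reversed(history):
        if pattern.search(command):
            return command
    return None


# A made-up session history of R commands, oldest first.
history = [
    "x <- rnorm(100)",
    "y <- x^2",
    "plot(x, y)",
    "x <- runif(100)",
]
```

With this history, find_provenance(history, "y") returns "y <- x^2", and find_provenance(history, "x") returns the later of the two assignments to x. A real provenance tracker would also need to record which version of x was used to build y, which is the harder part of the problem.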

The Renjin project was born of frustration, and a ridiculous amount of optimism (or maybe naivety).

Alex Bertram:

We were with a mobile operator. We had just designed a great model to predict customer churn, and we were trying to get R to run on their servers. They had a weird version of Unix, we couldn’t get [GNU] R to build. We couldn’t get the libraries to build. We spent such a lot of time trying to get the libraries to talk to each other. Then it couldn’t handle the load from the sales team. There’s got to be a better way, and I thought ‘man, how hard can it be?’

TERR has been waiting to happen for a long time.

Lou Bajuk:

I joined Mathsoft [later Insightful Software, the previous owners of S-Plus] in 1996. Even back then, we wanted to rebuild the S-Plus engine to improve it, but Insightful was too small and we didn’t have the manpower.

Tibco’s main priority is selling enterprise software, including Spotfire, and once Tibco bought Insightful, we were better positioned to embrace R. It made sense to integrate the software so that open source R could be used as a backend for Spotfire, and then to implement TERR as an enterprise-grade platform for the R language.

My favourite reason of all for starting a project was the one given by Jan Vitek.

Jan Vitek:

My wife is a statistician, and she was complaining about something [with GNU R] and I claimed that we computer scientists could do better. “Show me”, she said.

In the next part, I tell you about the technical achievements of each project.