Posts Tagged ‘r-summit’

Everyone loves R markdown and Github; stories from the R Summit, day two

28th June, 2015 6 comments

More excellent talks today!

Andrie de Vries of Microsoft kicked off today’s talks with a demo of checkpoint. This is his package for assisting reproducibility by letting you install packages from a specific date.

The idea is that a lot of R packages, particularly those from PhD projects, don’t get maintained and suffer bitrot. That means that they often don’t work with current versions of R and current packages.

Since I now work in a university, and our team is trying to make sure we release an accompanying package of the data analysis steps with every paper, having a system for reproducibility is important to me. I’ve played around with packrat, and it is nice but a bit too much effort to bother with on a day-to-day basis. I’ve not used checkpoint before, but from Andrie’s demo it seems a little easier. You just add these lines to the top of your script:

library(checkpoint)
checkpoint("2014-08-18", R.version = "3.0.3")

and it checks you are running the correct version of R, downloads the packages that your script uses from an archived snapshot of CRAN for the date you specified, and sets them up as a new library. Easy reproducibility.

Jeroen Ooms of UCLA (sort of) talked about his streaming suite of packages: curl, jsonlite and mongolite, for downloading web data, converting JSON to and from R objects, and working with MongoDB databases.

There have been other packages around for each of these three tasks, but the selling point is that Jeroen is a disgustingly talented coder and has written definitive versions.

jsonlite takes care to be consistent about the weird edge cases when converting between R and JSON. (The JSON spec doesn’t support infinity or NA, so you get a bit of control about what happens there, for example.)
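As a small sketch of that control (the vector here is made up for illustration), the na argument of toJSON chooses how missing and non-finite values are written:

```r
library(jsonlite)

# A made-up numeric vector containing the awkward cases.
x <- c(1.5, NA, Inf)

# Write missing and non-finite values as JSON null ...
as_null <- toJSON(x, na = "null")

# ... or as strings, so the distinction between NA and Inf is kept.
as_string <- toJSON(x, na = "string")

print(as_null)
print(as_string)
```

Which option is right depends on whoever consumes the JSON: null is safer for strict parsers, while strings keep more information.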

jsonlite also supports ndjson, where each line is a JSON object. This is important for large files: you can just parse one line at a time, then return the whole thing as a list or data frame, or you can define line-specific behaviour.

This last case is used for streaming functionality. You use stream_in to read an ndjson file, then parse a line, manipulate the result, maybe stream_out to somewhere else, then move on to the next line. It made me wish I was doing fancy things with the twitter firehose.
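Here is a minimal sketch of that read, manipulate, write-out loop. The file names, the toy data and the doubled column are all invented for illustration; jsonlite actually hands the handler a page of lines at a time (pagesize) rather than a single line.

```r
library(jsonlite)

# Create a small ndjson file to stream from (toy data, one JSON object per line).
infile <- tempfile(fileext = ".ndjson")
outfile <- tempfile(fileext = ".ndjson")
toy <- data.frame(id = 1:6, value = c(2, 4, 6, 8, 10, 12))
stream_out(toy, file(infile), verbose = FALSE)

# Keep the output connection open so that every page is appended to it.
out_con <- file(outfile, open = "wb")

stream_in(
  file(infile),
  handler = function(page) {
    # page is a data frame holding the current batch of lines.
    page$doubled <- page$value * 2
    stream_out(page, out_con, verbose = FALSE)
  },
  pagesize = 2,
  verbose = FALSE
)
close(out_con)

# Read the processed file back in one go to inspect it.
result <- stream_in(file(outfile), verbose = FALSE)
print(result)
```

Only one page is held in memory at a time inside the handler, which is the point of the approach for files too big to load whole.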

The MongoDB stuff also sounded interesting; I didn’t quite get a grasp of what makes it better than the other mongo packages, but he gave some examples of fast document searching.

Gabor Csardi of Harvard University talked about METACRAN, which is his spare time hobby.

One of the big sticking points in many people’s work in R is trying to find the best package to do something. There are a lot of tools you can use for this: Task Views, MRAN, the sos package, and so on. However, none of them are very good at recommending the best package for a given task.

This is where METACRAN comes in. Gabor gave a nice demo where he showed that when you search for “networks”, it successfully returns his igraph package. Um.

Jokes aside, it does seem like a very useful tool. His site also gives you information on trending packages.

He also mentioned another project, a read-only GitHub mirror of CRAN, that lets you see what’s been updated when a new version of a package reaches CRAN. (Each package is a repository and each new version on CRAN counts as one commit.)

Peter Dalgaard of Copenhagen Business School, R-Core member and organiser of the summit, talked about R development conventions and directions.

He mentioned that many of the development principles that have shaped R were decided back in 1999, when R was desperately trying to gain credibility with organisations using SAS and Stata. This is, for example, why the base R distribution contains packages like nnet and spatial. While many users may not need to use neural networks or spatial statistics, it was important for the fledgeling language to be seen to have these capabilities built-in.

Peter said that some ubiquitous user-contributed packages are being considered for inclusion in the base distribution. data.table, Rcpp and plyr in particular were mentioned. The balancing act is that more effort would go into ensuring that these packages work with new versions of R, but it requires more effort from R-Core, which is a finite resource.

Some other things that Peter talked about were that R-core worry about whether or not their traditional approach of being very conservative with the code base is too strict, and slows R’s development; and whether they should be more aggressive about removing quirks. This last point was referencing my talk from yesterday, so I was pleased that he had been listening.

Joe Rickert of Microsoft talked about the R community. He pointed out that “I use Excel, you use Excel, we have so much in common” is a conversation no-one has ever had. Joe thinks that “community” is mostly meaningless marketing speak, but R has it for real. I’m inclined to agree.

He talked about the R Consortium, which I’ve somehow failed to hear about before. The point of the organisation is (mostly) to help build R infrastructure projects. The first big project is [REDACTED!], though official details are still secret, so there’s a bit of reading between the lines.

Users will be allowed to submit proposals for things for the consortium to build, and they get voted on (not sure if this is by users or the consortium members).

Joe also had a nice map of R User groups around the world, though my nearest one was several countries away. I guess I’d better start one of my own. If anyone in Qatar is interested in an R User Group, let me know in the comments.

Bettina Grün of Johannes Kepler Universitat talked about the R Journal. She discussed the history of the journal, and the topics that you can write about: packages, programming hints, and applications of R. There is also some content about changes in R and CRAN, and conference announcements.

One thing I didn’t get to ask her is how you blind a review of a paper about a package, since the package author is usually pretty easy to determine. My one experience of reviewing for the R Journal involved a paper about an update to the grid package, and included links to content on the University of Auckland website, and it was pretty clear that the only person that could have written it was Paul Murrell.

Mine Çetinkaya-Rundel of Duke University gave an impressive overview of how she teaches R to her undergraduate students. She suggested that while teaching programming at the same time as data analysis seems like it ought to make it harder, running a few lines of code often takes less instruction than telling students where to point and click.

Her other teaching tips included: work on datasets that are big enough to make working in Excel annoying, so they appreciate programming (and R) more; use interactive examples; you get more engagement with real-world datasets; and force the students to learn a reproducible workflow by making them write R markdown documents.

Mine also talked briefly about her other projects: Datafest, which is a weekend-long data analysis competition, and reach, a Coursera data analysis course.

Jenny Bryan of the University of British Columbia had another talk about teaching R, this time to grad students. She’s developed a pretty slick workflow where each student submits their assignments (also R markdown documents) into github repos, which makes it really easy to check and run their code, comment on it, give them hints via pull requests, and let them peer review each other’s code. Since using git seems to be an essential skill for data scientists these days, it seems like a good idea to teach it explicitly while they are at university.

Jenny also talked about methods of finding interesting code via github search. An example she gave was that if you want to see how vapply works, rather than limiting yourself to the examples on the ?vapply help page, you can search all the packages on CRAN by going to Gabor’s github CRAN mirror (or maybe Winston Chang’s R-source mirror) and typing vapply user:cran extension:R.

Karthik Ram of the University of California, Berkeley, headlined the day, talking about his work with ROpenSci, which is an organisation that creates open tools for data analysis (mostly R packages).

They have a ridiculously extensive set of packages for downloading online datasets, retrieving text corpuses, publishing your results, and working with spatial data.

ROpenSci also host regular community events and group phone calls.

Overall, it was an exciting day, and now I’m looking forward to going to the useR conference. See you in Aalborg!


The Workflow of Infinite Shame, and other stories from the R Summit

27th June, 2015 5 comments

At day one of the R Summit at Copenhagen Business School there was a lot of talk about the performance of R, and alternate R interpreters.

Luke Tierney of the University of Iowa, the author of the compiler package, and R-Core member who has been working on R’s performance since, well, pretty much since R was created, talked about future improvements to R’s internals.

Plans to improve R’s performance include implementing proper reference counting (that is tracking how many variables point at a particular bit of memory; the current version counts like zero/one/two-or-more, and a more accurate count means you can do less copying). Improving scalar performance and reducing function overhead are high priorities for performance enhancement. Currently when you do something like

for(i in 1:100000000) {}

R will allocate a vector of length 100000000, which takes a ridiculous amount of memory. By being smart and realising that you only ever need one number at a time, you can store the vector much more efficiently. The same principle applies for seq_len and seq_along.
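To make the idea concrete, here is a toy sketch in plain R. It is not how R’s internals do (or will do) it; compact_seq and compact_get are hypothetical names invented for this illustration. The sequence is stored as just a start and a length, and each element is computed on demand.

```r
# Hypothetical helper: describe the sequence from:to without materialising it.
compact_seq <- function(from, to) {
  structure(
    list(from = as.integer(from), length = as.integer(to - from + 1L)),
    class = "compact_seq"
  )
}

# Hypothetical helper: compute the i-th element only when it is asked for.
compact_get <- function(x, i) {
  stopifnot(i >= 1L, i <= x$length)
  x$from + i - 1L
}

s <- compact_seq(1, 100000000)

# The description stays tiny however long the sequence is.
print(object.size(s))

# Elements are produced one at a time, as a loop would need them.
first <- compact_get(s, 1L)
last <- compact_get(s, s$length)
```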

Other possible performance improvements that Luke discussed include having a more efficient data structure for environments, and storing complex objects like model results more efficiently. (How often do you use that qr element in an lm model anyway?)
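To see how much of a fitted model that qr element accounts for, here is a quick sketch using the built-in mtcars data (the choice of model is arbitrary). The qr = FALSE argument to lm drops the component, although methods such as summary rely on it, so this is an illustration, not a recommendation.

```r
# Fit an arbitrary model on a built-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Size in bytes of each component of the fitted model, largest first.
sizes <- sapply(unclass(fit), function(component) as.numeric(object.size(component)))
print(sort(sizes, decreasing = TRUE))

# Refit without storing the QR decomposition.
fit_small <- lm(mpg ~ wt + hp, data = mtcars, qr = FALSE)
print(object.size(fit))
print(object.size(fit_small))
```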

Tomas Kalibera of Northeastern University has been working on a tool for finding PROTECT bugs in the R internals code. I last spoke to Tomas in 2013 when he was working with Jan Vitek on the alternate R engine, FastR. See Fearsome Engines part 1, part 2, part 3. Since then FastR has become a purely Oracle project (more on that in a moment), and the Purdue University fork of FastR has been retired.

The exact details of the PROTECT macro went a little over my head, but the essence is that it is used to stop objects being garbage collected while C code is still using them, and it’s a huge source of obscure bugs in R.

Lukas Stadler of Oracle Labs is the heir to the FastR throne. The Oracle Labs team have rebuilt it on top of Truffle, an Oracle product for generating dynamically optimized Java bytecode that can then be run on the JVM. Truffle’s big trick is that it can auto-generate this bytecode for a variety of languages: R, Ruby and JavaScript are the officially supported languages, with C, Python and Smalltalk as side-projects. Lukas claimed that peak performance (that is, “on a good day”) for Truffle-generated code is comparable to language-specific optimized code.

Non-vectorised code is the main beneficiary of the speedup. He had a cool demo where a loopy version of the sum function ran slowly, then Truffle learned how to optimise it, and the result became almost as fast as the built-in sum function.
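The demo code itself isn’t reproduced here, but a loopy sum of the kind described might look like the sketch below (loopy_sum is my own name for it). In GNU R the loop is much slower than the built-in; timings will vary by machine and R version.

```r
# A deliberately non-vectorised sum: one addition per loop iteration.
loopy_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

x <- runif(1e6)

# Both versions should agree up to floating point error.
print(all.equal(loopy_sum(x), sum(x)))

# Compare how long each takes.
print(system.time(loopy_sum(x)))
print(system.time(sum(x)))
```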

His complaint is that the R.h API from R to C is really an API from GNU R to C; that is, it makes too many assumptions about how GNU R works, and these don’t hold true when you are running a Java version of R.

Maarten-Jan Kallen from BeDataDriven works on Renjin, the other R interpreter built on top of the JVM. Based on his talk, and some other discussion with Maarten, it seems that there is a very clear mission statement for Renjin: BeDataDriven just want a version of R that runs really fast inside Google App Engine. They also have an interesting use case for Renjin – it is currently powering software for the United Nations’ humanitarian effort in Syria.

Back to the technical details, Maarten showed an example where R 3.1.0 introduced the anyNA function as a fast version of any(is.na(x)). In the case of Renjin, this isn’t necessary since it works quickly anyway. (Though if Luke Tierney’s talk comes true, it won’t be needed in GNU R soon either.)
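For anyone who hasn’t met anyNA, here is a quick sketch of the two spellings on made-up data. They give the same answer; the difference in GNU R is that any(is.na(x)) builds a full logical vector first, whereas anyNA can stop at the first missing value.

```r
# Made-up data with a single missing value at the end.
x <- c(rnorm(1e6), NA)

# The traditional spelling: allocates a logical vector as long as x.
slow <- any(is.na(x))

# The dedicated function (available in recent versions of GNU R).
fast <- anyNA(x)

print(identical(slow, fast))
print(system.time(any(is.na(x))))
print(system.time(anyNA(x)))
```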

Calling external code still remains a problem for Renjin; in particular Rcpp and its reverse dependencies won’t build for it. The spread of Rcpp, he lamented, even includes roxygen2.

Hannes Mühleisen has also been working with BeDataDriven, and completed Maarten’s talk. He previously worked on integrating MonetDB with R, and has been applying his database expertise to Renjin. In the same way that when you run a query in a database, it generates a query plan to try and find the most efficient way of retrieving your results, Renjin now generates a query plan to find the most efficient way to evaluate your code. That means using a deferred execution system where you avoid calculating things until the last minute, and in some cases not at all because another calculation makes them obsolete.
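GNU R already has a small-scale version of deferred execution in its promises, which makes for a handy illustration of the principle. This sketch uses base R’s delayedAssign and says nothing about how Renjin implements its planner; the variable names are made up.

```r
# Counter so we can see when work actually happens.
evaluations <- 0

# Nothing is computed here; R only records the expressions as promises.
delayedAssign("expensive", {
  evaluations <- evaluations + 1
  sum(sqrt(1:1e6))
})
delayedAssign("never_needed", {
  evaluations <- evaluations + 1
  Sys.sleep(5)
})

before <- evaluations

# The first use forces the promise, so the work happens now.
answer <- expensive

after <- evaluations

# never_needed is never touched, so its five second sleep never runs.
print(c(before = before, after = after))
```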

Karl Millar from Google has been working on CXXR. This was a bit of a shock to me. When I interviewed Andrew Runnalls in 2013, he didn’t really sell CXXR to me particularly well. The project goals at the time were to clean up the GNU R code base, rewriting it in modern C++, and documenting it properly to use as a reference implementation of R. It all seemed a bit of academic fun rather than anything useful. Since Google has started working on the project, the focus has changed. It is now all about having a high performance version of R.

I asked Andrew why he chose CXXR for this purpose. After all, of the half a dozen alternate R engines, CXXR was the only one that didn’t explicitly have performance as a goal. His response was that it has nearly 100% code compatibility with GNU R, and that the code is so clear that it makes it easy to make changes.

The talk focussed on some of the difficulties of optimizing R code. For example, in the assignment

a <- b + c

you don’t know how long b and c are, or what their classes are, so you have to spend a long time looking things up. At runtime however, you can guess a bit better. b and c are probably the same size and class as what you used last time, so guess that first.

He also had a little dig at Tomas Kalibera’s work, saying that CXXR has managed to eliminate almost all the PROTECT macros in its codebase.

Radford Neal talked about some optimizations in his pqR project, which uses both interpreted and byte-compiled code.

In interpreted code, pqR uses a “variant result” mechanism, which sounded similar to Renjin’s “deferred execution”.

A performance boost comes from having a fast interface to unary primitives. This also makes eval faster. Another boost comes from a smarter way to not look for variables in certain frames. For example, a list of which frames contain overrides for special symbols (+, [, if, etc.) is maintained, so calling them is faster.

Matt Dowle (“the data.table guy”) of H2O gave a nice demo of H2O Flow, a slick web-based GUI for machine learning on big datasets. It does lots of things in parallel, and is scriptable.

Indrajit Roy of HP Labs and Michael Lawrence of Genentech gave a talk on distributed data structures. These seem very good for cases where you need to access your data from multiple machines.

The SparkR package gives access to distributed data structures with a Spark backend; however, Indrajit wasn’t keen, saying that it is too low-level to be easy to work with.

Instead he and Michael have developed the dds package that gives a standard interface for using distributed data structures. The package sits on top of existing backends. The analogy is with DBI providing a standard database interface that uses RSQLite or RPostgreSQL underneath.

Ryan Hafen of Tessera talked about their product (I think also called Tessera) for analysing large datasets. It’s a fancy wrapper to MapReduce that also has distributed data objects. I didn’t get a chance to ask if they support the dds interface. The R packages of interest are datadr and trelliscope.

My own talk was less technical than the others today. It consisted of a series of rants about things I don’t like about R, and how to fix them. The topics included how to remove quirks from the R language (please deprecate indexing with factors), helping new users (let the R community create some vignettes to go in base-R), and how to improve CRAN (CRAN is not a code hosting service, CRAN is a shop for packages). I don’t know if any of my suggestions will be taken up, but my last slide seemed to generate some empathy.

I’ve named my CRAN submission process “The Workflow of Infinite Shame”. What tends to happen is that I check that things work on my machine, submit to CRAN, and about an hour later get a response saying “we see these errors, please fix”. Quite often, especially for things involving locales or writing files, I cannot reproduce the issue, so I fiddle about a bit and guess, then resubmit. After five or six iterations, I’ve lost all sense of dignity, and while R-core are very patient, I’m sure they assume that I’m an idiot.

CRAN currently includes a Win-builder service that lets you submit packages, builds them under Windows, then tells you the results. What I want is an everything-builder service that builds and checks my package on all the necessary platforms (Windows, OS X, Linux, BSD, Solaris on R-release, R-patched, and R-devel), and only if it passes does a member of R-core get to see the problem. That way, R-core’s time isn’t wasted, and more importantly I look like less of an idiot.

A flow diagram of CRAN submission steps with an infinite loop

The workflow of infinite shame encapsulates my CRAN submission process.
