Everyone loves R markdown and Github; stories from the R Summit, day two
More excellent talks today!
Andrie de Vries of Microsoft kicked off today’s talks with a demo of checkpoint. This is his package for assisting reproducibility by letting you install packages as they were on a specific date.
The idea is that a lot of R packages, particularly those from PhD projects, don’t get maintained and suffer bitrot. That means that they often don’t work with current versions of R and current packages.
Since I now work in a university, and our team is trying to make sure we release an accompanying package of the data analysis steps with every paper, having a system for reproducibility is important to me. I’ve played around with packrat, and it is nice but a bit too much effort to bother with on a day-to-day basis. I’ve not used checkpoint before, but from Andrie’s demo it seems a little easier. You just add these lines to the top of your script:
library(checkpoint)
checkpoint("2014-08-18", R.version = "3.0.3")
and it checks you are running the correct version of R, downloads the packages that your script is about to use from an archived snapshot of CRAN taken on the date you specified, and sets them up as a new library. Easy reproducibility.
Jeroen Ooms of UCLA (sort of) talked about his streaming suite of packages: curl, jsonlite and mongolite, for downloading web data, converting JSON to and from R objects, and working with MongoDB databases.
There have been other packages around for each of these three tasks, but the selling point is that Jeroen is a disgustingly talented coder and has written definitive versions.
jsonlite takes care to be consistent about the weird edge cases when converting between R and JSON. (The JSON spec doesn’t support infinity or NA, so you get a bit of control about what happens there, for example.)
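As a rough illustration of that control, here is a minimal sketch using the `na` argument of jsonlite’s `toJSON`. The input vector is made up for the example, and the printed output is deliberately not shown here, since it is best checked against your installed version of jsonlite.

```r
library(jsonlite)

# A made-up numeric vector containing values that JSON cannot represent
x <- c(1.5, NA, Inf)

# Ask for the special values to be written as JSON strings
as_string <- toJSON(x, na = "string")

# Ask for the special values to be written as JSON null instead
as_null <- toJSON(x, na = "null")

print(as_string)
print(as_null)
```

Which option is appropriate depends on what will read the JSON: strings keep the distinction between the special values, while null is easier for non-R consumers to handle.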
jsonlite also supports ndjson, where each line is a JSON object. This is important for large files: you can parse one line at a time and then return the whole thing as a list or data frame, or you can define line-specific behaviour.
This last case is used for streaming functionality. You use stream_in to read an ndjson file, then parse a line, manipulate the result, maybe stream_out to somewhere else, then move on to the next line. It made me wish I was doing fancy things with the Twitter firehose.
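The read, manipulate, write loop described above might look something like the following sketch. It uses jsonlite’s `stream_in` with a `handler` function and `stream_out`; the `events` data frame, the temporary files, and the filtering rule are all invented for the example and are not from Jeroen’s talk. Note that jsonlite hands the handler a page of records (a data frame) rather than a single line.

```r
library(jsonlite)

# Made-up example data: 25 records with an id and a score
events <- data.frame(id = 1:25, score = (1:25) %% 5)

# Write the records to a temporary file, one JSON object per line (ndjson)
infile <- tempfile(fileext = ".ndjson")
stream_out(events, file(infile), verbose = FALSE)

# Open the output connection once so each page is appended to the same file
outfile <- tempfile(fileext = ".ndjson")
out_con <- file(outfile, open = "wb")

# Read the input in pages of 10 records; the handler sees one page at a time
stream_in(
  file(infile),
  handler = function(page) {
    # Keep only the records of interest (an arbitrary rule for illustration)
    keep <- page[page$score == 0, , drop = FALSE]
    rownames(keep) <- NULL
    if (nrow(keep) > 0) {
      stream_out(keep, out_con, verbose = FALSE)
    }
  },
  pagesize = 10,
  verbose = FALSE
)
close(out_con)

# Read the filtered records back as a data frame
filtered <- stream_in(file(outfile), verbose = FALSE)
```

Because only one page is held in memory at a time, the same pattern should work for files far larger than this toy example; the `pagesize` value is a trade-off between memory use and per-page overhead.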
The MongoDB stuff also sounded interesting; I didn’t quite get a grasp of what makes it better than the other mongo packages, but he gave some examples of fast document searching.
Gabor Csardi of Harvard University talked about METACRAN, which is his spare time hobby.
One of the big sticking points in many people’s work in R is trying to find the best package to do something. There are a lot of tools you can use for this: Task Views, rdocumentation.org, crantastic.org, rseek.org, MRAN, the sos package, and so on. However, none of them are very good at recommending the best package for a given task.
This is where METACRAN comes in. Gabor gave a nice demo where he showed that when you search for “networks”, it successfully returns his igraph package. Um.
Jokes aside, it does seem like a very useful tool. His site also gives you information on trending packages.
He also mentioned another project, github.com/cran, a read-only mirror of CRAN, that lets you see what’s been updated when a new version of a package reaches CRAN. (Each package is a repository and each new version on CRAN counts as one commit.)
Peter Dalgaard of Copenhagen Business School, R-Core member and organiser of the summit, talked about R development conventions and directions.
He mentioned that many of the development principles that have shaped R were decided back in 1999, when R was desperately trying to gain credibility with organisations using SAS and Stata. This is, for example, why the base R distribution contains packages like nnet and spatial. While many users may not need to use neural networks or spatial statistics, it was important for the fledgeling language to be seen to have these capabilities built-in.
Peter said that some ubiquitous user-contributed packages are being considered for inclusion in the base distribution.
plyr in particular were mentioned. The balancing act is that more effort would go into ensuring that these packages work with new versions of R, but it requires more effort from R-Core, which is a finite resource.
Peter also said that R-Core worry about whether their traditional approach of being very conservative with the code base is too strict and slows R’s development, and whether they should be more aggressive about removing quirks. This last point was referencing my talk from yesterday, so I was pleased that he had been listening.
Joe Rickert of Microsoft talked about the R community. He pointed out that “I use Excel, you use Excel, we have so much in common” is a conversation no-one has ever had. Joe thinks that “community” is mostly meaningless marketing speak, but R has it for real. I’m inclined to agree.
He talked about the R Consortium, which I’ve somehow failed to hear about before. The point of the organisation is (mostly) to help build R infrastructure projects. The first big project is [REDACTED!], though official details are still secret, so there’s a bit of reading between the lines.
Users will be allowed to submit proposals for things for the consortium to build, and they get voted on (not sure if this is by users or the consortium members).
Joe also had a nice map of R User groups around the world, though my nearest one was several countries away. I guess I’d better start one of my own. If anyone in Qatar is interested in an R User Group, let me know in the comments.
Bettina Grün of Johannes Kepler Universität talked about the R Journal. She discussed the history of the journal, and the topics that you can write about: packages, programming hints, and applications of R. There is also some content about changes in R and CRAN, and conference announcements.
One thing I didn’t get to ask her is how you blind a review of a paper about a package, since the package author is usually pretty easy to determine. My one experience of reviewing for the R Journal involved a paper about an update to the grid package, and included links to content on the University of Auckland website, and it was pretty clear that the only person that could have written it was Paul Murrell.
Mine Çetinkaya-Rundel of Duke University gave an impressive overview of how she teaches R to her undergraduate students. She suggested that while teaching programming at the same time as data analysis seems like it ought to make it harder, running a few lines of code often takes less instruction than telling students where to point and click.
Her other teaching tips included: work on datasets that are big enough to make working in Excel annoying, so they appreciate programming (and R) more; use interactive examples; you get more engagement with real-world datasets; and force the students to learn a reproducible workflow by making them write R markdown documents.
Mine also talked briefly about her other projects: Datafest, which is a weekend-long data analysis competition, and reach, a Coursera data analysis course.
Jenny Bryan of the University of British Columbia had another talk about teaching R, this time to grad students. She’s developed a pretty slick workflow where each student submits their assignments (also R markdown documents) into GitHub repos, which makes it really easy to check and run their code, comment on it, give them hints via pull requests, and let them peer review each other’s code. Since using git seems to be an essential skill for data scientists these days, it seems like a good idea to explicitly teach it to them while at university.
Jenny also talked about methods of finding interesting code via GitHub search. An example she gave was that if you want to see how vapply works, rather than just limiting yourself to the examples on the ?vapply help page, you can search all the packages on CRAN by going to Gabor’s GitHub CRAN mirror (or maybe Winston Chang’s R-source mirror) and typing vapply user:cran extension:R.
Karthik Ram of the University of California, Berkeley, headlined the day, talking about his work with ROpenSci, which is an organisation that creates open tools for data analysis (mostly R packages).
They have a ridiculously extensive set of packages for downloading online datasets, retrieving text corpuses, publishing your results, and working with spatial data.
ROpenSci also host regular community events and group phone calls.
Overall, it was an exciting day, and now I’m looking forward to going to the useR conference. See you in Aalborg!