New version of assertive and answers to tutorial exercises

16th July, 2015

I gave a tutorial at useR on testing R code, which turned out to be a great way of getting feedback on my code! Based on the suggestions by attendees, I’ve made a big update to the package, which is now on CRAN. Full details of the new features can be accessed on the ?changes help page within the package.

Also, the slides, exercises and answers from the tutorial are now available online.


The state of assertions in R

3rd July, 2015

“Assertion” is computer-science jargon for a run-time check on your code. In R, this typically means function argument checks (“did they pass a numeric vector rather than a character vector into your function?”), and data quality checks (“does the date-of-birth column contain values in the past?”).
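As a rough illustration of those two kinds of check, here is a minimal base-R sketch. The function names geometric_mean and check_dates_in_past are made up for this example and do not come from any of the packages discussed below.

```r
# Argument check: fail early with a clear message if x is not numeric.
geometric_mean <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be a numeric vector, not ", class(x)[1], ".")
  }
  exp(mean(log(x)))
}

# Data quality check: every (non-missing) date of birth should be in the past.
check_dates_in_past <- function(dates) {
  bad <- !is.na(dates) & dates >= Sys.Date()
  if (any(bad)) {
    stop(sum(bad), " date(s) are not in the past.")
  }
  invisible(dates)
}

gm <- geometric_mean(c(1, 10, 100))

dob <- as.Date(c("1980-05-17", "1999-12-31"))
check_dates_in_past(dob)
```

Calling geometric_mean("a") or passing tomorrow's date to check_dates_in_past stops with an error instead of quietly producing a wrong answer, which is the whole point of an assertion.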

The four packages

R currently has four packages for assertions: assertive, which is mine; assertthat by Hadley Wickham; assertr by Tony Fischetti; and ensurer by Stefan Bache.

Having four packages feels like too many; we’re duplicating effort, and it makes package choice too hard for users. I didn’t know about the existence of assertr or ensurer until a couple of days ago, but the useR conference has helped bring these rivals to my attention. I’ve chatted with the authors of the other three packages to see if we can streamline things a little.

Hadley said that assertthat isn’t a high priority for him (dplyr, ggplot2 and tidyr, among many others, are more important), so he’s not going to develop it further. Since assertthat is mostly a subset of assertive anyway, this shouldn’t be a problem. I’ll take a look at how easy it is to provide an assertthat API, so existing users can have a direct replacement.

Tony said that the focus of assertr is predominantly data checking. It only works with data frames, and has a more limited remit than assertive. He plans to change the backend to be built on top of assertive. That is, assertr will be an assertive extension that makes it easy to apply assertions to multiple columns in data frames.

Stefan has stated that he prefers to keep ensurer separate, since it has a different philosophical stance to assertive, and I agree. ensurer is optimised for being lightweight and elegant; assertive is optimised for clarity of user code and clarity of error messages (at a cost of some bulk).

So overall, we’re down from four distinct assertion packages to two groups (assertive/assertr and ensurer). This feels sensible. It’s the optimum number for minimizing duplication while still having some competition to spur development onwards.

The assertive development plan

ensurer has one feature in particular that I definitely want to include in assertive: you can create type-safe functions.

The question of bulk has also been playing on my mind for a while. It isn’t huge by any means – the tar.gz file for the package is 836kB – but the number of functions can make it a little difficult for new users to find their way around. A couple of years ago when I was working with a lot of customer data, I included functions for checking things like the validity of UK postcodes. These are things that I’m unlikely to use at all in my current job, so it seems superfluous to have them. That means that I’d like to make assertive more modular. The core things should be available in an assertive.base package, with specialist assertions in additional packages.

I also want to make it easier for other package developers to include their own assertions in their packages. This will require a bit of rethinking about how the existing assertion engine works, and what internal bits I need to expose.

One bit of feedback I got from the attendees at my tutorial this week was that for simulation usage (where you call the same function millions of times), assertions can slow down the code too much. So a way to turn off the assertions (but keep them there for debugging purposes) would be useful.
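One way such a switch could work is a global option that every assertion consults before doing any checking. This is only a sketch of the idea, not how assertive does or will do it; the option name myassert.enabled and the helper functions are hypothetical.

```r
# Hypothetical option name; not part of any existing package.
assertions_enabled <- function() {
  isTRUE(getOption("myassert.enabled", default = TRUE))
}

assert_is_positive <- function(x) {
  # Skip the check entirely when assertions are switched off.
  if (assertions_enabled() && !isTRUE(all(x > 0))) {
    stop("x contains values that are not positive.")
  }
  invisible(x)
}

simulate_once <- function(rate) {
  assert_is_positive(rate)
  rexp(1, rate)
}

options(myassert.enabled = FALSE)   # fast path for a big simulation run
draws <- vapply(rep(2, 1000), simulate_once, numeric(1))
unchecked <- assert_is_positive(-1) # passes silently while switched off

options(myassert.enabled = TRUE)    # back on for debugging
checked <- try(assert_is_positive(-1), silent = TRUE)
```

The assertion calls stay in the code, so they can be switched back on when something looks wrong. The cost that remains when they are off is one option lookup per call.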

The top feature request, however, was for pipe compatibility. Stefan’s magrittr package has rocketed in popularity (I’m a huge fan), so this definitely needs implementing. It should be a small fix, so I should have it included soon.
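The essential requirement for pipe compatibility is that an assertion returns its input when the check passes, so the value can flow on to the next step. A minimal sketch, using a made-up assertion and the native |> pipe (which assumes R 4.1 or later; magrittr's %>% chains in the same way):

```r
# Hypothetical assertion: errors on failure, otherwise hands x back unchanged.
assert_all_positive <- function(x) {
  if (!isTRUE(all(x > 0))) {
    stop("x contains values that are not positive.")
  }
  invisible(x)
}

# The assertion sits in the middle of the chain without interrupting it.
result <- c(1, 4, 9) |>
  assert_all_positive() |>
  sqrt() |>
  sum()
```

If the assertion returned TRUE or NULL instead of its input, the sqrt() step would receive the wrong value, which is why the return value matters here.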

There are some other small fixes like better NA handling and a better error message for is_in_range that I plan to make soon.

The final (rather non-trivial) feature I want to add to assertive is support for error messages in multiple languages. The infrastructure is in place for translations (it currently supports both the languages that I know: British English and American English); I just need some people who can speak other languages to do the translations. If you are interested in translating, drop me an email or let me know in the comments.

From cats to zombies, Wednesday at useR2015

1st July, 2015

The morning opened with a speaker who I was too bleary-eyed to identify. Possibly the dean of the University of Aalborg. Anyway, he said that this is the largest ever useR conference, and the first ever in a Nordic country. Take that, Norway! Also, considering that there are now quite a few R-based conferences (Bioconductor has its own conference, not to mention R in Finance and EARL), it’s impressive that these haven’t taken away from the main event.

Torben the conference organiser then spoke briefly and mentioned that planning for this event started back in June 2013.

Keynote

Romain Francois gave a talk of equal parts making-R-go-faster, making-R-syntax-easier, and cat-pix. He quipped that he has an open relationship with R: he gets to use Java and C++, and “I’m fine with other people using R”. Jokes aside, he gave an overview of his big R achievements: the new and J functions for rJava that massively simplified the calling syntax; the // [[Rcpp::export]] attribute that massively simplified writing Rcpp code; and the internals of the dplyr package.

He also gave a demo of JJ Allaire’s RcppParallel, and talked about plans to integrate that into dplyr for a free performance boost on multicore systems.

I also had a coffee-break chat with mathematician Luzia Burger-Ringer (awesome name), who has recently started R programming after a seventeen year career break to raise children. She said:

“When I returned to work I struggled to remember my mathematics training, but using R I could be productive. Compared to Fortran it’s just a few lines of code to get an answer.”

Considering that I’ve forgotten pretty much everything I know by the time I’ve returned from vacation, I’m impressed by Luzia’s ability to dive in after a 17 year break. And I think this is good counter-evidence to R’s perceived tricky learning curve. Try fitting a random forest model in Fortran!

After being suitably caffeinated, I went to the interfacing session discussing connecting R to other languages.

Interfacing

Kasper Hansen had some lessons for integrating external libraries into R packages. He suggested two approaches:

“Either you link to the library, maybe with a function to download that library – this is easiest for the developer; or you include the library in your package – this is easiest for the user”.

He said he’s mostly gone for the latter approach, but said that cross-platform development in this way is mostly a bit of a nightmare.

Kasper gave examples of the illuminaio package, for reading some biological files with no defined specification, some versions of which were encrypted; the affxparser package for reading Affymetrix RNA sequence files, which didn’t have proper OS-independent file paths; and Rgraphviz, which connects to the apparently awfully implemented Graphviz network visualization software. There were many tales of death-by-memory-leak.

In the discussion afterwards it was interesting to note the exchange between Kasper and Dirk Eddelbuettel. Dirk suggested that Kasper was overly negative about the problems of interfacing with external libraries because he’d had the unfortunate luck to deal with many bad-but-important ones, whereas in general you can just pick good libraries to work with.

My opinion is that Kasper had to pick libraries built by biologists, and my experience is that biologists are generally better at biology than software development (to put it politely).

Christophe Best talked about calling R from Go. After creating the language, Google seem to be making good internal use of Go. And as a large organisation, they suffer from the different-people-writing-different-languages problem quite acutely. Consequently, they have a need for R to plug modelling gaps in their fledgeling systems language.

Their R-Go connector runs R in a different process to Go (unlike Rcpp, which uses an intra-process system, according to Christophe). This is more complex to set up, but means that “R and Go don’t have shared crashes”.

It sounds promising, but for the moment, you can only pass atomic types and lists. Support for data frames is planned, as is support for calling Go from R, so this is a project to watch.

Matt Ziubinski talked about libraries to help you work with Rcpp. He recommended Catch, a testing framework for C++. The code for this looked pretty readable (even to me, who hasn’t really touched C++ in over a decade).

He also recommended Boost, which allows compile-time calculations, easy parallel processing, and pipes.

He was also a big fan of C++11, which simplifies a lot of boilerplate coding.

Dan Putler talked about connecting to Spark’s MLlib package for machine learning. He said that connecting to the library was easy, but then they wondered why they had bothered! Always fun to see some software being flamed.

Apparently the regression tools in Spark MLlib don’t hold a candle to R’s lm and glm. They may not be fancy functions, but they’ve been carefully built for robustness.

After some soul-searching, Dan decided that Spark was still worth using, despite the weakness of MLlib, since it nicely handles distributing your data.

He and his team have created a SparkGLM package (https://github.com/AlteryxLabs/sparkGLM) that ports R’s linear regression algorithms to Spark. lm is mostly done; glm is work-in-progress.

After lunch, I went to the clustering session.

Clustering

Anders Bilgram kicked off the afternoon session with a talk on unsupervised meta-analysis using Gaussian mixed copula models. Say that ten times fast.

He described this as a semi-parametric version of the more standard Gaussian mixed models. I think he meant this as in “mixture models”, where you consider your data to consist of things from several different distributions, rather than mixed effects models, where you have random effects.

The Gaussian copula bit means that you have to transform your data to be normally distributed first, and he recommended rank normalization for that.

(We do that in proteomics too; you want qnorm(rank(x) / (length(x) + 1)), and yeah, that should be in a package somewhere.)
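Wrapped up as a function, the one-liner above might look like the sketch below. The name rank_normalize is made up here, and the sketch assumes a numeric vector with no missing values.

```r
# Rank-based normalization: map ranks to (0, 1), then to normal quantiles.
# Dividing by length(x) + 1 keeps the probabilities strictly inside (0, 1),
# so qnorm never returns -Inf or Inf.
rank_normalize <- function(x) {
  qnorm(rank(x) / (length(x) + 1))
}

set.seed(1)
x <- rexp(1000)          # heavily right-skewed input
z <- rank_normalize(x)   # roughly standard normal output

# The transformation is monotone, so the ordering of the data is preserved.
same_order <- identical(order(x), order(z))
```

Only the ranks survive the transformation, which is what makes it robust to the shape of the original distribution; the price is that information about the actual spacing of the values is thrown away.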

Anders gave a couple of nice examples: he took a 1.4Mpx photo of the space shuttle and clustered it by pixel color, and clustered the ranks of a replicated gene study.

He did warn that he hadn’t tested his approach with high-dimensional data though.

Claudia Beleites, who asked the previous question about high-dimensional data, went on to talk about hierarchical clustering of (you guessed it) high-dimensional data. In particular, she was looking at the results of vibrational spectroscopy. This looks at the vibrations of molecules, in this case to try to determine what some tissue consists of.

The data is a big 3D-array: two dimensional images at lots of different spectral frequencies.

Claudia had a bit of a discussion about k-means versus hierarchical modelling. She suggested that the fact that k-means often overlooks small clusters, and the fact that you need to know the number of clusters in advance, meant that it was unsuitable for her datasets. The latter point was vigorously debated after the talk, with Martin Maechler arguing that for k-means analyses, you just try lots of values for the number of clusters, and see what gives you the best answer.
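The "try lots of values" approach can be sketched with base R's kmeans on simulated data. The three-blob dataset and the choice of total within-cluster sum of squares as the score are assumptions for this example, not details from the talk.

```r
set.seed(42)

# Simulated data: three well-separated blobs in two dimensions.
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centres[i, 1], 0.5), rnorm(50, centres[i, 2], 0.5))
}))

# Fit k-means for a range of cluster counts and record the fit quality.
ks <- 1:6
total_withinss <- vapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
}, numeric(1))

# Look for the "elbow": the k beyond which the improvement becomes small.
elbow_table <- data.frame(k = ks, total_withinss = round(total_withinss, 1))
elbow_table
```

For data like this, the within-cluster sum of squares should fall steeply up to the true number of clusters and flatten afterwards. On real data the elbow is often much less obvious, which is part of what the debate was about.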

Anyway, Claudia had been using hierarchical clustering, and running into problems with calculation time because she has fairly big datasets and hierarchical clustering takes O(n^2) to run.

Her big breakthrough was to notice that you get more or less the same answer clustering on images, or clustering on spectra, and clustering on spectra takes far less time. She had some magic about compressing the information in spectra (peak detection?) but I didn’t follow that too closely.

Silvia Liverani talked about profile regression clustering and her PReMiuM package. She clearly has better accuracy with the shift key than I do.

Anyway, she said that if you have highly correlated variables (body mass and BMI was her example), it can cause instability and general bad behaviour in your models.

Profile regression models were her solution to this, and she described them as “Bayesian infinite mixture models”, but the technical details went over my head.

The package has support for normal/Poisson/binomial/categorical/censored response variable, missing values, and spatial correlations, so it sounds fully featured.

Silvia said it’s written in C++, but runs MCMC underneath, so that makes it medium speed.

I then dashed off to the Kaleidoscope session to hear about Karl Broman’s socks.

Kaleidoscope2

Rasmus Bååth talked about using approximate Bayesian computation to solve the infamous Karl Broman’s socks problem. The big selling point of ABC is that you can calculate stuff where you have no idea how to calculate the maximum likelihood. Anyway, I mostly marvelled at Rasmus’s ability to turn a silly subject into a compelling statistical topic.

Keynote

Adrian Baddeley gave a keynote on spatial statistics and his work with the spatstat package. He said that in 1990 when work began on the S version of spatstat, the field of spatial statistics was considered a difficult domain to work in.

“In 1990 I taught that likelihood methods for spatial statistics were infeasible, and that time-series methods were not extensible to spatial problems.”

Since then, the introduction of MCMC, composite likelihood and non-parametric moments have made things easier, but he gave real credit to the R language for pushing things forward.

“For the first time, we could share code easily to make cumulative progress”

One persistent problem in spatial stats was how to deal with edge corrections. If you sample points inside a rectangular area, and try to calculate the distance from each one to its nearest neighbour, then points near the edge appear to be further from their neighbours than they really are, because the true nearest neighbour may lie outside the box, where you didn’t sample.
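The edge effect is easy to reproduce in a small simulation. The sketch below uses uniformly scattered points and a central sampling window, both of which are assumptions chosen for illustration; it does not use spatstat or any of its correction methods.

```r
set.seed(1)
n <- 1000
pts <- cbind(runif(n), runif(n))   # points scattered over the unit square

# Pretend we only sampled the central box; everything outside is unobserved.
in_window <- pts[, 1] > 0.25 & pts[, 1] < 0.75 &
  pts[, 2] > 0.25 & pts[, 2] < 0.75

d <- as.matrix(dist(pts))
diag(d) <- Inf                     # a point is not its own neighbour

# True nearest-neighbour distance: search among all points.
nn_true <- apply(d[in_window, , drop = FALSE], 1, min)

# Observed nearest-neighbour distance: search only inside the window.
nn_observed <- apply(d[in_window, in_window, drop = FALSE], 1, min)

# The observed distance can never be shorter than the true one.
bias <- mean(nn_observed - nn_true)
```

The bias comes entirely from points near the boundary of the window. Edge corrections are different recipes for adjusting or down-weighting those points.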

Apparently large academic wars were fought in the 1980s and early 90s over how best to correct for the edge effects, until R made it easy to compare methods and everyone realised that there wasn’t much difference between them.

Adrian also talked about pixel logistic regression as being a development made by the spatstat team, where you measure the distance from each pixel in an image to a response feature, then do logistic regression on the distances. This turned out to be equivalent to a Poisson point process.

He also said that the structure of R models helped to generate new research questions. The fact that you are supposed to implement residuals and confint and influence functions for every model meant that they had to invent new mathematics to calculate them.

Adrian concluded with the idea that we should seek a grand unification theory for statistics to parallel the attempt to reconcile relativity and quantum physics. Just as several decades ago lm and glm were considered separate classes of model, but today are grouped together, one day we might reconcile frequentist and Bayesian stats.
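The lm/glm unification is visible directly in R: an ordinary linear model is the special case of a generalized linear model with a Gaussian family and identity link. A small check using the built-in mtcars data (the choice of dataset and formula is arbitrary):

```r
# The same model fitted two ways.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())

# The coefficient estimates should agree up to numerical tolerance.
same_coefs <- all.equal(coef(fit_lm), coef(fit_glm))
same_coefs
```

The two functions take different computational routes (a single least-squares solve versus iteratively reweighted least squares), but they are fitting the same statistical model.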

Lightning talks

These are 5 minute talks.

Rafaël Coudret described an algorithm for SAEM. It was a bit technical, and I didn’t grasp what the “SA” stood for, but apparently it works well when you can’t figure out how to write the usual Expectation Maximization.

Thomas Leeper talked about the MTurkR interface to Amazon’s Mechanical Turk. This lets you hire workers to do tasks like image recognition, modify and report on tasks, and even pay the workers, all without leaving the R command line.

In future, he wants to support rival services microWorkers and CrowdFunder too.

Luis Candanedo discussed modelling occupancy detection in offices, to save on the heating and electricity bills. He said that IR sensors are too expensive to be practical, so he tried using temperature, humidity, light and CO2 sensors to detect the number of people in the office, then used photographs to make it a supervised dataset.

Random forest models showed that the light sensors were best for predicting occupancy.

He didn’t mention it, but knowing how many hospital beds are taken up is maybe an even more important use case. Though you can probably just see who has been allocated where.

Dirk Eddelbuettel talked about his drat package for making local file systems or github (or possibly anywhere else) behave like an R repo.

Basically, it bugs him that if you use devtools::install_github, then you can’t do utils::update.packages on it afterwards, and drat fixes that problem.

Saskia Freytag talked about epilepsy gene sequencing. She had 6 datasets of gene expression data from children’s brains, and drew some correlation networks of them. (Actually, I’m not sure if they were correlation networks, or partial correlation networks, which seem to be more popular these days.)

Her idea was that true candidate genes for epilepsy should lie in the same networks as known epilepsy genes, thus filtering out many false positives.

She also had a Shiny interface to help her non-technical colleagues interact with the networks.

Soma Datta talked about teaching R as a first programming language to secondary school children. She said that many of them found C++ and Java to be too hard, and that R had a much higher success rate.

Simple things like having array indices start at one rather than zero, not having to bother with semi-colons to terminate lines, and not having to declare variable types made a huge difference to the students’ ability to be productive.

Alan Friedman talked about Lotka’s Law, which states that a very small number of journal paper authors write most of the papers, and it quickly drops off so that 60% of journal authors only write one paper.
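The 60% figure follows from the classical form of Lotka’s Law, in which the number of authors with n papers is proportional to 1/n^2. That exponent of 2 is an assumption about which version of the law was meant. A quick numerical check:

```r
# Share of authors writing exactly n papers, if counts are proportional to 1/n^2.
n <- 1:100000
share <- (1 / n^2) / sum(1 / n^2)

one_paper <- share[1]             # share of authors with a single paper
up_to_three <- sum(share[1:3])    # share with three papers or fewer

# The exact limit for the single-paper share is 6 / pi^2.
c(one_paper = one_paper, limit = 6 / pi^2)
```

The normalising constant is the sum of 1/n^2 over all n, which is pi^2 / 6, so about 61% of authors write exactly one paper under this form of the law.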

He has an implementation package called LotkasLaw, which librarians might find useful.

Berry Boessenkool talked about extreme value stats. Apparently as the temperature increases, the median chance of precipitation does too. However when you look at the extreme high quantiles (> 99.9%) of the chance of precipitation, they increase up to a temperature of 25 degrees Celsius or so, then drop again.

Berry suggested that this was a statistical artefact of not having much data, and when he did a more careful extreme value analysis, the high-quantile probability of precipitation kept increasing with temperature, as the underlying physics suggested it should.

When he talked about precipitation, I’m pretty sure he meant rain, since my rudimentary meteorological knowledge suggests that the probability of sleet and snow drops off quite sharply above zero degrees Celsius.

Jonathan Arfa talked about his participation in a Kaggle competition predicting NCAA Basketball scores in a knockout competition called March Madness.

His team used results from the previous season’s league games, Las Vegas betting odds, a commercial team metric dataset, and the distance travelled to each game to try to predict the results.

He suggested that they could have done better if they’d used a Bayesian approach: if a poor team wins its first couple of games, you know it is better than your model predicts.
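The simplest version of that idea is a Beta-Binomial update of a team’s win probability. This is a toy sketch with made-up prior numbers, not the model Jonathan’s team built or proposed.

```r
# Hypothetical prior: the pre-tournament model rates the team at about a 30%
# win probability, held with the weight of roughly ten games of evidence.
prior_a <- 3   # prior "wins"
prior_b <- 7   # prior "losses"

# The team then wins its first two tournament games.
wins <- 2
losses <- 0

# Conjugate update: add observed wins and losses to the prior counts.
post_a <- prior_a + wins
post_b <- prior_b + losses

prior_mean <- prior_a / (prior_a + prior_b)
post_mean <- post_a / (post_a + post_b)

c(prior = prior_mean, posterior = post_mean)
```

The estimate moves from 0.3 towards the observed results without jumping all the way to 100%. How far it moves depends on how much weight the prior carries.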

Adolfo Alvarez gave a quick rundown of the different approaches for making your code go faster. No time for details, just a big list.

Vectorization, data.table and dplyr, do things in a database, try alternate R engines, parallelize stuff, use GPUs, use Hadoop and Spark, buy time on Amazon or Azure machines.
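To make the first item on that list concrete, here is a small comparison of an explicit loop with the equivalent vectorized expression. The functions are illustrative only; actual timings depend on the machine and R version, so none are quoted.

```r
x <- runif(1e5)

# Element-by-element loop.
squares_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i]^2
  }
  out
}

# Vectorized version: one call that operates on the whole vector.
squares_vec <- function(x) {
  x^2
}

same_result <- isTRUE(all.equal(squares_loop(x), squares_vec(x)))

# Compare the elapsed times on your own machine.
system.time(squares_loop(x))
system.time(squares_vec(x))
```

Both functions return the same values; the vectorized one pushes the loop down into compiled code, which is usually where the speed-up comes from.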

Karen Nielsen talked about predicting EEG data (those time series of your brain’s electrical activity) using regression spline mixed models. Her big advance was to include person and trial effects in the model, which was based on the lmer function.

Andrew Kniss talked about his rstats4ag.org website, which gives statistics advice for arable farmers. The stats are fairly basic (on purpose), tailored for use by crop farmers.

Richard Layton talked about teaching graphics. “As well as drawing ‘clear’ graphs, it is important to think about the needs of the audience”, he argued.

While a dot plot may be the best option, if your audience has never seen one before, it may be best to use a boxplot instead. (There are no questions in the lightning talks, so I didn’t get the chance to ask him if he would go so far as to recommend a 4D pie chart!)

One compelling example for considering the psychology of the audience was a mosaic plot of soldiers’ deaths in the (I think first) Iraq war. By itself, the plot evokes little emotion, but if you put a picture of a soldier dying next to it, it reminds you what the numbers mean.

Michael Höhle headlined today with a talk on zombie preparedness, filling in some of the gaps in the Zombie Survival Guide.

He explained that the most important thing was to track possible zombie outbreak metrics in order to get an early warning of a problem. He gave a good explanation of monitoring homicides by headshot and decapitation, then correcting for the fact that the civil servants reporting these numbers had gone on holiday.

His surveillance package can also be used for non-zombie related disease outbreaks.


How I made every tech company that I may ever want to work for in the future hate me, or “GO R Consortium!”

30th June, 2015

It turns out that when people tell you things, you should listen. Like when Joe Rickert of Microsoft says “this is not news, please don’t repeat what I’m about to say”, you should maybe take note and keep your mouth shut.

I’m not quite sure how I missed that, but I did. So on Sunday night I wrote a blog post about what happened at the R summit. And last night Gavin Simpson (@acfagls) tweeted me to say “what was this R Consortium that I’d mentioned in my post?”. I responded with what Joe had said: this is an organisation contributed to by some big tech companies that work with R, designed to fund R infrastructure projects. I also mentioned a conversation that I’d overheard about a possible replacement for R-forge built on github, that I guess might have been related. This was talk in a bar, so I hadn’t assumed it was top secret or likely true, and I made it clear I was only repeating gossip.

It turns out that despite me deleting the tweet and editing my blog post, gossiping spreads rather quickly on twitter (who’d have thought), and consequently the news ended up on Computer World. It could have been worse, I could have ended up on Infoworld.

Anyway, I spent this evening apologising to Joe Rickert and all the R Consortium members that I could find.

Fortunately, the R Consortium announced itself to the public today. And if we can move on from my idiocy, I’d like to explain why I think that the R Consortium is a big deal.

R infrastructure, by which I mean the tools that you use to write R code, publish it, and consume code by others, has traditionally been the responsibility of R-Core. R-Core, as well as developing R itself, maintain CRAN and the mailing lists, not to mention a good number of packages. In all my interactions with R-Core I’ve been very impressed. They are however limited by the fact that there are only 21 of them, which means that the user community outnumbers them by five orders of magnitude. There’s just a fundamental manpower bottleneck in what they can do.

In recent years, RStudio, OpenAnalytics, Revolution Analytics (now part of Microsoft) and Tibco have been working on creating better IDEs for R. (Three of those are part of R Consortium; I’m not sure whether OpenAnalytics intend to join or not.) github and Bitbucket, while not R-specific, have taken over the code management side of things. A load of projects have been made to get R running in places that it was never designed to go (I’m thinking Renjin for R-in-Google-App-Engine, and the projects for running R inside Oracle/MonetDB/SQL Server databases, but there are many more).

For publishing R documents, knitr has taken over the world. As well as RStudio’s RPubs facility, O’Reilly’s Atlas software lets you write in Markdown or AsciiDoc, meaning you can knit a book. I know, I’ve done it.

The trouble is, many of these projects are run by small teams in individual companies, and there hasn’t been a way to grow them into bigger projects. The costs of finding out what users want, and of communicating between groups, were too high.

R Consortium solves this in two ways. Firstly, it involves many of the big corporate players in R. (The R Foundation also gets at least one seat, I believe.) Having all these companies paying to sit at the same table increases the chance that they’ll speak to each other. From their point of view, they save costs by not having to implement everything themselves; for everyone else, we have the benefit of these projects being made publicly available.

The other genius move is to get ideas from the community about what to build. R has suffered a little bit from the open source “if you want something, build it yourself” attitude, so having a place where you can ask other people to build things for you sounds good.

I have really high hopes for the R Consortium, and I’ll be following what they do closely. Assuming I haven’t been blacklisted by them all!*

*Please don’t let me have been blacklisted by them all.


Everyone loves R markdown and Github; stories from the R Summit, day two

28th June, 2015

More excellent talks today!

Andrie de Vries of Microsoft kicked off today’s talks with a demo of checkpoint. This is his package for assisting reproducibility by letting you install packages from a specific date.

The idea is that a lot of R packages, particularly those from PhD projects, don’t get maintained, and suffer bitrot. That means that they often don’t work with current versions of R and current packages.

Since I now work in a university, and our team is trying to make sure we release an accompanying package of the data analysis steps with every paper, having a system for reproducibility is important to me. I’ve played around with packrat, and it is nice but a bit too much effort to bother with on a day-to-day basis. I’ve not used checkpoint before, but from Andrie’s demo it seems a little easier. You just add these lines to the top of your script:

# Pin this script to a CRAN snapshot date and a specific R version.
library(checkpoint)
checkpoint("2014-08-18", R.version = "3.0.3")

and it checks you are running the correct version of R, downloads the packages that you are about to run, from an archived version of CRAN on the date you suggested, and sets them up as a new library. Easy reproducibility.

Jeroen Ooms of UCLA (sort of) talked about his streaming suite of packages: curl, jsonlite and mongolite, for downloading web data, converting JSON to and from R objects, and working with MongoDB databases.

There have been other packages around for each of these three tasks, but the selling point is that Jeroen is a disgustingly talented coder and has written definitive versions.

jsonlite takes care to be consistent about the weird edge cases when converting between R and JSON. (The JSON spec doesn’t support infinity or NA, so you get a bit of control about what happens there, for example.)

jsonlite also supports ndjson, where each line is a JSON object. This is important for large files: you can just parse one line at a time, then return the whole thing as a list or data frame, or you can define a line specific behaviour.

This last case is used for streaming functionality. You use stream_in to read an ndjson file, then parse a line, manipulate the result, maybe stream_out to somewhere else, then move on to the next line. It made me wish I was doing fancy things with the twitter firehose.
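A small sketch of that read-manipulate-write loop, using jsonlite’s stream_in and stream_out with a handler function. The data, the temporary file names and the filtering rule are invented for the example, and the whole block is skipped if jsonlite is not installed.

```r
if (requireNamespace("jsonlite", quietly = TRUE)) {
  tmp_in <- tempfile(fileext = ".ndjson")
  tmp_out <- tempfile(fileext = ".ndjson")

  # Write a small data frame as ndjson: one JSON object per line.
  readings <- data.frame(id = 1:5, value = c(2.5, 7, 0.1, 3.3, 9))
  jsonlite::stream_out(readings, file(tmp_in), verbose = FALSE)

  # Read it back a page (two lines) at a time. The handler keeps rows with
  # value > 3 and streams them straight out to a second file.
  con_out <- file(tmp_out, open = "wb")
  jsonlite::stream_in(file(tmp_in), handler = function(page) {
    keep <- page[page$value > 3, , drop = FALSE]
    jsonlite::stream_out(keep, con_out, verbose = FALSE)
  }, pagesize = 2, verbose = FALSE)
  close(con_out)

  # Without a handler, stream_in returns everything as one data frame.
  filtered <- jsonlite::stream_in(file(tmp_out), verbose = FALSE)
  filtered$id
}
```

With these made-up values, the rows with ids 2, 4 and 5 should survive the filter. Because only one page is in memory at a time, the same pattern should scale to files far larger than RAM.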

The MongoDB stuff also sounded interesting; I didn’t quite get a grasp of what makes it better than the other mongo packages, but he gave some examples of fast document searching.

Gabor Csardi of Harvard University talked about METACRAN, which is his spare time hobby.

One of the big sticking points in many people’s work in R is trying to find the best package to do something. There are a lot of tools you can use for this: Task Views, rdocumentation.org, crantastic.org, rseek.org, MRAN, the sos package, and so on. However, none of them are very good at recommending the best package for a given task.

This is where METACRAN comes in. Gabor gave a nice demo where he showed that when you search for “networks”, it successfully returns his igraph package. Um.

Jokes aside, it does seem like a very useful tool. His site also gives you information on trending packages.

He also mentioned another project, github.com/cran, a read-only mirror of CRAN, that lets you see what’s been updated when a new version of a package reaches CRAN. (Each package is a repository and each new version on CRAN counts as one commit.)

Peter Dalgaard of Copenhagen Business School, R-Core member and organiser of the summit, talked about R development conventions and directions.

He mentioned that many of the development principles that have shaped R were decided back in 1999, when R was desperately trying to gain credibility with organisations using SAS and Stata. This is, for example, why the base R distribution contains packages like nnet and spatial. While many users may not need to use neural networks or spatial statistics, it was important for the fledgeling language to be seen to have these capabilities built-in.

Peter said that some user contributed packages that are ubiquitous are being considered for inclusion into the base distribution. data.table, Rcpp and plyr in particular were mentioned. The balancing act is that more effort would go into ensuring that these packages work with new versions of R, but it requires more effort from R-Core, which is a finite resource.

Some other things that Peter talked about were that R-core worry about whether or not their traditional approach of being very conservative with the code base is too strict, and slows R’s development; and whether they should be more aggressive about removing quirks. This last point was referencing my talk from yesterday, so I was pleased that he had been listening.

Joe Rickert of Microsoft talked about the R community. He pointed out that “I use Excel, you use Excel, we have so much in common” is a conversation no-one has ever had. Joe thinks that “community” is mostly meaningless marketing speak, but R has it for real. I’m inclined to agree.

He talked about the R Consortium, which I’ve somehow failed to hear about before. The point of the organisation is (mostly) to help build R infrastructure projects. The first big project is [REDACTED!], though official details are still secret, so there’s a bit of reading between the lines.

Users will be allowed to submit proposals for things for the consortium to build, and they get voted on (not sure if this is by users or the consortium members).

Joe also had a nice map of R User groups around the world, though my nearest one was several countries away. I guess I’d better start one of my own. If anyone in Qatar is interested in an R User Group, let me know in the comments.

Bettina Grün of Johannes Kepler Universität talked about the R Journal. She discussed the history of the journal, and the topics that you can write about: packages, programming hints, and applications of R. There is also some content about changes in R and CRAN, and conference announcements.

One thing I didn’t get to ask her is how you blind a review of a paper about a package, since the package author is usually pretty easy to determine. My one experience of reviewing for the R Journal involved a paper about an update to the grid package, and included links to content on the University of Auckland website, and it was pretty clear that the only person that could have written it was Paul Murrell.

Mine Çetinkaya-Rundel of Duke University gave an impressive overview of how she teaches R to her undergraduate students. She suggested that while teaching programming at the same time as data analysis seems like it ought to make it harder, running a few lines of code often takes less instruction than telling students where to point and click.

Her other teaching tips included: work on datasets that are big enough to make working in Excel annoying, so they appreciate programming (and R) more; use interactive examples; you get more engagement with real-world datasets; and force the students to learn a reproducible workflow by making them write R markdown documents.

Mine also talked briefly about her other projects: Datafest, which is a weekend-long data analysis competition, and reach, a Coursera data analysis course.

Jenny Bryan of the University of British Columbia had another talk about teaching R, this time to grad students. She’s developed a pretty slick workflow where each student submits their assignments (also R markdown documents) into GitHub repos, which makes it really easy to check and run their code, comment on it, give them hints via pull requests, and let them peer review each other’s code. Since using git seems to be an essential skill for data scientists these days, it seems like a good idea to teach it explicitly while they are at university.

Jenny also talked about methods of finding interesting code via GitHub search. An example she gave was that if you want to see how vapply works, rather than limiting yourself to the examples on the ?vapply help page, you can search all the packages on CRAN by going to Gabor’s GitHub CRAN mirror (or maybe Winston Chang’s R-source mirror) and typing vapply user:cran extension:R.

Karthik Ram of the University of California, Berkeley, headlined the day, talking about his work with ROpenSci, which is an organisation that creates open tools for data analysis (mostly R packages).

They have a ridiculously extensive set of packages for downloading online datasets, retrieving text corpuses, publishing your results, and working with spatial data.

ROpenSci also host regular community events and group phone calls.

Overall, it was an exciting day, and now I’m looking forward to going to the useR conference. See you in Aalborg!

The Workflow of Infinite Shame, and other stories from the R Summit

27th June, 2015 5 comments

At day one of the R Summit at Copenhagen Business School there was a lot of talk about the performance of R, and alternate R interpreters.

Luke Tierney of the University of Iowa, the author of the compiler package and an R-Core member who has been working on R’s performance since, well, pretty much since R was created, talked about future improvements to R’s internals.

Plans to improve R’s performance include implementing proper reference counting (that is, tracking how many variables point at a particular bit of memory; the current version counts like zero/one/two-or-more, and a more accurate count means you can do less copying). Improving scalar performance and reducing function overhead are high priorities for performance enhancement. Currently when you do something like

for(i in 1:100000000) {}

R will allocate a vector of length 100000000, which takes a ridiculous amount of memory. By being smart and realising that you only ever need one number at a time, you can store the vector much more efficiently. The same principle applies for seq_len and seq_along.
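To put a rough number on that, here is a back-of-the-envelope sketch of my own (it isn't from Luke's talk). It assumes 4 bytes per element, which is what R uses for integer vectors, and the while loop is just one way of showing that the loop body only ever needs a single counter value at a time. The variable names are made up for the illustration.

```r
# Assumption: an integer vector costs 4 bytes per element (plus a small header).
n_iterations <- 100000000
approx_megabytes <- n_iterations * 4 / 2^20  # cost of materialising 1:n_iterations

# The same kind of loop written so that only one counter exists at any time.
# A much smaller n keeps the demonstration quick.
n <- 100000
i <- 0
total <- 0
while (i < n) {
  i <- i + 1
  total <- total + i
}
```

The while loop is slower to type and to run than the for loop, which is exactly why it would be nicer if the interpreter did the clever storage for you.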

Other possible performance improvements that Luke discussed include having a more efficient data structure for environments, and storing the results of complex objects like model results more efficiently. (How often do you use that qr element in an lm model anyway?)
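If you have never looked inside an lm object, this little sketch shows the qr element in question. It uses the built-in mtcars dataset purely for convenience, and the size comparison is only indicative; the exact byte counts will depend on your version of R.

```r
# Fit a small linear model on a built-in dataset.
fit <- lm(mpg ~ wt, data = mtcars)

# The fitted object carries a QR decomposition alongside the coefficients.
component_names <- names(fit)
has_qr <- "qr" %in% component_names

# Compare the size of the qr element with the size of the whole model object.
qr_bytes <- as.numeric(object.size(fit$qr))
model_bytes <- as.numeric(object.size(fit))
```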

Tomas Kalibera of Northeastern University has been working on a tool for finding PROTECT bugs in the R internals code. I last spoke to Tomas in 2013 when he was working with Jan Vitek on the alternate R engine, FastR. See Fearsome Engines part 1, part 2, part 3. Since then FastR has become a purely Oracle project (more on that in a moment), and the Purdue University fork of FastR has been retired.

The exact details of the PROTECT macro went a little over my head, but the essence is that it is used to stop memory being overwritten, and it’s a huge source of obscure bugs in R.

Lukas Stadler of Oracle Labs is the heir to the FastR throne. The Oracle Labs team have rebuilt it on top of Truffle, an Oracle product for generating dynamically optimized Java bytecode that can then be run on the JVM. Truffle’s big trick is that it can auto-generate this bytecode for a variety of languages: R, Ruby and JavaScript are the officially supported languages, with C, Python and Smalltalk as side-projects. Lukas claimed that peak performance (that is, “on a good day”) for Truffle-generated code is comparable to language-specific optimized code.

Non-vectorised code is the main beneficiary of the speedup. He had a cool demo where a loopy version of the sum function ran slowly, then Truffle learned how to optimise it, and the result became almost as fast as the built-in sum function.
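I didn't copy down the code from the demo, so the following is only my guess at what a “loopy” sum looks like, with a hypothetical name (loopy_sum). It runs in ordinary GNU R and just confirms that the loop and the built-in agree; the interesting part in the demo was how the timings changed once Truffle had warmed up.

```r
# A deliberately non-vectorised sum: one addition per loop iteration.
loopy_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

x <- as.numeric(1:100000)
loop_result <- loopy_sum(x)
builtin_result <- sum(x)

# To compare speeds yourself, try system.time(loopy_sum(x)) and system.time(sum(x)).
```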

He has a complaint that the R.h API from R to C is really an API from GNU R to C. That is, it makes too many assumptions about how GNU R works, and these don’t hold true when you are running a Java version of R.

Maarten-Jan Kallen from BeDataDriven works on Renjin, the other R interpreter built on top of the JVM. Based on his talk, and some other discussion with Maarten, it seems that there is a very clear mission-statement for Renjin: BeDataDriven just want a version of R that runs really fast inside Google App Engine. They also have an interesting use case for Renjin – it is currently powering software for the United Nations’ humanitarian effort in Syria.

Back to the technical details, Maarten showed an example where R 3.1.0 introduced the anyNA function as a fast version of any(is.na(x)). In the case of Renjin, this isn’t necessary since it works quickly anyway. (Though if Luke Tierney’s talk comes true, it won’t be needed in GNU R soon either.)
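For anyone who hasn't met anyNA, the two forms give the same answer; the difference is in how much work is done to get there. A minimal sketch, with variable names of my own choosing:

```r
x <- c(1, 2, NA, 4)

# any(is.na(x)) builds a full logical vector the same length as x, then scans it.
slow_check <- any(is.na(x))

# anyNA(x) answers the same question without needing that intermediate vector.
fast_check <- anyNA(x)
```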

Calling external code still remains a problem for Renjin; in particular Rcpp and its reverse dependencies won’t build for it. The spread of Rcpp, he lamented, even includes roxygen2.

Hannes Mühleisen has also been working with BeDataDriven, and completed Maarten’s talk. He previously worked on integrating MonetDB with R, and has been applying his database expertise to Renjin. In the same way that when you run a query in a database, it generates a query plan to try and find the most efficient way of retrieving your results, Renjin now generates a query plan to find the most efficient way to evaluate your code. That means using a deferred execution system where you avoid calculating things until the last minute, and in some cases not at all because another calculation makes them obsolete.
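Renjin does this inside the interpreter, and I haven't seen how it is implemented. The basic idea of deferring work until someone asks for the result can be sketched in plain R with a closure, though. Everything here (make_deferred, value, is_evaluated) is a hypothetical name invented for the illustration, not part of Renjin.

```r
# Wrap a computation so that it only runs when its value is first requested.
make_deferred <- function(compute) {
  cached <- NULL
  evaluated <- FALSE
  list(
    value = function() {
      if (!evaluated) {
        cached <<- compute()   # do the real work exactly once
        evaluated <<- TRUE
      }
      cached
    },
    is_evaluated = function() evaluated
  )
}

deferred_total <- make_deferred(function() sum(1:10))
before <- deferred_total$is_evaluated()  # nothing has been calculated yet
result <- deferred_total$value()         # the calculation happens here
after <- deferred_total$is_evaluated()
```

If value() is never called, the calculation never happens, which is the “in some cases not at all” part.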

Karl Millar from Google has been working on CXXR. This was a bit of a shock to me. When I interviewed Andrew Runnalls in 2013, he didn’t really sell CXXR to me particularly well. The project goals at the time were to clean up the GNU R code base, rewriting it in modern C++, and documenting it properly to use as a reference implementation of R. It all seemed a bit of academic fun rather than anything useful. Since Google has started working on the project, the focus has changed. It is now all about having a high performance version of R.

I asked Andrew why he chose CXXR for this purpose. After all, of the half a dozen alternate R engines, CXXR was the only one that didn’t explicitly have performance as a goal. His response was that it has nearly 100% code compatibility with GNU R, and that the code is so clear that it makes it easy to make changes.

That talk focussed on some of the difficulties of optimizing R code. For example, in the assignment

a <- b + c

you don’t know how long b and c are, or what their classes are, so you have to spend a long time looking things up. At runtime however, you can guess a bit better. b and c are probably the same size and class as what you used last time, so guess that first.
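To see why the interpreter can't know much in advance, here is the same expression doing three different jobs depending on what it is given. This is my own illustration rather than anything from the talk, and add is just a hypothetical wrapper around b + c.

```r
# The same expression, three different behaviours.
add <- function(b, c) b + c

integers <- add(1:3, 1L)                 # integer arithmetic, with c recycled
doubles <- add(c(0.5, 1.5), c(1, 2))     # double arithmetic, equal lengths
dates <- add(as.Date("2015-06-27"), 1)   # dispatches to the Date method for +
```

A runtime that remembers “last time these were two plain doubles of the same length” can try that case first and only fall back to the full lookup when the guess is wrong.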

He also had a little dig at Tomas Kalibera’s work, saying that CXXR has managed to eliminate almost all the PROTECT macros in its codebase.

Radford Neal talked about some optimizations in his pqR project, which uses both interpreted and byte-compiled code.

In interpreted code, pqR uses a “variant result” mechanism, which sounded similar to Renjin’s “deferred execution”.

A performance boost comes from having a fast interface to unary primitives. This also makes eval faster. Another boost comes from a smarter way to not look for variables in certain frames. For example, a list of which frames contain overrides for special symbols (+, [, if, etc.) is maintained, so calling them is faster.
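In case it isn't obvious why the interpreter has to look for overrides at all: R lets you redefine even + inside a function, so in principle every call has to check. A small sketch (the function names are mine):

```r
# A frame that overrides the + symbol.
with_override <- function() {
  `+` <- function(e1, e2) "overridden"
  1 + 2
}

# A frame with no override, so the usual primitive is found.
without_override <- function() {
  1 + 2
}

overridden_result <- with_override()
normal_result <- without_override()
```

Knowing up front that a frame contains no such override means the lookup can skip it.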

Matt Dowle (“the data.table guy”) of H2O gave a nice demo of H2O Flow, a slick web-based GUI for machine learning on big datasets. It does lots of things in parallel, and is scriptable.

Indrajit Roy of HP Labs and Michael Lawrence of Genentech gave a talk on distributed data structures. These seem very good for cases where you need to access your data from multiple machines.

The SparkR package gives access to distributed data structures with a Spark backend, however Indrajit wasn’t keen, saying that it is too low-level to be easy to work with.

Instead he and Michael have developed the dds package that gives a standard interface for using distributed data structures. The package sits on top of Spark.dds and distributedR.dds. The analogy is with DBI providing a standard database interface that uses RSQLite or RPostgreSQL underneath.
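I haven't used dds, so I won't guess at its API, but the DBI side of the analogy is easy to show. In this sketch all the calls are generic DBI functions, and RSQLite only appears as the backend passed to dbConnect. It assumes both packages are installed, and skips itself quietly if they aren't.

```r
row_count <- NULL
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # The backend is chosen here, and only here.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Everything after the connection uses the generic DBI interface.
  DBI::dbWriteTable(con, "mtcars", mtcars)
  row_count <- nrow(DBI::dbReadTable(con, "mtcars"))
  DBI::dbDisconnect(con)
}
```

Swapping in a different backend should, in principle, only mean changing the dbConnect line.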

Ryan Hafen of Tessera talked about their product (I think also called Tessera) for analysing large datasets. It’s a fancy wrapper to MapReduce that also has distributed data objects. I didn’t get a chance to ask if they support the dds interface. The R packages of interest are datadr and trelliscope.

My own talk was less technical than the others today. It consisted of a series of rants about things I don’t like about R, and how to fix them. The topics included how to remove quirks from the R language (please deprecate indexing with factors), helping new users (let the R community create some vignettes to go in base-R), and how to improve CRAN (CRAN is not a code hosting service, CRAN is a shop for packages). I don’t know if any of my suggestions will be taken up, but my last slide seemed to generate some empathy.
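For anyone wondering what is wrong with indexing by a factor, here is a small example of my own. The index uses the factor's underlying integer codes, not its labels, so you can silently get the wrong element.

```r
x <- c(a = 1, b = 2, c = 3)

# The label is "c", but because of the level order its integer code is 1.
f <- factor("c", levels = c("c", "b", "a"))

by_factor <- x[f]               # uses code 1, so returns the element named "a"
by_label <- x[as.character(f)]  # uses the label, so returns the element named "c"
```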

I’ve named my CRAN submission process “The Workflow of Infinite Shame”. What tends to happen is that I check that things work on my machine, submit to CRAN, and about an hour later get a response saying “we see these errors, please fix”. Quite often, especially for things involving locales or writing files, I cannot reproduce the issue, so I fiddle about a bit and guess, then resubmit. After five or six iterations, I’ve lost all sense of dignity, and while R-core are very patient, I’m sure they assume that I’m an idiot.

CRAN currently includes a Win-builder service that lets you submit packages, builds them under Windows, then tells you the results. What I want is an everything-builder service that builds and checks my package on all the necessary platforms (Windows, OS X, Linux, BSD, Solaris on R-release, R-patched, and R-devel), and only if it passes does a member of R-core get to see the problem. That way, R-core’s time isn’t wasted, and more importantly I look like less of an idiot.

A flow diagram of CRAN submission steps with an infinite loop

The workflow of infinite shame encapsulates my CRAN submission process.

Thoughts on R’s Terrible, Horrible, No Good, Very Bad Documentation

10th March, 2015 15 comments

Book cover from Alexander and the Terrible, Horrible, No Good, Very Bad Day by Judith Viorst
A couple of days ago Pete Werner had a rant about the state of R’s documentation. A lot of it was misguided, but it had some legitimate complaints, and the fact that people can perceive R’s documentation as being bad (whether accurate or not) is important in itself.

The exponential growth in R’s popularity means that a large proportion of its users are beginners. The demographic also increasingly includes people who don’t come from a traditional statistics or data analysis background – I work with biologists and chemists to whom R is a secondary skill after their lab work.

All this means that I think it’s important for the R community to have an honest discussion about how it can make itself accessible for beginners, and dispel the notion that it is hard to learn.

Function help pages

Pete’s big complaint was that the ? function help pages only act as a reference; they aren’t useful for finding functions if you don’t know what you want already. This is pretty dumb; every programming language worth using has a similar function-level reference system, and they are never used for finding new functions. I happen to think that R’s function-level references are, on the whole, pretty good. The fact that you can’t get a package submitted to CRAN without dozens of checks on the documentation means that all functions have at least their usage documented, and most have some description and examples too.

Searching for functions

His complaint that it is hard to find a function when you don’t know the name carries a little more weight.

He gives the example of trying to find a function to create an identity matrix. ??identity returns many irrelevant things, but he spots the function identity. Of course, identity isn’t what he wants; it just returns its input, so Pete gives up.

I agree that identity should have a “see also” link to tell people to use diag instead if they want an identity matrix. After reading Pete’s post I filed a bug report, and 3 hours later, Martin Maechler made the change to R’s documentation. All fixed.
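For the record, this is the difference between the two functions:

```r
# identity just hands back whatever it was given.
same_thing <- identity(3)

# diag with a single number n creates the n-by-n identity matrix.
identity_matrix <- diag(3)
```

Here same_thing is just the number 3, while identity_matrix is a 3-by-3 matrix with ones on the diagonal and zeroes elsewhere.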

While R’s function-level documentation is fairly mature, there is definitely more scope for linking between pages to point people in the right direction. If you think that there is a missing link between help pages, write a comment, and I’ll file a new bug with the collated suggestions. Similarly, there are a few functions that could use better examples. Feel free to comment about those.

The failure of ??identity to find an identity matrix function is unfortunately typical. ??"identity matrix" would have been a fairer search, and it gets rid of most of the rubbish, but still doesn’t find diag.

In general, I find that ?? isn’t great, unless I’m searching for a fairly obscure term. I also don’t see an easy fix for that. Fortunately, there’s an alternative. I use Rseek as my first choice tool for finding new functions. In this case, the first result for a search for “identity matrix” is a blog post entitled “How do I Create the Identity Matrix in R?”, which gives the right answer.

When I teach R to beginners, Rseek gets mentioned in lesson one. It is absolutely fundamental to R usage. So I don’t believe that finding the right function to use is a big problem in R either, except to new users who don’t know about Rseek.

The thing is, there’s a way to fix that. Rseek, as far as I know, is entirely run by Sasha Goodman right now. If he gets hit by a bus, several million R users are going to be stuck. This is a big vulnerability to R, and I think it’s time that Rseek became an official R project.

I should also mention that R has other built-in ways of finding functions beyond ??, and as Pete linked to, Pat Burns’ guide to them is excellent.
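As a taster of those built-in tools, here is a minimal sketch using apropos, which searches the names of objects on the search path. It only helps if you can guess part of the name, which is rather the point of the complaint above. The variable names are mine.

```r
# Find objects whose names start with "diag".
matches <- apropos("^diag")
diag_found <- "diag" %in% matches

# Related tools, left as comments because they open help windows or a browser:
# help.search("identity matrix")   # the long form of ??"identity matrix"
# RSiteSearch("identity matrix")   # searches online help pages and vignettes
```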

Concept-level documentation

Pete’s final complaint was that there is a lack of concept-level documentation. That is, how do you string several functions together to achieve a task?

Actually, there is a lot of concept-level documentation around; it just comes in many forms, and you have to learn what those forms are.

demo() brings up a list of demonstrations of how to do particular tasks. This command appears in the startup text when R loads, so there is no excuse for not knowing about it. There are only 16 of them though, so I think that these are worth revisiting for expansion.

browseVignettes() brings up a list of vignettes. These are short documents on a particular task. Many packages have them, and it is a good idea to read them when you start using a new package.
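Both demo() and browseVignettes() are meant for interactive use. If you would rather have the lists as ordinary R objects, something like the following sketch should work; it relies on the results component of the object that demo and vignette return when no topic is given, and the exact contents will depend on your installation.

```r
# Names of the demos that ship with the base package.
base_demos <- demo(package = "base")$results[, "Item"]

# Names of the vignettes that ship with grid (may be empty on a minimal install).
grid_vignettes <- vignette(package = "grid")$results[, "Item"]
```

To run one of them, pass the name back in, for example demo("recursion", package = "base").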

The base-R packages, other than grid and Matrix, aren’t well represented with vignettes. Much of the content that would have gone into vignettes appears in the manual Introduction to R, but there is definite room for improvement. For example, a vignette on subsetting or basic plotting might stave off a few questions to the r-help mailing list.

Another point to remember is that R-core only consists of 20 people (and I’m not sure how many of those are still actively working on R), so much of the how-to documentation has been created by the users. There are a ridiculous number of free resources available; just take a look at the Stack Overflow R Tag Info page.

tl;dr

  1. R’s function-level documentation is mostly very good. There are a few “see also”s missing, and some of the examples could be improved.
  2. The built-in facilities to find a function aren’t usually as successful as searching on Rseek. I think Rseek ought to be an official R project.
  3. Concept-level documentation is covered by demos and vignettes, though I think there should be a few more of these in base-R.

Update: Andrie de Vries tweeted me to say that Google has gotten better at returning R-related content, so searching for [r] "identity matrix" returns what you want, and in fact r "identity matrix" does too.
