Everyone loves R markdown and Github; stories from the R Summit, day two

28th June, 2015 6 comments

More excellent talks today!

Andrie de Vries of Microsoft kicked off today’s talks with a demo of checkpoint. This is his package for assisting reproducibility by letting you install packages from a specific date.

The idea is that a lot of R packages, particularly those from PhD projects, don’t get maintained, and suffer bitrot. That means that they often don’t work with current versions of R and current packages.

Since I now work in a university, and our team is trying to make sure we release an accompanying package of the data analysis steps with every paper, having a system for reproducibility is important to me. I’ve played around with packrat, and it is nice but a bit too much effort to bother with on a day-to-day basis. I’ve not used checkpoint before, but from Andrie’s demo it seems a little easier. You just add these lines to the top of your script:

library(checkpoint)
checkpoint("2014-08-18", R.version = "3.0.3")

and it checks you are running the correct version of R, downloads the packages that your script uses from an archived version of CRAN as of the date you specified, and sets them up as a new library. Easy reproducibility.

Jeroen Ooms of UCLA (sort of) talked about his streaming suite of packages: curl, jsonlite and mongolite, for downloading web data, converting JSON to and from R objects, and working with MongoDB databases.

There have been other packages around for each of these three tasks, but the selling point is that Jeroen is a disgustingly talented coder and has written definitive versions.

jsonlite takes care to be consistent about the weird edge cases when converting between R and JSON. (The JSON spec doesn’t support infinity or NA, so you get a bit of control about what happens there, for example.)

jsonlite also supports ndjson, where each line is a JSON object. This is important for large files: you can just parse one line at a time, then return the whole thing as a list or data frame, or you can define a line-specific behaviour.

This last case is used for streaming functionality. You use stream_in to read an ndjson file, then parse a line, manipulate the result, maybe stream_out to somewhere else, then move on to the next line. It made me wish I was doing fancy things with the Twitter firehose.
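As a rough sketch of that streaming pattern (my own toy example, not code from Jeroen's talk), the following writes a small data frame as ndjson and reads it back a page at a time. The `readings` data, the `rows_seen` counter and the page size are made up for illustration, and the whole thing is skipped if jsonlite isn't installed.

```r
if (requireNamespace("jsonlite", quietly = TRUE)) {
  ndjson_file <- tempfile(fileext = ".ndjson")
  # Hypothetical data, just to have something to stream
  readings <- data.frame(id = 1:6, value = c(2.5, 3.1, 4.7, 1.2, 9.9, 0.4))

  # stream_out writes one JSON object per line (ndjson)
  jsonlite::stream_out(readings, file(ndjson_file), verbose = FALSE)

  # stream_in calls the handler on each page of rows instead of
  # returning everything as one big data frame
  rows_seen <- 0
  jsonlite::stream_in(
    file(ndjson_file),
    handler = function(page) {
      rows_seen <<- rows_seen + nrow(page)
    },
    pagesize = 2,
    verbose = FALSE
  )
  unlink(ndjson_file)
}
```

In a real pipeline the handler would do the manipulation and perhaps call stream_out on another connection, rather than just counting rows.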

The MongoDB stuff also sounded interesting; I didn’t quite get a grasp of what makes it better than the other mongo packages, but he gave some examples of fast document searching.

Gabor Csardi of Harvard University talked about METACRAN, which is his spare time hobby.

One of the big sticking points in many people’s work in R is trying to find the best package to do something. There are a lot of tools you can use for this: Task Views, rdocumentation.org, crantastic.org, rseek.org, MRAN, the sos package, and so on. However, none of them are very good at recommending the best package for a given task.

This is where METACRAN comes in. Gabor gave a nice demo where he showed that when you search for “networks”, it successfully returns his igraph package. Um.

Jokes aside, it does seem like a very useful tool. His site also gives you information on trending packages.

He also mentioned another project, github.com/cran, a read-only mirror of CRAN, that lets you see what’s been updated when a new version of a package reaches CRAN. (Each package is a repository and each new version on CRAN counts as one commit.)

Peter Dalgaard of Copenhagen Business School, R-Core member and organiser of the summit, talked about R development conventions and directions.

He mentioned that many of the development principles that have shaped R were decided back in 1999, when R was desperately trying to gain credibility with organisations using SAS and Stata. This is, for example, why the base R distribution contains packages like nnet and spatial. While many users may not need to use neural networks or spatial statistics, it was important for the fledgeling language to be seen to have these capabilities built-in.

Peter said that some user contributed packages that are ubiquitous are being considered for inclusion into the base distribution. data.table, Rcpp and plyr in particular were mentioned. The balancing act is that more effort would go into ensuring that these packages work with new versions of R, but it requires more effort from R-Core, which is a finite resource.

Some other things that Peter talked about were that R-core worry about whether or not their traditional approach of being very conservative with the code base is too strict, and slows R’s development; and whether they should be more aggressive about removing quirks. This last point was referencing my talk from yesterday, so I was pleased that he had been listening.

Joe Rickert of Microsoft talked about the R community. He pointed out that “I use Excel, you use Excel, we have so much in common” is a conversation no-one has ever had. Joe thinks that “community” is mostly meaningless marketing speak, but R has it for real. I’m inclined to agree.

He talked about the R Consortium, which I’ve somehow failed to hear about before. The point of the organisation is (mostly) to help build R infrastructure projects. The first big project is [REDACTED!], though official details are still secret, so there’s a bit of reading between the lines.

Users will be allowed to submit proposals for things for the consortium to build, and they get voted on (not sure if this is by users or the consortium members).

Joe also had a nice map of R User groups around the world, though my nearest one was several countries away. I guess I’d better start one of my own. If anyone in Qatar is interested in an R User Group, let me know in the comments.

Bettina Grün of Johannes Kepler Universität talked about the R Journal. She discussed the history of the journal, and the topics that you can write about: packages, programming hints, and applications of R. There is also some content about changes in R and CRAN, and conference announcements.

One thing I didn’t get to ask her is how you blind a review of a paper about a package, since the package author is usually pretty easy to determine. My one experience of reviewing for the R Journal involved a paper about an update to the grid package, and included links to content on the University of Auckland website, and it was pretty clear that the only person that could have written it was Paul Murrell.

Mine Çetinkaya-Rundel of Duke University gave an impressive overview of how she teaches R to her undergraduate students. She suggested that while teaching programming at the same time as data analysis seems like it ought to make it harder, running a few lines of code often takes less instruction than telling students where to point and click.

Her other teaching tips included: work on datasets that are big enough to make working in Excel annoying, so they appreciate programming (and R) more; use interactive examples; you get more engagement with real-world datasets; and force the students to learn a reproducible workflow by making them write R markdown documents.

Mine also talked briefly about her other projects: Datafest, which is a weekend-long data analysis competition, and reach, a Coursera data analysis course.

Jenny Bryan of the University of British Columbia had another talk about teaching R, this time to grad students. She’s developed a pretty slick workflow where each student submits their assignments (also R markdown documents) into GitHub repos, which makes it really easy to check and run their code, comment on it, give them hints via pull requests, and let them peer review each other’s code. Since using git seems to be an essential skill for data scientists these days, it seems like a good idea to teach it to them explicitly while at university.

Jenny also talked about methods of finding interesting code via GitHub search. An example she gave was that if you want to see how vapply works, rather than just limiting yourself to the examples on the ?vapply help page, you can search all the packages on CRAN by going to Gabor’s GitHub CRAN mirror (or maybe Winston Chang’s R-source mirror) and typing vapply user:cran extension:R.

Karthik Ram of the University of California, Berkeley, headlined the day, talking about his work with ROpenSci, which is an organisation that creates open tools for data analysis (mostly R packages).

They have a ridiculously extensive set of packages for downloading online datasets, retrieving text corpuses, publishing your results, and working with spatial data.

ROpenSci also host regular community events and group phone calls.

Overall, it was an exciting day, and now I’m looking forward to going to the useR conference. See you in Aalborg!


The Workflow of Infinite Shame, and other stories from the R Summit

27th June, 2015 5 comments

At day one of the R Summit at Copenhagen Business School there was a lot of talk about the performance of R, and alternate R interpreters.

Luke Tierney of the University of Iowa, the author of the compiler package, and R-Core member who has been working on R’s performance since, well, pretty much since R was created, talked about future improvements to R’s internals.

Plans to improve R’s performance include implementing proper reference counting (that is, tracking how many variables point at a particular bit of memory; the current version counts like zero/one/two-or-more, and a more accurate count means you can do less copying). Improving scalar performance and reducing function overhead are high priorities for performance enhancement. Currently when you do something like

for(i in 1:100000000) {}

R will allocate a vector of length 100000000, which takes a ridiculous amount of memory. By being smart and realising that you only ever need one number at a time, you can store the vector much more efficiently. The same principle applies for seq_len and seq_along.
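To make the "one number at a time" idea concrete, here is a small sketch of my own (not from Luke's slides). Both functions add up 1 to n; the second keeps a single counter rather than asking R for the whole index vector up front. The function names are hypothetical, and how much memory the first version really uses depends on your version of R.

```r
# Loop over an index vector created by seq_len
sum_with_for <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i
  }
  total
}

# Loop with a manually incremented counter: only one index
# value exists at any moment
sum_with_counter <- function(n) {
  total <- 0
  i <- 0
  while (i < n) {
    i <- i + 1
    total <- total + i
  }
  total
}

# Both approaches give the same answer
same_answer <- sum_with_for(1000) == sum_with_counter(1000)
```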

Other possible performance improvements that Luke discussed include having a more efficient data structure for environments, and storing the results of complex objects like model results more efficiently. (How often do you use that qr element in an lm model anyway?)

Tomas Kalibera of Northeastern University has been working on a tool for finding PROTECT bugs in the R internals code. I last spoke to Tomas in 2013 when he was working with Jan Vitek on the alternate R engine, FastR. See Fearsome Engines part 1, part 2, part 3. Since then FastR has become a purely Oracle project (more on that in a moment), and the Purdue University fork of FastR has been retired.

The exact details of the PROTECT macro went a little over my head, but the essence is that it is used to stop memory being overwritten, and it’s a huge source of obscure bugs in R.

Lukas Stadler of Oracle Labs is the heir to the FastR throne. The Oracle Labs team have rebuilt it on top of Truffle, an Oracle product for generating dynamically optimized Java bytecode that can then be run on the JVM. Truffle’s big trick is that it can auto-generate this byte code for a variety of languages: R, Ruby and JavaScript are the officially supported languages, with C, Python and SmallTalk as side-projects. Lukas claimed that peak performance (that is, “on a good day”) for Truffle-generated code is comparable to language-specific optimized code.

Non-vectorised code is the main beneficiary of the speedup. He had a cool demo where a loopy version of the sum function ran slowly, then Truffle learned how to optimise it, and the result became almost as fast as the built-in sum function.
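I don't have Lukas's demo code, but a "loopy" sum presumably looks something like the sketch below; `loopy_sum` is my own name for it. In GNU R you can compare its timing with the built-in sum using system.time, though the numbers will vary by machine so I haven't quoted any.

```r
# A deliberately non-vectorised sum: add the elements one at a time
loopy_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

x <- runif(1e5)

# The two versions agree up to floating point rounding
agrees_with_builtin <- isTRUE(all.equal(loopy_sum(x), sum(x)))

# Timings for comparison; results depend on your machine and R version
loopy_time <- system.time(loopy_sum(x))
builtin_time <- system.time(sum(x))
```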

He has a complaint that the R.h API from R to C is really an API from GNU R to C, that is, it makes too many assumptions about how GNU R works, and these don’t hold true when you are running a Java version of R.

Maarten-Jan Kallen from BeDataDriven works on Renjin, the other R interpreter built on top of the JVM. Based on his talk, and some other discussion with Maarten, it seems that there is a very clear mission-statement for Renjin: BeDataDriven just want a version of R that runs really fast inside Google App Engine. They also have an interesting use case for Renjin – it is currently powering software for the United Nations’ humanitarian effort in Syria.

Back to the technical details, Maarten showed an example where R 3.1.0 introduced the anyNA function as a fast version of any(is.na(x)). In the case of Renjin, this isn’t necessary since it works quickly anyway. (Though if Luke Tierney’s talk comes true, it won’t be needed in GNU R soon either.)
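For anyone who hasn't met anyNA, this little example shows that the two forms give the same answer in GNU R. The difference is that any(is.na(x)) has to build a full logical vector first, whereas anyNA can stop as soon as it finds a missing value. The test vector is made up.

```r
# A long vector with a single missing value near the start
x <- c(1, NA, rnorm(1e5))

slow_check <- any(is.na(x))  # allocates a logical vector as long as x
fast_check <- anyNA(x)       # can stop at the first NA

same_result <- identical(slow_check, fast_check)
```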

Calling external code still remains a problem for Renjin; in particular Rcpp and its reverse dependencies won’t build for it. The spread of Rcpp, he lamented, even includes roxygen2.

Hannes Mühleisen has also been working with BeDataDriven, and completed Maarten’s talk. He previously worked on integrating MonetDB with R, and has been applying his database expertise to Renjin. In the same way that when you run a query in a database, it generates a query plan to try and find the most efficient way of retrieving your results, Renjin now generates a query plan to find the most efficient way to evaluate your code. That means using a deferred execution system where you avoid calculating things until the last minute, and in some cases not at all because another calculation makes them obsolete.
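I can't show Renjin's query planner, but GNU R's lazy evaluation of function arguments gives a small-scale flavour of "not calculating things at all". In this sketch (the `pick` function is hypothetical), the second branch would throw an error if it were ever evaluated, yet the call succeeds because that argument is never needed.

```r
# Arguments are only evaluated when the function body touches them
pick <- function(use_first, first, second) {
  if (use_first) first else second
}

# The stop() call is never run, because 'second' is never used
result <- pick(TRUE, 1 + 1, stop("this is never evaluated"))
```

Renjin's deferred execution works across whole calculations rather than single arguments, so treat this as an analogy only.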

Karl Millar from Google has been working on CXXR. This was a bit of a shock to me. When I interviewed Andrew Runnalls in 2013, he didn’t really sell CXXR to me particularly well. The project goals at the time were to clean up the GNU R code base, rewriting it in modern C++, and documenting it properly to use as a reference implementation of R. It all seemed a bit of academic fun rather than anything useful. Since Google has started working on the project, the focus has changed. It is now all about having a high performance version of R.

I asked Karl why he chose CXXR for this purpose. After all, of the half a dozen alternate R engines, CXXR was the only one that didn’t explicitly have performance as a goal. His response was that it has nearly 100% code compatibility with GNU R, and that the code is so clear that it makes it easy to make changes.

That talk focussed on some of the difficulties of optimizing R code. For example, in the assignment

a <- b + c

you don’t know how long b and c are, or what their classes are, so you have to spend a long time looking things up. At runtime however, you can guess a bit better. b and c are probably the same size and class as what you used last time, so guess that first.

He also had a little dig at Tomas Kalibera’s work, saying that CXXR has managed to eliminate almost all the PROTECT macros in its codebase.

Radford Neal talked about some optimizations in his pqR project, which uses both interpreted and byte-compiled code.

In interpreted code, pqR uses a “variant result” mechanism, which sounded similar to Renjin’s “deferred execution”.

A performance boost comes from having a fast interface to unary primitives. This also makes eval faster. Another boost comes from a smarter way to not look for variables in certain frames. For example, a list of which frames contain overrides for special symbols (+, [, if, etc.) is maintained, so calling them is faster.

Matt Dowle (“the data.table guy”) of H2O gave a nice demo of H2O Flow, a slick web-based GUI for machine learning on big datasets. It does lots of things in parallel, and is scriptable.

Indrajit Roy of HP Labs and Michael Lawrence of Genentech gave a talk on distributed data structures. These seem very good for cases where you need to access your data from multiple machines.

The SparkR package gives access to distributed data structures with a Spark backend, however Indrajit wasn’t keen, saying that it is too low-level to be easy to work with.

Instead he and Michael have developed the dds package that gives a standard interface for using distributed data structures. The package lies on top of Spark.dds and distributedR.dds. The analogy is with DBI providing a standard database interface that uses RSQLite or RPostgreSQL underneath.

Ryan Hafen of Tessera talked about their product (I think also called Tessera) for analysing large datasets. It’s a fancy wrapper to MapReduce that also has distributed data objects. I didn’t get a chance to ask if they support the dds interface. The R packages of interest are datadr and trelliscope.

My own talk was less technical than the others today. It consisted of a series of rants about things I don’t like about R, and how to fix them. The topics included how to remove quirks from the R language (please deprecate indexing with factors), helping new users (let the R community create some vignettes to go in base-R), and how to improve CRAN (CRAN is not a code hosting service, CRAN is a shop for packages). I don’t know if any of my suggestions will be taken up, but my last slide seemed to generate some empathy.
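In case the factor indexing quirk isn't familiar, here is a small made-up example. When you index with a factor, R uses the underlying integer codes rather than the labels, so you can silently get the wrong elements.

```r
heights <- c(a = 1, b = 2, c = 3)
wanted <- factor(c("c", "a"))  # levels are "a", "c", so the codes are 2, 1

# Indexing with the factor uses the integer codes 2 and 1
by_factor <- heights[wanted]

# Converting to character first indexes by name, as you probably intended
by_label <- heights[as.character(wanted)]

names(by_factor)  # "b" "a"
names(by_label)   # "c" "a"
```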

I’ve named my CRAN submission process “The Workflow of Infinite Shame”. What tends to happen is that I check that things work on my machine, submit to CRAN, and about an hour later get a response saying “we see these errors, please fix”. Quite often, especially for things involving locales or writing files, I cannot reproduce the issue, so I fiddle about a bit and guess, then resubmit. After five or six iterations, I’ve lost all sense of dignity, and while R-core are very patient, I’m sure they assume that I’m an idiot.

CRAN currently includes a Win-builder service that lets you submit packages, builds them under Windows, then tells you the results. What I want is an everything-builder service that builds and checks my package on all the necessary platforms (Windows, OS X, Linux, BSD, Solaris on R-release, R-patched, and R-devel), and only if it passes does a member of R-core get to see the submission. That way, R-core’s time isn’t wasted, and more importantly I look like less of an idiot.

A flow diagram of CRAN submission steps with an infinite loop

The workflow of infinite shame encapsulates my CRAN submission process.


Thoughts on R’s Terrible, Horrible, No Good, Very Bad Documentation

10th March, 2015 15 comments

Book cover from Alexander and the Terrible, Horrible, No Good, Very Bad Day by Judith Viorst
A couple of days ago Pete Werner had a rant about the state of R’s documentation. A lot of it was misguided, but it had some legitimate complaints, and the fact that people can perceive R’s documentation as being bad (whether accurate or not) is important in itself.

The exponential growth in R’s popularity means that a large proportion of its users are beginners. The demographic also increasingly includes people who don’t come from a traditional statistics or data analysis background – I work with biologists and chemists to whom R is a secondary skill after their lab work.

All this means that I think it’s important for the R community to have an honest discussion about how it can make itself accessible for beginners, and dispel the notion that it is hard to learn.

Function help pages

Pete’s big complaint was that the ? function help pages only act as a reference; they aren’t useful for finding functions if you don’t know what you want already. This is pretty dumb; every programming language worth using has a similar function-level reference system, and they are never used for finding new functions. I happen to think that R’s function-level references are, on the whole, pretty good. The fact that you can’t get a package submitted to CRAN without dozens of checks on the documentation means that all functions have at least their usage documented, and most have some description and examples too.

Searching for functions

His complaint that it is hard to find a function when you don’t know the name carries a little more weight.

He gives the example of trying to find a function to create an identity matrix. ??identity returns many irrelevant things, but he spots the function identity. Of course, identity isn’t what he wants; it just returns its input, so Pete gives up.

I agree that identity should have a “see also” link to tell people to use diag instead if they want an identity matrix. After reading Pete’s post I filed a bug report, and 3 hours later, Martin Maechler made the change to R’s documentation. All fixed.
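To spell out the difference between the two functions, here is a quick base R comparison (my own example, not from Pete's post):

```r
# identity just hands back whatever you give it
unchanged <- identity(3)

# diag with a single number n creates an n-by-n identity matrix
identity_matrix <- diag(3)

# Multiplying by the identity matrix leaves another matrix unchanged
m <- matrix(1:9, nrow = 3)
still_m <- m %*% identity_matrix
```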

While R’s function-level documentation is fairly mature, there is definitely more scope for linking between pages to point people in the right direction. If you think that there is a missing link between help pages, write a comment, and I’ll file a new bug with the collated suggestions. Similarly, there are a few functions that could use better examples. Feel free to comment about those.

The failure of ??identity to find an identity matrix function is unfortunately typical. ??"identity matrix" would have been a fairer search, and it gets rid of most of the rubbish, but still doesn’t find diag.

In general, I find that ?? isn’t great, unless I’m searching for a fairly obscure term. I also don’t see an easy fix for that. Fortunately, there’s an alternative. I use Rseek as my first choice tool for finding new functions. In this case, the first result for a search for “identity matrix” is a blog post entitled “How do I Create the Identity Matrix in R?”, which gives the right answer.

When I teach R to beginners, Rseek gets mentioned in lesson one. It is absolutely fundamental to R usage. So I don’t believe that finding the right function to use is a big problem in R either, except to new users who don’t know about Rseek.

The thing is, there’s a way to fix that. Rseek, as far as I know, is entirely run by Sasha Goodman right now. If he gets hit by a bus, several million R users are going to be stuck. This is a big vulnerability to R, and I think it’s time that Rseek became an official R project.

I should also mention that R has other built-in ways of finding functions beyond ??, and as Pete linked to, Pat Burns’ guide to them is excellent.
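Two of those built-in tools are apropos, which searches the names of objects in attached packages using a regular expression, and find, which tells you where an object lives. This is only a taster of my own choosing; the search term is just an example.

```r
# Names of all visible objects that start with "diag"
diag_like <- apropos("^diag")

# Which attached package provides diag?
diag_location <- find("diag")
```

Of course this only helps once you have a rough idea of the name, which is exactly the limitation being complained about.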

Concept-level documentation

Pete’s final complaint was that there is a lack of concept-level documentation. That is, how do you string several functions together to achieve a task?

Actually, there is a lot of concept-level documentation around; it just comes in many forms, and you have to learn what those forms are.

demo() brings up a list of demonstrations of how to do particular tasks. This command appears in the startup text when R loads, so there is no excuse for not knowing about it. There are only 16 of them though, so I think that these are worth revisiting for expansion.

browseVignettes() brings up a list of vignettes. These are short documents on a particular task. Many packages have them, and it is a good idea to read them when you start using a new package.

The base-R packages, other than grid and Matrix, aren’t well represented with vignettes. Much of the content that would have gone into vignettes appears in the manual Introduction to R, but there is definite room for improvement. For example, a vignette on subsetting or basic plotting might stave off a few questions to the r-help mailing list.
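If you would rather get at these lists programmatically than scroll through them, both demo and vignette return an object whose results element is a matrix of what is available. This is a small sketch; what it finds depends on your R installation, so I haven't shown any output.

```r
# Demos that ship with the base package
base_demos <- demo(package = "base")$results

# Vignettes that ship with the grid package
grid_vignettes <- vignette(package = "grid")$results

# The Item column holds the names you would pass to demo() or vignette()
demo_names <- base_demos[, "Item"]
vignette_names <- grid_vignettes[, "Item"]
```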

Another point to remember is that R-core only consists of 20 people (and I’m not sure how many of those are still actively working on R), so much of the how-to documentation has been created by the users. There are a ridiculous number of free resources available; just take a look at the Stack Overflow R Tag Info page.


  1. R’s function level documentation is mostly very good. There are a few “see also”s missing, and some of the examples could be improved.
  2. The built-in facilities to find a function aren’t usually as successful as searching on Rseek. I think Rseek ought to be an official R project.
  3. Concept-level documentation is covered by demos and vignettes, though I think there should be a few more of these in base-R.

Update: Andrie de Vries tweeted me to say that Google has gotten better at returning R-related content, so searching for [r] "identity matrix" returns what you want, and in fact r "identity matrix" does too.

Many package updates on CRAN

4th February, 2015 2 comments

Over the last week or two I’ve been pushing all my packages to CRAN.

pathological (for working with file paths), runittotestthat (for converting RUnit tests to testthat tests), and rebus (formerly regex, for building regular expressions in a human readable way) all make their CRAN debuts.

assertive, for run-time testing your code, has more checks for the state of your R setup (is_r_devel, is_rstudio, r_has_png_capability, and many more), checks for the state of your variables (are_same_length, etc.), and utilities (dont_stop).

sig (for checking that your function signatures are sensible) now works with primitive functions too.

learningr (to accompany the book) has a reference URL fix but is otherwise the same.

I encourage you to take a look at some or all of them, and give me feedback.

How do you get things into base-R?

15th January, 2015 2 comments

A couple of months ago I spotted that the examples for the paste function weren’t very good, and actually, there were quite a few functions that new users of R are likely to encounter, that weren’t well explained.

I’ve now managed to get some updated examples into R (paste, sum, NumericConstants, pie, a couple dozen more functions hopefully to follow), and a few people have asked how I did it.

The important thing to remember is that R is a communist ecosystem: all R users are equal, except those in the ruling R Core Team Party. If you want anything to happen, you need to persuade a member of R Core.

If the change can be made by creating or adding to a package, you’ll probably find it easier to do that. R Core members all have day jobs, and get a lot of crappy requests, so expect your request to be considered very low priority.

Of course, there is a certain amount of psychology involved in this. If R Core have heard of you, then your chances increase a little. Spend some time discussing things on r-devel, upload some packages to CRAN, and say hello to them at useR. (I mostly avoid r-devel in favour of Stack Overflow, spend a fair amount of time getting packages rejected from CRAN, but I have said hello to a few of them.)

There are three ways of making your request to R Core:

File a bug report

If you have found a problem that you can reproduce, and can articulate, and seems possible to fix in a reasonable amount of time, then file a bug report.

My success rate with bugs is slightly better than fifty-fifty: 12 out of 22 were closed as fixed, and one more was closed as won’t-fix but was then fixed anyway. This means that you need to be psychologically prepared for rejection: not all your fixes will be made, and you shouldn’t get upset when they aren’t.

There are many pitfalls involved in filing a bug report, so if you want to stand any chance of success, then try to follow these guidelines.

– You absolutely must have a reproducible example of the behaviour that you want to change. If you can’t reproduce it, you aren’t ready to submit the bug report.

– Speculation as to the cause of a problem is a cultural no-no. Brian Ripley will tell you off if you do this. Stick to the facts.

– If the problem involves external libraries (the Windows API for example), then your chance of getting a successful change is drastically reduced, and you may want to consider other methods of contacting R Core before you file a bug report.

– The same is true if your proposed change involves changing the signature of a base R function (even if it doesn’t conflict with existing behaviour).

– Make it as easy as you can for a change to be made. Your probability of success decays exponentially with the amount of time it takes for the fix to be made. (This is true of any software project, not just R.)

Start a discussion on the r-devel mailing list

If you can’t consistently reproduce a problem, or if there are several possible behaviours that could be right, or there is some sort of trade-off involved in making a change, then you should ask here first.

Having a discussion on r-devel means that other community members get to discuss the pros and cons of your idea, and allows people to test your idea on other operating systems/versions of R, and with other use cases.

Contact a member of R Core directly

If your problem is fuzzy, and you aren’t finding any luck on r-devel, then you might want to try contacting a member of R Core directly. (I want to emphasise that you should usually try r-devel first, and email is less invasive than a phone call or turning up at their house.)

Most bits of R fall under general R Core maintenance but some specialist packages have a single maintainer, and direct contact is likely to be more successful for these.

For example, for anything related to the grid package, you need to speak to Paul Murrell. For anything related to lattice, you want Deepayan Sarkar. For codetools, you want Luke Tierney. Find out who the maintainers are with:

lapply(dir(R.home("library")), packageDescription)
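That call prints the whole description of every package, which is a lot to read. A slightly more targeted sketch, assuming that everything in R.home("library") is a package with a DESCRIPTION file, pulls out just the Maintainer field:

```r
base_library <- R.home("library")
pkgs <- dir(base_library)

# Extract only the Maintainer field for each package;
# as.character guards against a missing field being returned as a logical NA
maintainers <- vapply(
  pkgs,
  function(pkg) {
    as.character(
      packageDescription(pkg, lib.loc = base_library, fields = "Maintainer")
    )
  },
  character(1)
)
```

The result is a named character vector, so maintainers[["grid"]] gives the maintainer of grid.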

In the end, you may use several methods of communication: I submitted a bug report about the paste examples, and then had some follow up email with Martin Maechler about updating further examples.

In terms of time-scales, I’ve had a few bugs fixed the same day, but in general expect things to take weeks to months, especially for bigger changes.


Update on improving examples in base-R

24th December, 2014 2 comments

Last month I was ranting about the state of some of the examples in base-R, particularly the paste function.

Martin Maechler has now kindly taken my suggested examples and added them into R. Hopefully this will reduce the number of newbie questions about “how do I join these strings together”.

Since Martin showed some interest in improving the state of the examples, I’ve given him updated content for another 30 or so help pages, and some Christmas homework of getting them into R too!


Improving base-R examples

25th November, 2014 11 comments

Earlier today I saw the hundred bazillionth question about how to use the paste function. My initial response was “take a look at example(paste) to see how it works”.

Then I looked at example(paste), and it turns out that it’s not very good at all. There isn’t even an example of how to use the collapse argument. Considering that paste is one of the first functions that beginners come across, as well as being a little bit tricky (getting to understand the difference between the sep and collapse arguments takes a bit of thinking about when you are new), this seems like a big oversight.
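For the record, this is the sort of thing I mean (these are illustrative examples, not necessarily the ones I submitted). sep goes between the arguments of paste, while collapse joins the elements of the resulting vector into a single string.

```r
# sep separates the arguments
with_sep <- paste("a", "b", "c", sep = "-")
# "a-b-c"

# collapse joins the elements of one vector into a single string
with_collapse <- paste(c("a", "b", "c"), collapse = "-")
# "a-b-c"

# Both together: sep is applied first, elementwise, then collapse
with_both <- paste(c("x", "y"), 1:2, sep = "_", collapse = " + ")
# "x_1 + y_2"

# Without collapse you get one string per element
without_collapse <- paste(c("x", "y"), 1:2, sep = "_")
# "x_1" "y_2"
```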

I’ve submitted this as a bug, with a suggested improvement to the examples. Fingers crossed that R-core will accept the update, or something like it.

It got me thinking though, how many other base functions could do with better examples? I had a quick look at some common functions that beginners seem to get confused by, and the following all have fairly bad example sections:
In base: browser, get, seq
In stats: formula, lm, runif, t.test
In graphics: plot
In utils: download.file, read.table

If you have half an hour spare, have a go at writing a better example page for one of these functions, or any other function in the base distribution, then submit it to the bug tracker. (If you aren’t sure that your examples are good enough, or you need advice, try posting what you have on r-devel before submitting a bug report. Dealing with bug reports takes up valuable R-core time, so you need to be sure of quality first.)

This seems like a really easy way to make R more accessible for beginners.

Regular expressions for everyone else

25th September, 2014 Leave a comment

Regular expressions are an amazing tool for working with character data, but they are also painful to read and write.  Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.

Consequently, I’ve created the regex package for human readable regular expression generation.  It’s currently only on github (CRAN version arriving as soon as you give me feedback), so you can get it with:

library(devtools)
install_github("richierocks/regex")

Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:

I need ls, with a pattern that matches punctuation.  So I open the ?regex help page and look for the character class for punctuation.  My first attempt is then:

ls(baseenv(), pattern = "[:punct:]")

Ok, wait, the class has to be wrapped in square brackets itself.

ls(baseenv(), pattern = "[[:punct:]]")

Better, but that’s matching S3 classes and some functions too.  I want to match only where there’s punctuation at the start.  What’s the anchor for the start?  Back to reading ?regex.  Sod it, there’s too much text here; it’s probably a dollar sign.

ls(baseenv(), pattern = "$[[:punct:]]")

Hmm, nope.  Must be a caret.

ls(baseenv(), pattern = "^[[:punct:]]")

Hurrah!  Still, it took me 5 minutes for a simple example.  For something more complicated like matching email addresses or telephone numbers or particular time formats, building regular expressions this way can become time consuming and frustrating.  Here’s the equivalent syntax using regex.

ls(baseenv(), pattern = START %c% punct())

START is just a constant that returns a caret. The %c% operator is a wrapper to paste0, and punct is a function returning a group of punctuation.  You can pass it arguments to match multiple punctuation characters.  For example punct(3, 5) matches between 3 and 5 punctuation characters.

You also get lower-level functions.  punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).
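To make the idea concrete, here is a minimal base-R sketch of how helpers like these could be written. These are simplified stand-ins based on the description above, not the package's actual source; in particular, treating a single argument as "exactly that many" is my assumption.

```r
# Stand-in for the constant: a caret anchors the match at the start
START <- "^"

# Stand-in for the concatenation operator: a thin wrapper around paste0
`%c%` <- function(x, y) paste0(x, y)

# Stand-in for punct: the punctuation class, optionally with a repeat count
# (assumption: one argument means "exactly lo" characters)
punct <- function(lo = NULL, hi = lo) {
  if (is.null(lo)) {
    return("[[:punct:]]")
  }
  paste0("[[:punct:]]{", lo, ",", hi, "}")
}

pattern <- START %c% punct()
pattern                                             # "^[[:punct:]]"
starts_with_punct <- grepl(pattern, c("!=", "%in%", "paste"))
starts_with_punct                                   # TRUE TRUE FALSE
```

The point is only that each piece is an ordinary string, so the final pattern can be passed to ls, grepl or anything else that takes a regular expression.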

As a more complicated example, you can match an email address like:

one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c%
  "@" %c%
  one_or_more(group(ASCII_ALNUM %c% ".-")) %c%
  DOT %c%
  ascii_alpha(2, 4)

This reads: match one or more letters, numbers, dots, underscores, percents, plusses or hyphens. Then match an ‘at’ symbol. Then match one or more letters, numbers, dots, or hyphens. Then match a dot. Then match two to four letters.
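For comparison, here is roughly what that description looks like written by hand as a plain regular expression. I am assuming that the ASCII constants expand to the usual a-z, A-Z and 0-9 ranges; the string the package really generates may differ in detail.

```r
# Hand-written equivalent of the description above (assumed expansion)
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}"

is_email <- grepl(email_pattern, c("someone@example.com", "not an email"))
is_email  # TRUE FALSE
```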

There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation.  I’m already rather excited about how much easier regular expressions have become for me to use.

Finally, a use for rapply

15th July, 2014 2 comments

In the apply family of functions, rapply is the unloved ginger stepchild. While lapply, sapply and vapply make regular appearances in my code, and apply and tapply have occasional cameo appearances, in ten years of R coding, I’ve never once found a good use for rapply.

Maybe once a year I take a look at the help page, decide it looks too complicated, and ignore the function again. So today I was very pleased to have found a genuine use for the function. It isn’t life-changing, but it’s quite cute.

Complex classes often have a print method that hides their internals. For example, regression models created by glm are lists with thirty elements, but their print method displays only the call, the coefficients and a few statistics.

# From example(glm)
utils::data(anorexia, package = "MASS")
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
                family = gaussian, data = anorexia)

To see everything, you need to use unclass.

unclass(anorex.1) #many pages of output

unclass has a limitation: it only removes the top level class, so subelements keep their classes. For example, compare:

class(unclass(anorex.1)$qr) # qr
class(unclass(anorex.1$qr)) # list

Using rapply, we can remove classes throughout the whole of the object, turning it into a list of simple objects.

rapply(anorex.1, unclass, how = "replace")

As well as allowing us to thoroughly inspect the contents of the object, it also allows the object to be used with other code that doesn’t understand particular classes.
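The same thing can be seen on a much smaller scale without fitting a model. This is a sketch using a made-up nested list (the "record" class and the element names are invented for the example).

```r
# A small classed list whose elements have classes of their own
nested <- structure(
  list(
    when  = as.Date("2014-07-15"),
    inner = list(f = factor(c("a", "b")), n = 1:3)
  ),
  class = "record"
)

# unclass only strips the top level, so the Date survives
top_only <- unclass(nested)
class(top_only$when)      # "Date"

# rapply visits every non-list element and unclasses each one
stripped <- rapply(nested, unclass, how = "replace")
class(stripped$when)      # "numeric"
class(stripped$inner$f)   # "integer"
```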


Automatically convert RUnit tests to testthat tests

12th May, 2014 2 comments

There’s a new version of my assertive package, for sanity-checking code, on its way to CRAN. The release has been delayed a while, since my previous attempt at an upload met with an error that was only generated on the CRAN machine, but not on my own. The problem lay with some code designed to autorun the RUnit tests for the package. After fiddling for a while and getting nowhere, I decided it was time to make the switch to testthat.

I’ve been a long-time RUnit user, since the syntax is near-identical to every other xUnit variant in every programming language. So switching between RUnit and MATLAB xUnit or NUnit requires no thinking. testthat has a couple of important advantages over RUnit though.

Firstly, test_package makes it easy to, ahem, test your package. A page of code for finding and nicely displaying bad tests has been reduced to test_package("assertive").

Secondly, testing for warnings is much cleaner. In RUnit, you have to use convoluted mechanisms like:

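test.sqrt.negative_numbers.throws_a_warning <- function() {
  old_ops <- options(warn = 2); on.exit(options(old_ops))
  checkException(sqrt(-1))  # warnings are now errors, so this check passes
}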
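The trick relies on options(warn = 2) promoting warnings to errors. Here is a small base-R sketch of that mechanism on its own, with tryCatch standing in for the test framework; the outcome variable is just for illustration.

```r
# Promote warnings to errors, remembering the previous setting
old_ops <- options(warn = 2)

outcome <- tryCatch(
  {
    sqrt(-1)      # normally only warns that NaNs were produced
    "no error"
  },
  error = function(e) "error raised"
)

# Put the warning level back the way it was
options(old_ops)

outcome  # "error raised"
```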

The testthat equivalent is more readable:

test_that("sqrt throws a warning for negative number inputs",
  { expect_warning(sqrt(-1)) })

Thirdly, testthat caches tests, so you spend less time waiting for tests that you know are fine to rerun.

These benefits mean that I’ve been meaning to switch packages for a while. The big problem was that the assertive package contains over 300 unit tests. At about a minute or so to update each test, that was five hours of tedious work that I couldn’t be bothered to do. Instead, I spent two days making a package that automatically converts RUnit tests to testthat tests. Not exactly a time saving, but it was more fun.

It isn’t on CRAN yet, but you can get it from github.


The package contains functions to convert tests on an individual/file/package basis.

convert_test takes an RUnit test function and returns a call to test_that. Here’s an example for the sqrt function.

test.sqrt.3.returns_1.732 <- function()
{
  x <- 3
  expected <- 1.73205080756888
  checkEquals(sqrt(x), expected)
}
convert_test(test.sqrt.3.returns_1.732)
## test_that("test.sqrt.3.returns_1.732", {
##   x <- 3
##   expected <- 1.73205080756888
##   expect_equal(expected, sqrt(x))
## })
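If you are curious how this sort of rewriting can work at all, here is a toy sketch in base R that walks the body of a function and renames the check calls. It is only an illustration of the general technique, not how runittotestthat is implemented: the lookup table and the rename_checks function are invented for the example, and it makes no attempt to swap argument order, rename msg to info, or cope with empty arguments such as x[, 1].

```r
# Hypothetical lookup table: RUnit check name -> testthat expectation name
runit_to_testthat <- c(
  checkEquals    = "expect_equal",
  checkTrue      = "expect_true",
  checkIdentical = "expect_identical",
  checkException = "expect_error"
)

# Recursively walk an expression, renaming any call found in the lookup table
rename_checks <- function(expr, lookup = runit_to_testthat) {
  if (!is.call(expr)) {
    return(expr)  # constants and names are left alone
  }
  # Convert the pieces of the call first, so nested checks are found too
  parts <- lapply(as.list(expr), rename_checks, lookup = lookup)
  fn <- parts[[1]]
  if (is.name(fn) && as.character(fn) %in% names(lookup)) {
    parts[[1]] <- as.name(lookup[[as.character(fn)]])
  }
  as.call(parts)
}

runit_test <- function() {
  x <- 3
  checkEquals(sqrt(x), 1.73205080756888)
}
converted <- rename_checks(body(runit_test))
converted
```

Because R code is itself a data structure, the loops and if blocks in a test body are just more calls to recurse into.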

convert_test works with more complicated test functions. You can have multiple checks, nested inside if blocks or loops if you really want.

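test.some_complicated_nonsense.returns_an_appropriate_testthat_test <- function() {
  x <- 6:10
  for(i in 1:5) {
    if(i %% 2 == 0) {
      checkTrue(all(x > i), msg = "i divisible by 2")
      if(i == 4) {
        checkIdentical(4, i, msg = "i = 4")
      } else {
        while(i > 0) {
          checkIdentical(2, i, msg = "i = 2")
        }
        repeat {
          checkException(stop("!!!"))
          break
        }
      }
    }
  }
}
convert_test(test.some_complicated_nonsense.returns_an_appropriate_testthat_test)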
## test_that("test.some_complicated_nonsense.returns_an_appropriate_testthat_test", 
## {
##   x <- 6:10
##   for (i in 1:5) {
##     if (i%%2 == 0) {
##       expect_true(all(x > i), info = "i divisible by 2")
##       if (i == 4) {
##         expect_identical(i, 4, info = "i = 4")
##       }
##       else {
##         while (i > 0) {
##           expect_identical(i, 2, info = "i = 2")
##         }
##         repeat {
##           expect_error(stop("!!!"))
##           break
##         }
##       }
##     }
##   }
## })

Of course, the main use for this is converting whole files or packages at a time, so runittotestthat contains convert_test_file and convert_package_tests for this purpose. By default (so you don’t overwrite your RUnit tests by mistake), they write their output to the console, but you can also write the resulting testthat tests to a file. Converting all 300 of assertive’s tests was as easy as

convert_package_tests("assertive",
  test_file_regexp = "^test", 
  testthat_files = paste("new-", runit_files)
)

In that line of code, the runit_files variable is a special name that refers to the names of the files that contain your RUnit tests. It means that the output testthat file names can be based upon the original input names.
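A special name like this is usually done by capturing the argument unevaluated and then evaluating it somewhere the name is defined. Here is a minimal sketch of that general technique; name_outputs is a hypothetical function made up for the example, not part of runittotestthat, and I have used paste0 so that no space is inserted after the prefix.

```r
# Hypothetical helper: output_files may refer to runit_files, which only
# exists while the expression is being evaluated inside this function
name_outputs <- function(input_files, output_files) {
  output_expr <- substitute(output_files)  # capture the expression, unevaluated
  eval(output_expr, list(runit_files = input_files), parent.frame())
}

new_names <- name_outputs(
  c("test-a.R", "test-b.R"),
  paste0("new-", runit_files)
)
new_names  # "new-test-a.R" "new-test-b.R"
```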

Although runittotestthat works fine on all my tests, automatic code editing is a tricky task, so there may be some weird edge cases that I’ve missed. Please download the package and play with it, and let me know if you find any bugs.

Update: runittotestthat is on CRAN these days.