A couple of days ago Pete Werner had a rant about the state of R’s documentation. A lot of it was misguided, but it had some legitimate complaints, and the fact that people can perceive R’s documentation as being bad (whether accurate or not) is important in itself.
The exponential growth in R’s popularity means that a large proportion of its users are beginners. The demographic also increasingly includes people who don’t come from a traditional statistics or data analysis background – I work with biologists and chemists to whom R is a secondary skill after their lab work.
All this means that I think it’s important for the R community to have an honest discussion about how it can make itself accessible for beginners, and dispel the notion that it is hard to learn.
Function help pages
Pete’s big complaint was that the ?function help pages only act as a reference; they aren’t useful for finding functions if you don’t know what you want already. This is pretty dumb; every programming language worth using has a similar function-level reference system, and they are never used for finding new functions. I happen to think that R’s function-level references are, on the whole, pretty good. The fact that you can’t get a package submitted to CRAN without dozens of checks on the documentation means that all functions have at least their usage documented, and most have some description and examples too.
Searching for functions
His complaint that it is hard to find a function when you don’t know the name carries a little more weight.
He gives the example of trying to find a function to create an identity matrix. ??identity returns many irrelevant things, but he spots the function identity. Of course, identity isn’t what he wants; it just returns its input, so Pete gives up.
I agree that identity should have a “see also” link to tell people to use diag instead if they want an identity matrix. After reading Pete’s post I filed a bug report, and 3 hours later, Martin Maechler made the change to R’s documentation. All fixed.
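To make the distinction concrete, here is a minimal sketch in base R contrasting the two functions (the variable names are only for illustration):

```r
# identity() simply hands back whatever it is given
same_value <- identity(3)

# diag() with a single positive integer builds an identity matrix of that size
identity_matrix <- diag(3)

print(same_value)
print(identity_matrix)
```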
While R’s function-level documentation is fairly mature, there is definitely more scope for linking between pages to point people in the right direction. If you think that there is a missing link between help pages, write a comment, and I’ll file a new bug with the collated suggestions. Similarly, there are a few functions that could use better examples. Feel free to comment about those.
The failure of ??identity to find an identity matrix function is unfortunately typical. ??"identity matrix" would have been a fairer search, and it gets rid of most of the rubbish, but still doesn’t find diag.
In general, I find that ?? isn’t great, unless I’m searching for a fairly obscure term. I also don’t see an easy fix for that. Fortunately, there’s an alternative. I use Rseek as my first choice tool for finding new functions. In this case, the first result for a search for “identity matrix” is a blog post entitled “How do I Create the Identity Matrix in R?”, which gives the right answer.
When I teach R to beginners, Rseek gets mentioned in lesson one. It is absolutely fundamental to R usage. So I don’t believe that finding the right function to use is a big problem in R either, except to new users who don’t know about Rseek.
The thing is, there’s a way to fix that. Rseek, as far as I know, is entirely run by Sasha Goodman right now. If he gets hit by a bus, several million R users are going to be stuck. This is a big vulnerability to R, and I think it’s time that Rseek became an official R project.
I should also mention that R has other built-in ways of finding functions beyond ??, and as Pete linked to, Pat Burns’ guide to them is excellent.
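As a rough sketch of a few of those built-in tools (the search terms are just examples, and what comes back depends on your R version and installed packages):

```r
# apropos() lists objects on the search path whose names match a pattern
diag_like <- apropos("diag")
print(diag_like)

# help.search() is the function behind ??. It is restricted to the base
# package here to keep it quick; drop the package argument to search everything.
hits <- help.search("identity matrix", package = "base")
print(head(hits$matches))

# RSiteSearch() opens an online search in a browser, so it is left commented out
# RSiteSearch("identity matrix")
```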
Pete’s final complaint was that there is a lack of concept-level documentation. That is, how do you string several functions together to achieve a task?
Actually, there is a lot of concept-level documentation around; it just comes in many forms, and you have to learn what those forms are.
demo() brings up a list of demonstrations of how to do particular tasks. This command appears in the startup text when R loads, so there is no excuse for not knowing about it. There are only 16 of them though, so I think that these are worth revisiting for expansion.
browseVignettes() brings up a list of vignettes. These are short documents on a particular task. Many packages have them, and it is a good idea to read them when you start using a new package.
The base-R packages, other than Matrix, aren’t well represented with vignettes. Much of the content that would have gone into vignettes appears in the manual Introduction to R, but there is definite room for improvement. For example, a vignette on subsetting or basic plotting might stave off a few questions to the r-help mailing list.
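If you would rather inspect these listings from code than in a viewer, a minimal sketch follows. It assumes the graphics and grid packages (both ship with R) and relies on demo() and vignette() returning an object with a results matrix when called without a topic:

```r
# Listing of demos in one package, as a character matrix
available_demos <- demo(package = "graphics")$results

# Listing of vignettes in one package (may be empty on a stripped-down install)
available_vignettes <- vignette(package = "grid")$results

# drop = FALSE keeps the matrix shape even when there is only one row
print(available_demos[, c("Package", "Item", "Title"), drop = FALSE])
print(available_vignettes[, c("Package", "Item", "Title"), drop = FALSE])
```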
Another point to remember is that R-core only consists of 20 people (and I’m not sure how many of those are still actively working on R), so much of the how-to documentation has been created by the users. There are a ridiculous number of free resources available; just take a look at the Stack Overflow R Tag Info page.
- R’s function level documentation is mostly very good. There are a few “see also”s missing, and some of the examples could be improved.
- The built-in facilities to find a function aren’t usually as successful as searching on Rseek. I think Rseek ought to be an official R project.
- Concept-level documentation is covered by demos and vignettes, though I think there should be a few more of these in base-R.
Update: Andrie de Vries tweeted me to say that Google has gotten better at returning R-related content, so searching for [r] "identity matrix" returns what you want, and in fact r "identity matrix" does too.
Over the last week or two I’ve been pushing all my packages to CRAN.
pathological (for working with file paths), runittotestthat (for converting RUnit tests to testthat tests), and regex (for building regular expressions in a human-readable way) all make their CRAN debuts.
assertive (for run-time testing of your code) has more checks for the state of your R setup (r_has_png_capability, and many more), checks for the state of your variables (are_same_length, etc.), and utilities.
sig (for checking that your function signatures are sensible) now works with primitive functions too.
learningr (to accompany the book) has a reference URL fix but is otherwise the same.
I encourage you to take a look at some or all of them, and give me feedback.
A couple of months ago I spotted that the examples for the paste function weren’t very good, and actually, there were quite a few functions that new users of R are likely to encounter that weren’t well explained.
The important thing to remember is that R is a communist ecosystem: all R users are equal, except those in the ruling R Core Team Party. If you want anything to happen, you need to persuade a member of R Core.
If the change can be made by creating or adding to a package, you’ll probably find it easier to do that. R Core members all have day jobs, and get a lot of crappy requests, so expect your request to be considered very low priority.
Of course, there is a certain amount of psychology involved in this. If R Core have heard of you, then your chances increase a little. Spend some time discussing things on r-devel, upload some packages to CRAN, and say hello to them at useR. (I mostly avoid r-devel in favour of Stack Overflow, and spend a fair amount of time getting packages rejected from CRAN, but I have said hello to a few of them.)
There are three ways of making your request to R Core:
File a bug report
If you have found a problem that you can reproduce, and can articulate, and seems possible to fix in a reasonable amount of time, then file a bug report.
My success rate with bugs is slightly better than fifty-fifty: 12 out of 22 were closed as fixed, and one more was closed as won’t-fix but was then fixed anyway. This means that you need to be psychologically prepared for rejection: not all your fixes will be made, and you shouldn’t get upset when they aren’t.
There are many pitfalls involved in filing a bug report, so if you want to stand any chance of success, then try to follow these guidelines.
– You absolutely must have a reproducible example of the behaviour that you want to change. If you can’t reproduce it, you aren’t ready to submit the bug report.
– Speculation as to the cause of a problem is a cultural no-no. Brian Ripley will tell you off if you do this. Stick to the facts.
– If the problem involves external libraries (the Windows API for example), then your chance of getting a successful change is drastically reduced, and you may want to consider other methods of contacting R Core before you file a bug report.
– The same is true if your proposed change involves changing the signature of a base R function (even if it doesn’t conflict with existing behaviour).
– Make it as easy as you can for a change to be made. Your probability of success decays exponentially with the amount of time it takes for the fix to be made. (This is true of any software project, not just R.)
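As a sketch of what a reproducible example might contain, a report can boil down to a few self-contained lines plus details of the R setup. The behaviour shown below is documented and only stands in for a real problem; the variable names are hypothetical.

```r
# Hypothetical report: paste() turns a missing value into the text "NA"
input <- c("a", NA)
observed <- paste(input, collapse = "")
print(observed)

# Include the R version and platform, since behaviour can differ between setups
session_details <- sessionInfo()
print(session_details$R.version$version.string)
```

Anyone reading the report can paste these lines into a fresh session and see the same result, which is the point of the first guideline.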
Start a discussion on the r-devel mailing list
If you can’t consistently reproduce a problem, or if there are several possible behaviours that could be right, or there is some sort of trade-off involved in making a change, then you should ask here first.
Having a discussion on r-devel means that other community members get to discuss the pros and cons of your idea, and allows people to test your idea on other operating systems/versions of R, and with other use cases.
Contact a member of R Core directly
If your problem is fuzzy, and you aren’t having any luck on r-devel, then you might want to try contacting a member of R Core directly. (I want to emphasise that you should usually try r-devel first, and email is less invasive than a phone call or turning up at their house.)
Most bits of R fall under general R Core maintenance but some specialist packages have a single maintainer, and direct contact is likely to be more successful for these.
For example, for anything related to the grid package, you need to speak to Paul Murrell. For anything related to lattice, you want Deepayan Sarkar. For codetools, you want Luke Tierney. Find out who the maintainers are with:
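One way to look this up, sketched here under the assumption that the three packages are installed, uses maintainer() and packageDescription() from the utils package:

```r
# Packages named above
packages <- c("grid", "lattice", "codetools")

# Maintainer field of each DESCRIPTION file (NA if a package is not installed)
maintainers <- vapply(packages, maintainer, character(1))
print(maintainers)

# For packages maintained by R Core as a whole, the Author field may be more telling
grid_author <- packageDescription("grid")$Author
print(grid_author)
```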
In the end, you may use several methods of communication: I submitted a bug report about the paste examples, and then had some follow up email with Martin Maechler about updating further examples.
In terms of time-scales, I’ve had a few bugs fixed the same day, but in general expect things to take weeks to months, especially for bigger changes.
Last month I was ranting about the state of some of the examples in base-R, particularly those for paste.
Martin Maechler has now kindly taken my suggested examples and added them into R. Hopefully this will reduce the number of newbie questions about “how do I join these strings together”.
Since Martin showed some interest in improving the state of the examples, I’ve given him updated content for another 30 or so help pages, and some Christmas homework of getting them into R too!
Then I looked at example(paste), and it turns out that it’s not very good at all. There isn’t even an example of how to use the collapse argument. Considering that paste is one of the first functions that beginners come across, as well as being a little bit tricky (getting to understand the difference between the sep and collapse arguments takes a bit of thinking about when you are new), this seems like a big oversight.
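For anyone meeting the two arguments for the first time, a minimal sketch of the difference (these are my own illustrative inputs, not the examples submitted to R):

```r
# sep goes between the arguments, element by element, so the result
# is still a vector with one string per element
with_sep <- paste(c("a", "b"), c("x", "y"), sep = "-")

# collapse joins the elements of the result into a single string
with_collapse <- paste(c("a", "b"), collapse = "-")

# Used together: sep is applied first, then collapse
with_both <- paste(c("a", "b"), 1:2, sep = "_", collapse = "+")

print(with_sep)       # "a-x" "b-y"
print(with_collapse)  # "a-b"
print(with_both)      # "a_1+b_2"
```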
I’ve submitted this as a bug, with a suggested improvement to the examples. Fingers crossed that R-core will accept the update, or something like it.
It got me thinking though, how many other base functions could do with better examples? I had a quick look at some common functions that beginners seem to get confused with, and the following all have fairly bad example sections:
If you have half an hour spare, have a go at writing a better example page for one of these functions, or any other function in the base distribution, then submit it to the bug tracker. (If you aren’t sure that your examples are good enough, or you need advice, try posting what you have on r-devel before submitting a bug report. Dealing with bug reports takes up valuable R-core time, so you need to be sure of quality first.)
This seems like a really easy way to make R more accessible for beginners.
Regular expressions are an amazing tool for working with character data, but they are also painful to read and write. Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.
Consequently, I’ve created the regex package for human-readable regular expression generation. It’s currently only on GitHub (CRAN version arriving as soon as you give me feedback), so you can get it with:
library(devtools)
install_github("regex", "richierocks")
Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:
I need ls, with a pattern that matches punctuation. So I open the ?regex help page and look for the character class for punctuation. My first attempt is then:
ls(baseenv(), pattern = "[:punct:]")
Ok, wait, the class has to be wrapped in square brackets itself.
ls(baseenv(), pattern = "[[:punct:]]")
Better, but that’s matching S3 classes and some functions too. I want to match only where there’s punctuation at the start. What’s the anchor for the start? Back to reading ?regex. Sod it, there’s too much text here; it’s probably a dollar sign.
ls(baseenv(), pattern = "$[[:punct:]]")
Hmm, nope. Must be a caret.
ls(baseenv(), pattern = "^[[:punct:]]")
Hurrah! Still, it took me 5 minutes for a simple example. For something more complicated like matching email addresses or telephone numbers or particular time formats, building regular expressions this way can become time consuming and frustrating. Here’s the equivalent syntax using regex.
ls(baseenv(), pattern = START %c% punct())
START is just a constant that returns a caret. The %c% operator is a wrapper to
punct is a function returning a group of punctuation. You can pass it arguments to match multiple punctuation characters. For example, punct(3, 5) matches between 3 and 5 punctuation characters.
You also get lower-level functions. punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).
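To give a feel for how such helpers could fit together, here is a hypothetical re-implementation in base R. It is not the package’s source: every definition below (including the use of paste0 for joining) is an assumption that merely reproduces the patterns quoted in the text.

```r
# Hypothetical stand-ins for the package's constants and functions
START <- "^"
PUNCT <- "[:punct:]"

# Assumed here to be plain string concatenation
`%c%` <- function(x, y) paste0(x, y)

# Wrap a character class in square brackets
group <- function(x) paste0("[", x, "]")

# Append a repetition quantifier such as {3,5}
repeated <- function(x, lo, hi) paste0(x, "{", lo, ",", hi, "}")

# With no arguments, match one punctuation character; otherwise lo to hi of them
punct <- function(lo, hi) {
  if (missing(lo)) group(PUNCT) else repeated(group(PUNCT), lo, hi)
}

operator_pattern <- START %c% punct()
print(operator_pattern)  # "^[[:punct:]]"
print(punct(3, 5))       # "[[:punct:]]{3,5}"
print(head(ls(baseenv(), pattern = operator_pattern)))
```

The appeal of the approach is that each piece is an ordinary string, so the result can be passed to any function that accepts a regular expression.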
As a more complicated example, you can match an email address like:
one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c% "@" %c% one_or_more(group(ASCII_ALNUM %c% ".-")) %c% DOT %c% ascii_alpha(2, 4)
This reads: match one or more letters, numbers, dots, underscores, percents, plusses or hyphens. Then match an ‘at’ symbol. Then match one or more letters, numbers, dots, or hyphens. Then match a dot. Then match two to four letters.
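For comparison, a hand-written pattern that follows the same reading might look like the sketch below. This is my own transcription for illustration, not necessarily the exact string the package generates, and the test addresses are made up.

```r
# One or more of the allowed characters, an at sign, a domain,
# a literal dot, then two to four letters
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}"

candidates <- c("richie@example.com", "not an email")
is_email_like <- grepl(email_pattern, candidates)
print(is_email_like)  # TRUE FALSE
```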
There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation. I’m already rather excited about how much easier regular expressions have become for me to use.
In the apply family of functions, rapply is the unloved ginger stepchild. While vapply makes regular appearances in my code, and tapply has occasional cameo appearances, in ten years of R coding, I’ve never once found a good use for rapply.
Maybe once a year I take a look at the help page, decide it looks too complicated, and ignore the function again. So today I was very pleased to have found a genuine use for the function. It isn’t life-changing, but it’s quite cute.
Complex classes often have a print method that hides their internals. For example, regression models created by glm are lists with thirty elements, but their print method displays only the call, the coefficients and a few statistics.
# From example(glm)
utils::data(anorexia, package = "MASS")
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt), family = gaussian, data = anorexia)
str(anorex.1)
To see everything, you need to use unclass.
unclass(anorex.1) # many pages of output
unclass has a limitation: it only removes the top level class, so subelements keep their classes. For example, compare:
class(unclass(anorex.1)$qr) # qr
class(unclass(anorex.1$qr)) # list
Using rapply, we can remove classes throughout the whole of the object, turning it into a list of simple objects.
rapply(anorex.1, unclass, how = "replace")
As well as allowing us to thoroughly inspect the contents of the object, it also allows the object to be used with other code that doesn’t understand particular classes.
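The same idea can be tried on a small hand-made object. The sketch below uses a hypothetical "record" class of my own and only checks the non-list elements, which are the ones rapply passes to the function:

```r
# A small nested object with classed elements at two depths
record <- structure(
  list(
    when = as.Date("2014-01-01"),
    details = list(group = factor("a"))
  ),
  class = "record"
)

# unclass() only strips the top level, so the date keeps its class
top_only <- unclass(record)
print(class(top_only$when))  # "Date"

# rapply() applies unclass() to every non-list element, however deeply nested
plain <- rapply(record, unclass, how = "replace")
print(class(plain$when))           # "numeric"
print(class(plain$details$group))  # "integer"
```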