Thoughts on R’s Terrible, Horrible, No Good, Very Bad Documentation

Book cover from Alexander and the Terrible, Horrible, No Good, Very Bad Day by Judith Viorst
A couple of days ago Pete Werner had a rant about the state of R’s documentation. A lot of it was misguided, but it had some legitimate complaints, and the fact that people can perceive R’s documentation as being bad (whether accurate or not) is important in itself.

The exponential growth in R’s popularity means that a large proportion of its users are beginners. The demographic also increasingly includes people who don’t come from a traditional statistics or data analysis background – I work with biologists and chemists to whom R is a secondary skill after their lab work.

All this means that I think it’s important for the R community to have an honest discussion about how it can make itself accessible for beginners, and dispel the notion that it is hard to learn.

Function help pages

Pete’s big complaint was that the ? function help pages only act as a reference; they aren’t useful for finding functions if you don’t know what you want already. This is pretty dumb; every programming language worth using has a similar function-level reference system, and they are never used for finding new functions. I happen to think that R’s function-level references are, on the whole, pretty good. The fact that you can’t get a package submitted to CRAN without dozens of checks on the documentation means that all functions have at least their usage documented, and most have some description and examples too.

Searching for functions

His complaint that it is hard to find a function when you don’t know the name carries a little more weight.

He gives the example of trying to find a function to create an identity matrix. ??identity returns many irrelevant things, but he spots the function identity. Of course, identity isn’t what he wants; it just returns its input, so Pete gives up.

I agree that identity should have a “see also” link to tell people to use diag instead if they want an identity matrix. After reading Pete’s post I filed a bug report, and 3 hours later, Martin Maechler made the change to R’s documentation. All fixed.
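To make the distinction concrete, here is a minimal sketch of the two functions side by side. The variable names are only illustrative.

```r
# identity() simply returns its argument unchanged.
x <- c(1, 2, 3)
same <- identity(x)

# diag(n), given a single positive integer n, builds an n-by-n identity matrix.
I3 <- diag(3)
print(I3)

# Multiplying by the identity matrix leaves a matrix unchanged.
m <- matrix(1:9, nrow = 3)
unchanged <- m %*% I3
```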

While R’s function-level documentation is fairly mature, there is definitely more scope for linking between pages to point people in the right direction. If you think that there is a missing link between help pages, write a comment, and I’ll file a new bug with the collated suggestions. Similarly, there are a few functions that could use better examples. Feel free to comment about those.

The failure of ??identity to find an identity matrix function is unfortunately typical. ??"identity matrix" would have been a fairer search, and it gets rid of most of the rubbish, but still doesn’t find diag.

In general, I find that ?? isn’t great, unless I’m searching for a fairly obscure term. I also don’t see an easy fix for that. Fortunately, there’s an alternative. I use Rseek as my first choice tool for finding new functions. In this case, the first result for a search for “identity matrix” is a blog post entitled “How do I Create the Identity Matrix in R?”, which gives the right answer.

When I teach R to beginners, Rseek gets mentioned in lesson one. It is absolutely fundamental to R usage. So I don’t believe that finding the right function to use is a big problem in R either, except to new users who don’t know about Rseek.

The thing is, there’s a way to fix that. Rseek, as far as I know, is entirely run by Sasha Goodman right now. If he gets hit by a bus, several million R users are going to be stuck. This is a big vulnerability to R, and I think it’s time that Rseek became an official R project.

I should also mention that R has other built-in ways of finding functions beyond ??, and as Pete linked to, Pat Burns’ guide to them is excellent.
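As a rough sketch of what a few of those built-in tools look like in use (this selection is mine and is not necessarily the set covered in Pat Burns’ guide):

```r
# apropos() searches the names of objects on the search path by regular expression.
diag_like <- apropos("diag")
print(diag_like)

# help.search() is the function behind ??; restricting the fields searched
# can cut down on irrelevant hits.
hits <- help.search("identity matrix", fields = c("title", "concept"))

# RSiteSearch() sends the query to an online search and opens the results in a
# browser, so it is left commented out here.
# RSiteSearch("identity matrix")
```

apropos() only helps when you can guess part of the function’s name, which is exactly the case where ?? tends to struggle as well.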

Concept-level documentation

Pete’s final complaint was that there is a lack of concept-level documentation. That is, how do you string several functions together to achieve a task?

Actually, there is a lot of concept-level documentation around; it just comes in many forms, and you have to learn what those forms are.

demo() brings up a list of demonstrations of how to do particular tasks. This command appears in the startup text when R loads, so there is no excuse for not knowing about it. There are only 16 of them though, so I think that these are worth revisiting for expansion.
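For example, a short sketch of listing the demos in one package programmatically, using the graphics package purely as an illustration:

```r
# With no topic given, demo() returns a listing of the available demos.
demos <- demo(package = "graphics")

# The listing stores its entries in a character matrix called 'results'.
print(demos$results[, c("Item", "Title"), drop = FALSE])

# Running a demo is interactive, so it is left commented out here.
# demo("graphics", package = "graphics")
```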

browseVignettes() brings up a list of vignettes. These are short documents on a particular task. Many packages have them, and it is a good idea to read them when you start using a new package.
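A minimal sketch of finding vignettes from the console, assuming you want to see what the grid package ships with (any installed package name would do):

```r
# With no topic given, vignette() returns a listing of the available vignettes.
vigs <- vignette(package = "grid")

# As with demo(), the entries are held in the 'results' matrix.
print(vigs$results[, c("Item", "Title"), drop = FALSE])

# Opening a vignette or the browser index is interactive, so these are commented out.
# vignette("grid", package = "grid")
# browseVignettes("grid")
```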

The base-R packages, other than grid and Matrix, aren’t well represented with vignettes. Much of the content that would have gone into vignettes appears in the manual Introduction to R, but there is definite room for improvement. For example, a vignette on subsetting or basic plotting might stave off a few questions to the r-help mailing list.

Another point to remember is that R-core only consists of 20 people (and I’m not sure how many of those are still actively working on R), so much of the how-to documentation has been created by the users. There are a ridiculous number of free resources available; just take a look at the Stack Overflow R Tag Info page.

tl;dr

  1. R’s function-level documentation is mostly very good. There are a few “see also”s missing, and some of the examples could be improved.
  2. The built-in facilities to find a function aren’t usually as successful as searching on Rseek. I think Rseek ought to be an official R project.
  3. Concept-level documentation is covered by demos and vignettes, though I think there should be a few more of these in base-R.

Update: Andrie de Vries tweeted me to say that Google has gotten better at returning R-related content, so searching for [r] "identity matrix" returns what you want, and in fact r "identity matrix" does too.

  1. Bid Shader
    11th March, 2015 at 0:54 am

    For a long time, I actively encouraged new people to learn R, but recently I’ve stopped doing so.

    Originally I thought this was because of the documentation – new users were often absolutely lost and frustrated by the quirks of the language, and I felt bad for introducing them to it when the documentation was so clearly haphazard. Thinking about this some more, I think that it’s impossible to criticize the documentation without criticizing the language itself. Yes, R’s documentation is a bit of a mess, but that’s because the language is a bit of a mess too. Old and new style graphics, old and new data management structures, old and new class systems. Packages that have identically named functions that mask themselves in the order in which they’re sourced. Some packages have vignettes, some don’t. Experienced users know all this, but inexperienced users often learn the hard way, unless they give up and obtain a copy of MATLAB.

    But some R people want it both ways – they like the attention grabbing nature of constant growth in the language but they don’t want to (or more likely: can’t) do the work to make the ecosystem more robust or actually lay the foundation to support those new users. It’s understandable – writing documentation is boring, thankless work and wouldn’t be cheap.

    But in the end, I think the community has made its bed. There are ~6000 packages in CRAN. Most of these should probably be on R-Forge, leaving the really robust and well supported packages in CRAN, where proper linked documentation can be put together without crippling investment in money and time. Unfortunately, it’s too late for that. 6000 packages is a source of pride for the growth of the language, but it’s also completely bewildering for new users, and absolutely crippling when you try to think about linked HTML documentation.

    Anyway, this is rambling now. My point was that it’s not just R’s documentation that has issues, it’s the entire language and community that has issues. I’ll continue to use the language myself, but I’ve stopped recommending it to others. R is just not a user friendly language any more. It’s too powerful, and too expansive, and it’s growing too fast.

    • 11th March, 2015 at 6:48 am

      Thanks for sharing your thoughts. I don’t agree that R being too powerful and growing too fast is a reason to not recommend it to others. Surely these are good things?

      On the other hand, finding the right package for the task is often a real problem – it’s a shame that http://crantastic.com isn’t more widely used. There are some areas where a few package maintainers really need to get together and merge their work to simplify things.

      • Bid Shader
        11th March, 2015 at 13:43 pm

        Right, I agree with you. My point wasn’t so much that I don’t recommend it because of the growth (that’s a great thing about R – it’s much more useful to me now than it was 5 years ago purely for this reason). My point was that the growth hasn’t been handled properly (not an insult to those involved – no one could have “planned” for such incredible growth), and so the ecosystem is now extremely intimidating to new users precisely for the reasons noted by me and many, many others.

        I simply can’t accept that the R documentation is good. If it works for you, then great (it usually works for me too, although with occasional frustrations), but I’ve heard too many new users mention valid issues for me to just say “if it works for me, the problem must be you, not the documentation”.

        • Bid Shader
          11th March, 2015 at 13:48 pm

          Just to follow up on this again – I should clarify that I am using R in an institutional environment. If, instead, we were saying “what’s the best language for a data science hacker or a home based quantitative trader?”, then I have no problem saying R dominates and the combination of documentation + web resources is more than sufficient.

          If on the other hand, we’re really talking about getting things done in a corporate environment, then this is something that needs to be dealt with, or else another language will come along and eat R’s lunch. If MATLAB wasn’t so ridiculously expensive, then that could work, but as it stands, R is not quite ready for prime time.

    • 11th March, 2015 at 12:36 pm

      The problem, as always, comes back to the fact that S, thence R, sought to implement both stat commands and a stat “programming language” in one syntax. Two distinct semantics in one syntax is guaranteed to fail at both. Moreover, without a BDFL to enforce structural rules (one may not like java’s hierarchical paradigm, but at least one knows what a . means), any bit of code’s name may be virtually anything. Traversing a thought out structure to find something you need, but don’t know whether it exists, is worth something.

      Task Views on CRAN is a step, but a bit late.

      Add in the sorta-kinda OO and sorta-kinda Functional memes, and one might get the impression that R folks have been seeking the mantles of au courant languages all the while continuing to produce FORTRAN in a C-like syntax.

      All of that said, with the purest of heart by the way, R is still a better choice than the mega-bucks alternatives. And not just because of the moolah difference, but because R users, to a greater extent than users of those other packages, seek to build open/reproducible analyses. That’s important. If only we could lock the Bayesians in the dungeon…

  2. 11th March, 2015 at 5:44 am

    Good points. Furthermore:

    install.packages("sos")
    library(sos)
    ???"identity matrix" # Surprise

    Cheers, Walter.

    • 11th March, 2015 at 7:26 am

      I like the sos package (in fact I’ve contributed to it), but ??? is awful for finding lots of completely unrelated functions. On my machine, ???'identity matrix' found 2210 matches, and base::diag didn’t make the top 400 that show up in the browser.

      I think internet searches have an unstoppable advantage over on-machine searches for discovering new functions.

  3. nicholastierney
    11th March, 2015 at 6:14 am

    Holy Crap! I’ve never seen Rseek before. Thanks!

  4. Steven S.
    11th March, 2015 at 9:46 am

    A thing that’s really missing in the online docs (be it vignettes, package references or function references) is images!!! Why are all graphical examples without the graphical result???

  5. Steven S.
    11th March, 2015 at 9:50 am

    @bid shader:
    I couldn’t agree more with your points about CRAN & function imports from packages overriding each other without any good errors or tracing. If R wants to keep growing much more, I believe CRAN & the whole package system itself will need to be redesigned. I do not believe this will happen, so that’s why (& for other reasons like performance, clarity, parallelism, …) I’m betting on Julia in the long run.

  6. Bernhard
    11th March, 2015 at 14:07 pm

    I use `?` often, mostly to see whether a parameter is called “na.rm”, “na.omit”, “pairwise.complete”, … Those are quirks of the language, as is the use of `.` and `_` and camelCase and so on, that make me look things up in `?` more often than I should need to. Have a look at the great work of “R Inferno” for even more problems that good documentation has to compensate for. `?` is often proposed to newbies, but it’s rarely useful for newbies.

    When I was new, I struggled with lots of the examples in `?`, because they used some built-in data set and did not say how to prepare my own data to be useful. OK, I grew into it and learned to read the list of arguments to find that out. When looking into new packages I try to read the vignettes – but more often than not, I find no vignette for a specific package on CRAN. Providing useful and beginner-friendly vignettes should definitely be promoted!

    I personally learned statistics and R at the same time. I can’t tell how often that happens. But help pages rely on the user understanding quite a lot of the statistics behind them. I can see now how that is good for the statistically literate, but if R is becoming the universal statistics solution for medical doctors, biologists, students, … it might be a point to consider.

    I do use Google quite a bit when working with R. Maybe I’ll try Rseek more in the future. I guess that’s what newbies use instead of the help pages, and many people have contributed a lot of newbie-friendly documentation on the net (Quick-R, for example), but I can understand the wish for some built-in newbie-friendly documentation within R.

  7. 12th March, 2015 at 12:54 pm

    The R built-in documentation is simply a function reference. It is not only useful but absolutely crucial for developers building packages for R, based on existing R functions. It is however, less useful for the novice.

    Besides the documentation already mentioned for concept-level documentation, there are many books entitled something like “[xxx] with R”. If you’re interested in, say, machine learning or time series, I think it’s only natural you’d buy at least one or a couple of authoritative textbooks, so why not for the R-based applications? You simply can’t expect R-core to write concept-level documentation for every possible application, just like no-one expects something like that from the committee that oversees development of the C standard.

    Of course one could argue that such documentation should be free (as in free beer) but that’s an off-topic discussion. Main point is: concept-level documentation _is_ available in many cases. You just sometimes have to pay a couple of bucks for it.

    A discussion of what books to read and what books not to read along with the other documentation options is a standard part of the R-courses I teach.

  8. sidjsb
    12th March, 2015 at 15:04 pm

    There are also higher-level task view pages that list similar packages with objective information for and against each, like this one:
    http://cran.r-project.org/web/views/Optimization.html

    There are also similar things for development, like R-Forge.

    Really, nothing can beat an “r cran ” Google search.

    When starting out, I found it hard to understand that functions are smarter than you think, such as the bracket operator being so heavily overloaded.
