Over the last week or two I’ve been pushing all my packages to CRAN.
pathological (for working with file paths), runittotestthat (for converting RUnit tests to testthat tests), and regex (for building regular expressions in a human readable way) all make their CRAN debuts.
assertive (for run-time testing your code) has more checks for the state of your R setup (r_has_png_capability, and many more), checks for the state of your variables (are_same_length, etc.), and utilities.
sig (for checking that your function signatures are sensible) now works with primitive functions too.
learningr (to accompany the book) has a reference URL fix but is otherwise the same.
I encourage you to take a look at some or all of them, and give me feedback.
The big problem with being a data scientist is that you have to be both a statistician and a programmer, which is really two full-time jobs and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.
You’ll get to hear about:
- Writing stylish code.
- Finding bad functions with the sig package.
- Writing robust code with the assertive package.
- Testing your code with the testthat package.
- Documenting your code with the roxygen2 package.
It’s going to be pitched at a beginner-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.
Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.
At useR2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here’s a very quick rundown of the ideas.
With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.
The thing is, though, that software developers have spent decades figuring out how to write code effectively, so there are lots of tips and tricks that can make your life easier.
My software dev bible is Steve McConnell’s Code Complete. Read it! It will change your life.
The minor downside is that, although very readable, it’s about 850 pages, so it takes some getting through.
The good news is that there are a couple of simple things you can do that I think have the highest productivity-to-effort ratio.
Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.
There’s a style guide over the page, here.
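As a tiny illustration (a made-up example, not taken from any particular guide), here is the same function written twice. The second version follows one consistent set of choices: a lower_under_case name, spaces around operators, and one statement per line.

```r
# Inconsistent style: cramped spacing, mixed-case naming.
getMeanSd<-function(x){c(mean=mean(x),sd=sd(x))}

# Consistent style: lower_under_case name, spaces around `<-` and `=`,
# body laid out one statement per line.
get_mean_sd <- function(x) {
  c(mean = mean(x), sd = sd(x))
}
```

Both versions behave identically; the point of a style guide is purely to make the second form habitual.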
Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)
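For example, a signature can document itself. This is a hypothetical function (not from any package) whose name and argument names tell you what it does and what it needs, without reading the body:

```r
# The signature alone documents the behaviour: rescale the numeric
# columns of a data frame, optionally centring and/or scaling them.
standardize_numeric_columns <- function(data, center = TRUE, scale = TRUE) {
  is_num <- vapply(data, is.numeric, logical(1))
  data[is_num] <- lapply(
    data[is_num],
    function(x) as.vector(scale(x, center = center, scale = scale))
  )
  data
}
```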
The sig package helps you determine whether your code lives up to this ideal. Its sig function prints a function signature.

```r
library(sig)
sig(read.csv)
## read.csv <- function(file, header = TRUE, sep = ",", quote = "\"",
##   dec = ".", fill = TRUE, comment.char = "", ...)
```
So far, so unexciting. The formals function does much the same thing. (In fact, the sig function is just a wrapper around formals with a pretty print method.)
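You can inspect the raw material yourself: formals returns the argument list of a closure as a pairlist, which is the information sig formats nicely.

```r
# formals() returns the arguments of a closure as a pairlist;
# names() gives the argument names, and the values are the defaults.
formals(read.csv)$sep        # the default separator, ","
names(formals(read.csv))     # "file" "header" "sep" ...
```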
It gets more interesting when you look at the signatures of lots of functions together. The list_sigs function prints all the sigs from a file or environment.
```r
list_sigs(pkg2env(tools))
## add_datalist <- function(pkgpath, force = FALSE)
##
## bibstyle <- function(style, envir, ..., .init = FALSE, .default =
##   TRUE)
##
## buildVignettes <- function(package, dir, lib.loc = NULL, quiet =
##   TRUE, clean = TRUE, tangle = FALSE)
##
## check_packages_in_dir <- function(dir, check_args = character(),
##   check_args_db = list(), reverse = NULL,
##   check_env = character(), xvfb = FALSE, Ncpus =
##   getOption("Ncpus", 1), clean = TRUE, ...)
```
Even from just the first four signatures, you can see that the tools package has a style problem: add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.
I don’t want to pick on the
tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts, but you really don’t want to write your own code like this.
The write_sigs function is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.
There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)
The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.
```r
sig_report(
  pkg2env(Hmisc),
  too_many_args  = 25,
  too_many_lines = 200
)
## The environment contains 509 variables of which 504 are functions.
## Distribution of the number of input arguments to the functions:
##  0  1   2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
##  4 62 117 90 41 41 22 17 14 17 15 10  8  4  6  3  1
## 17 18 19 20 21 22 23 24 27 28 30 33 35 48 66
##  4  5  3  2  4  2  2  2  2  1  1  1  1  1  1
## These functions have more than 25 input args:
## dotchart2 event.chart
## labcurve latex.default
## latex.summary.formula.reverse panel.xYplot
## rlegend transcan
## Distribution of the number of lines of the functions:
##       1       2    [3,4]    [5,8]   [9,16]   [17,32]
##       1      47       15       57       98       108
## [33,64] [65,128] [129,256] [257,512] [513,1024]
##      81       58       30         8          1
## These functions have more than 200 lines:
## areg aregImpute
## event.chart event.history
## format.df labcurve
## latex.default panel.xYplot
## plot.curveRep plot.summary.formula.reverse
## print.char.list rcspline.plot
## redun rlegend
## rm.boot sas.get
## summary.formula transcan
```