Posts Tagged ‘sig’

Webcast on Writing Great R Code

19th September, 2013

While I’m promoting things, you might also want to know that I’m doing a webcast on how to write great R code next Wednesday. It’s at 6pm British Summer Time or 10am Pacific Daylight Time.

The big problem with being a data scientist is that you have to be a statistician and a programmer, which is really two full-time jobs, and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.

You’ll get to hear about:

  • Writing stylish code.
  • Finding bad functions with the sig package.
  • Writing robust code with the assertive package.
  • Testing your code with the testthat package.
  • Documenting your code with the roxygen2 package.

It’s going to be pitched at a beginner’s-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.

The Secrets of Inverse Brogramming, reprise

27th July, 2013

Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.

At useR2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here is a very quick rundown of the ideas.

With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.

The thing is, though, software developers have spent decades figuring out how to make writing code easier, so there are lots of tips and tricks that you can borrow.

My software dev bible is Steve McConnell’s Code Complete. Read it! It will change your life.

The minor downside is that, although very readable, it’s about 850 pages, so it takes some getting through.

The good news is that there are a couple of simple things that you can do that I think have the highest productivity-to-effort ratio.

Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.

There’s a style guide over the page, here.

Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)

The sig package helps you determine whether your code lives up to this ideal.

The sig function prints a function signature.

library(sig)
sig(read.csv)
## read.csv <- function(file, header = TRUE, sep = ",", quote = "\"", dec
##         = ".", fill = TRUE, comment.char = "", ...)

So far, so unexciting. The args and formals functions do much the same thing. (In fact the sig function is just a wrapper to formals with a pretty print method.)
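For comparison, here is roughly what the base R equivalents look like. This is just a sketch using formals and args on read.csv; the variable name read_csv_formals is made up for the example.

```r
# formals() returns the arguments as a named list of their default values.
read_csv_formals <- formals(read.csv)
names(read_csv_formals)

# Individual defaults can be pulled out by name.
read_csv_formals$header

# args() returns a function with the same arguments and a NULL body,
# so printing it shows the signature.
args(read.csv)
```

The difference with sig is mostly presentation: you get the function name and its arguments together in one tidy line.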

It gets more interesting when you look at the signatures of lots of functions together. list_sigs prints all the sigs from a file or environment.

list_sigs(pkg2env(tools))
## add_datalist <- function(pkgpath, force = FALSE)
## bibstyle <- function(style, envir, ..., .init = FALSE, .default =
##         TRUE)
## buildVignettes <- function(package, dir, lib.loc = NULL, quiet =
##               TRUE, clean = TRUE, tangle = FALSE)
## check_packages_in_dir <- function(dir, check_args = character(),
##                      check_args_db = list(), reverse = NULL,
##                      check_env = character(), xvfb = FALSE, Ncpus =
##                      getOption("Ncpus", 1), clean = TRUE, ...)

Even from just the first four signatures, you can see that the tools package has a style problem. add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.
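You can do this sort of eyeballing by machine too. The following is a rough sketch, not something from the sig package: classify_name is a hypothetical helper, and the three regular expressions are my own guess at what counts as each style.

```r
# Hypothetical helper: guess the naming convention of each function name.
# perl = TRUE makes [a-z] and [A-Z] behave the same way in every locale.
classify_name <- function(x) {
  ifelse(
    grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)+$", x, perl = TRUE),
    "lower_under_case",
    ifelse(
      grepl("^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$", x, perl = TRUE),
      "lowerCamelCase",
      ifelse(
        grepl("^[a-z][a-z0-9]*$", x, perl = TRUE),
        "lowercase",
        "other"
      )
    )
  )
}

# The four function names from the output above.
fn_names <- c(
  "add_datalist", "bibstyle", "buildVignettes", "check_packages_in_dir"
)
styles <- classify_name(fn_names)

# Tabulate how many names fall into each style.
table(styles)
```

If the table has more than one non-empty category, the code has the same style problem.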

I don’t want to pick on the tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts, but you really don’t want to write your own code like this.

write_sigs is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.
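If you don’t have the sig package to hand, a rough version of the game can be put together with base R. This is only a sketch: sig_line is a hypothetical helper, the two toy functions are invented for the example, and the output is less tidy than what write_sigs gives you.

```r
# Hypothetical helper (not from the sig package): build a one-line
# signature string for the function called `name` in environment `env`.
sig_line <- function(name, env) {
  f <- get(name, envir = env)
  # args() returns a function with the same arguments and a NULL body,
  # so its deparsed form is the header followed by a final "NULL" line.
  header <- deparse(args(f))
  header <- header[-length(header)]
  paste0(name, " <- ", paste(trimws(header), collapse = " "))
}

# A toy environment standing in for your own package.
my_env <- new.env()
assign("read_data", function(path, header = TRUE) NULL, envir = my_env)
assign("fit_model", function(data, formula, ...) NULL, envir = my_env)

fn_names <- sort(ls(my_env))
sigs <- vapply(fn_names, sig_line, character(1), env = my_env)

# Write the signatures to a file that you can hand to a colleague.
sig_file <- tempfile(fileext = ".txt")
writeLines(sigs, sig_file)
```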

There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)
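Both metrics are easy to approximate in base R. In this sketch, count_args and count_lines are hypothetical helpers, the line count is taken from the deparsed function rather than the original source file, and the thresholds are arbitrary.

```r
# Hypothetical helpers, using base R only.
# Number of input arguments: the length of the formals list.
count_args <- function(f) length(formals(f))
# Number of lines: the length of the deparsed function.
count_lines <- function(f) length(deparse(f))

# A couple of functions to measure.
fns <- list(read.csv = read.csv, read.table = read.table)
metrics <- data.frame(
  n_args  = vapply(fns, count_args, integer(1)),
  n_lines = vapply(fns, count_lines, integer(1))
)

# Flag anything over some (arbitrary) thresholds.
metrics[metrics$n_args > 10 | metrics$n_lines > 50, ]
```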

The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.

sig_report(
  pkg2env(Hmisc),
  too_many_args = 25,
  too_many_lines = 200
)
## The environment contains 509 variables of which 504 are functions. 
## Distribution of the number of input arguments to the functions:
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
##   4  62 117  90  41  41  22  17  14  17  15  10   8   4   6   3   1 
##  17  18  19  20  21  22  23  24  27  28  30  33  35  48  66 
##   4   5   3   2   4   2   2   2   2   1   1   1   1   1   1 
## These functions have more than 25 input args:
## [1] dotchart2                     event.chart                  
## [3] labcurve                      latex.default                
## [5] latex.summary.formula.reverse panel.xYplot                 
## [7] rlegend                       transcan                     
## Distribution of the number of lines of the functions:
##          1          2      [3,4]      [5,8]     [9,16]    [17,32] 
##          1         47         15         57         98        108 
##    [33,64]   [65,128]  [129,256]  [257,512] [513,1024] 
##         81         58         30          8          1 
## These functions have more than 200 lines:
##  [1] areg                         aregImpute                  
##  [3] event.chart                  event.history               
##  [5] format.df                    labcurve                    
##  [7] latex.default                panel.xYplot                
##  [9] plot.curveRep                plot.summary.formula.reverse
## [11] print.char.list              rcspline.plot               
## [13] redun                        rlegend                     
## [15] rm.boot                      sas.get                     
## [17] summary.formula              transcan
