Archive

Posts Tagged ‘r’

useR2011 Easy interactive ggplots talk

17th August, 2011 10 comments

I’m talking tomorrow at useR! on making ggplots interactive with the gWidgets GUI framework. For those of you at useR, here is the code and data, so you can play along on your laptops. For everyone else, I’ll make the slides available in the next few days so you can see what you missed. Note that for confidentiality reasons, I’ve added some random junk to the sample values, so please don’t use this data for actual research. (If you are interested in chemical exposure data, contact the lab.)

Chromium data. Nickel data. Code for examples 1 to 4. Once we introduce example 5, the rest of the code needs a little reworking.

Update: The permissions on the code files were blocking downloads from people who aren’t me. My bad. It should be fixed now.

Another update: By popular demand, here are the code chunks.

library(ggplot2)
library(gWidgetstcltk)

chromium <- read.csv("chromium.csv")
nickel <- read.csv("nickel.csv")

p <- ggplot(chromium, aes(air, bm)) +
  geom_point()

win_ctrls <- gwindow("Plot controls 1-4")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
  Linear      = "identity", 
  Log           = "log10"
)   
  
frm_scale_trans_y <- gframe(
  "Y scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_y <- gradio(
  names(available_scales), 
  container = frm_scale_trans_y,
  handler = function(h, ...)
  {
    scale_trans_y <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_y_continuous(
        trans = scale_trans_y
      )   
    print(p)  
  }
)

frm_scale_trans_x <- gframe(
  "X scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_x <- gradio(
  names(available_scales), 
  container = frm_scale_trans_x,
  handler = function(h, ...)
  {
    scale_trans_x <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_x_continuous(
        trans = scale_trans_x
      )   
    print(p)  
  }
)

#2 Changing facets
facet_choices <- list(
  None = ". ~ .",
  "RPE" = ". ~ rpe",
  "Welding type" = ". ~ welding.type",
  "RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
  "Faceting", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_facets <- gradio(
  names(facet_choices), 
  container = frm_facets,
  handler = function(h, ...)
  {
    facet_formula <- 
      facet_choices[[svalue(h$obj)]]
    p <<- p + 
      facet_grid(facet_formula)
    print(p) 
  }
)

#3 Changing the dataset                     
frm_datasets <- gframe(
  "Datasets", 
  container = grp_ctrls,
  expand = TRUE
)  

cmb_datasets <- gcombobox(
  c("chromium", "nickel"), 
  container = frm_datasets,
  handler = function(h, ...)
  {
    p <<- p %+% get(svalue(h$obj)) 
    print(p)  
  }
)

#4 Changing the title
frm_title <- gframe(
  "Plot title", 
  container = grp_ctrls,
  expand = TRUE
)  

txt_title <- gedit(
  p$options$title,
  container = frm_title,
  handler = function(h, ...)
  {  
    p <<- p + opts(title = svalue(txt_title))
    print(p)
  }
)   

library(ggplot2)
library(gWidgetstcltk)

chromium <- read.csv("chromium.csv")
nickel <- read.csv("nickel.csv")

p <- ggplot(chromium, aes(air, bm)) +
  geom_blank()
  
print_p <- function()
{
  pp <- get("p", envir = globalenv())
  if(svalue(chk_points)) pp <- pp + geom_point()  
  if(svalue(chk_smooth)) pp <- pp + geom_smooth()
  print(pp)
}

win_ctrls <- gwindow("Plot controls 1-5")
grp_ctrls <- ggroup(container = win_ctrls, horizontal = FALSE)

#1 Changing scales
available_scales <- c(
  Linear      = "identity", 
  Log           = "log10"
)   
  
frm_scale_trans_y <- gframe(
  "Y scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_y <- gradio(
  names(available_scales), 
  container = frm_scale_trans_y,
  handler = function(h, ...)
  {
    scale_trans_y <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_y_continuous(
        trans = scale_trans_y
      )   
    print_p()  
  }
)

frm_scale_trans_x <- gframe(
  "X scale transformation", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_scale_trans_x <- gradio(
  names(available_scales), 
  container = frm_scale_trans_x,
  handler = function(h, ...)
  {
    scale_trans_x <- available_scales[svalue(h$obj)]  
    p <<- p + 
      scale_x_continuous(
        trans = scale_trans_x
      )   
    print_p()  
  }
)

#2 Changing facets
facet_choices <- list(
  None = ". ~ .",
  "RPE" = ". ~ rpe",
  "Welding type" = ". ~ welding.type",
  "RPE and Welding type" = "rpe ~ welding.type"
)

frm_facets <- gframe(
  "Faceting", 
  container = grp_ctrls,
  expand = TRUE
)  

rad_facets <- gradio(
  names(facet_choices), 
  container = frm_facets,
  handler = function(h, ...)
  {
    facet_formula <- 
      facet_choices[[svalue(h$obj)]]
    p <<- p + 
      facet_grid(facet_formula)
    print_p() 
  }
)

#3 Changing the dataset                     
frm_datasets <- gframe(
  "Datasets", 
  container = grp_ctrls,
  expand = TRUE
)  

cmb_datasets <- gcombobox(
  c("chromium", "nickel"), 
  container = frm_datasets,
  handler = function(h, ...)
  {
    p <<- p %+% get(svalue(h$obj)) 
    print_p()  
  }
)

#4 Changing the title
frm_title <- gframe(
  "Plot title", 
  container = grp_ctrls,
  expand = TRUE
)  

txt_title <- gedit(
  p$options$title,
  container = frm_title,
  handler = function(h, ...)
  {  
    p <<- p + opts(title = svalue(txt_title))
    print_p()
  }
)

#5 Changing geoms
frm_geoms <- gframe(
  "Add/remove geoms", 
  container = grp_ctrls,
  expand = TRUE
)  

chk_points <- gcheckbox(
  "Points", 
  container = frm_geoms,
  handler = function(h, ...) print_p()  
)      
chk_smooth <- gcheckbox(
  "Smooth curve", 
  container = frm_geoms,
  handler = function(h, ...) print_p()  
)     

Stop! (In the name of a sensible interface)

12th August, 2011 1 comment

In my last post I talked about using the number of lines in a function as a guide to whether you need to break it down into smaller pieces. There are many other useful metrics for the complexity of a function, most notably cyclomatic complexity, which tracks the number of different routes that code can take. It’s non-trivial to calculate such a measure, and it seems that there is nothing currently available to calculate it for R functions. (The internet is curently on the case.) For now, we’ll use an easier, simpler measure of the complexity of a function: how many times if, ifelse or switch is called.

Let’s take a look at how complex the contents of base R are. First, as in the previous post, we need to retrieve all the functions. Since I seem to be trying to do this regularly, I’m wrapping the code into a function.

get_all_fns <- function(pattern = ".+")
{
  fn_names <- apropos(pattern)
  fns <- lapply(fn_names, get)
  names(fns) <- fn_names
  Filter(is.function, fns)
}
fns <- get_all_fns()

As before, we use deparse to turn the function’s body into an array of strings to examine. This time, we are looking for calls to if, ifelse or switch.

   
get_complexity <- function(fn) 
{
  body_lines <- deparse(body(fn))  
  flow <- c("if", "ifelse", "switch")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  body_lines <- body_lines[grepl(rx, body_lines)]
  length(body_lines)
}
complexity <- sapply(fns, get_complexity)

Let’s take a look at the distribution of this complexity measure.

     
library(ggplot2)
hist_complexity <- ggplot(data.frame(complexity = complexity), aes(complexity)) + 
  geom_histogram(binwidth = 3) 
hist_complexity 

Histogram of complexity, by number of calls to if, ifelse or switch

Zero cases is the most common, which is nice to see, but we have some serial offenders over on the right hand side of the plot. Let’s see who the culprits are.

head(sort(complexity, decreasing=TRUE))
       library          arima    help.search       read.DIF         coplot [<-.data.frame 
            84             81             71             66             65             63

Hmm, it’s the same set of functions from the monster-function list before. This is to be expected in some ways, though it would be nicer if we had another measure to pick out dubious functions. One such measure that springs to mind is the number of exceptions that can be thrown. This is quite a subtle measure to read, since in general, code should “fail early and fail often”. That is, you want lots of exceptions to catch any problems, and you want them to be thrown as soon as possible, so you don’t waste time calculating things that were going to fail anyway. Thus more possible exceptions is better, except that too many means that if so many things can go wrong, then your function is too complicated.

Finding the number of possible exceptions works exactly the same as our previous example, only this time we look for calls to stop and stopifnot.

   
get_n_exceptions <- function(fn) 
{
  body_lines <- deparse(body(fn))  
  flow <- c("stop", "stopifnot")
  rx <- paste(flow, " *\\(", collapse = "|", sep = "")
  body_lines <- body_lines[grepl(rx, body_lines)]
  length(body_lines)
}
n_exceptions <- sapply(fns, get_n_exceptions)

Once again we examine the distribution …

  
hist_exceptions <- ggplot(data.frame(n_exceptions = n_exceptions), aes(n_exceptions)) + 
  geom_histogram(binwidth = 1) 
hist_exceptions 

Histogram of number of exceptions

and it seems that most code contains no exception throwing code. This is acceptable for non-user facing functions, since user input is the biggest cause of problems.

head(sort(n_exceptions, decreasing=TRUE)) 
      read.DIF        library [<-.data.frame          arima         arima0        glm.fit 
            17             16             15             14             13             13           

The function with the most potential exceptions to throw is read.DIF. File handling is notoriously problematic, so that’s fair enough. Load the survival package for a better example. The Surv function lets you define a censored vector, and it has an interface that’s either really clever or stupidly complicated. You can specify the censoring in many different ways, so the error checking gets rather complicated, and then it requires 20 calls to stop to prevent disaster.

So when you are writing a function and you see the 20th call to stop, that’s a hint that you may need to stop (if you want a sensible interface).

Monster functions (Raaargh!)

12th August, 2011 Leave a comment

It’s widely considered good programming practice to have lots of little functions rather than a few big functions. The reasons behind this are simple. When your program breaks, it’s much nicer to debug a five line function than a five hundred line function. Additionally, by breaking up your code into little chunks, you often find that some of those chunks are reusable in other contexts, saving you re-writing code in your next project. The process of breaking your code down into these smaller chunks is called refactoring.

The concept of a line of code is surprisingly fluid in R. Since you can add whitespace more or less where you like, the same code can take up one line in your editor of hundreds, if you so choose. Assuming that most programmers will write in a reasonably standard way, we can get a rough idea of how many lines there are in an R function by calling deparse on its body. deparse is less scary than it sounds. Parsing means turning a load of text into something meaningful; thus deparsing means turning something meaningful into a load of text. deparse essentially works like as.character for expressions. (Actually, you can call as.character on expressions, but the results are often dubious.)

A very interesting question is “how much of base R could do with refactoring into smaller pieces?”. To answer this, our first task is to get all the functions.

 
fn_names <- apropos(".+")
fns <- lapply(fn_names, get)
names(fns) <- fn_names
fns <- Filter(is.function, fns)       

apropos finds all the functions on your search path (i.e., from all the packages that have been loaded). Try this code with a freshly loaded version of R, and again with all your packages loaded. The function below will do that for you.

load_all_packages <- function()
{
  invisible(sapply(
    rownames(installed.packages()),
    require,
    character.only = TRUE
  ))
}
load_all_packages()

The number of lines in each function is very straightforward to get from here.

n_lines_in_body <- function(fn) 
{
  length(deparse(body(fn)))
}
n_lines <- sapply(fns, n_lines_in_body)

Let’s take a look at the distribution of those lengths.

     
library(ggplot2)
hist_line_count <- ggplot(data.frame(n_lines = n_lines), aes(n_lines)) + 
  geom_histogram(binwidth = 5) 
hist_line_count

Histogram of number of lines in functions
So about half the functions are five lines or less, which is all well and good. Notice that the x-axis extends all the way over to 400 though, so there clearly are some monsters in there.

head(sort(n_lines, decreasing = TRUE))

      library         arima   help.search        coplot loadNamespace       plot.lm 
          409           328           320           316           305           299

So library is the number one culprit for being over long and complicated. In fairness to it though, it does mostly consist of sub-functions, so there clearly has been a lot of refactoring done on it; its just that the individual bits are contained within it rather than elsewhere. arima is more of a mess; it looks like the code is so old that no-one dare touch it anymore. None of these functions are really bad though. To see a package that really need some refactoring work, load up Hmisc and rerun this analysis. Now eight of the top ten longest functions come from this package. Quick challenge for you: hunting through other packages, can you find a function that beats Hmisc’s transcan at 591 lines?

Tags: ,

The Stats Clinic

27th July, 2011 Leave a comment

Stats clinic logo
Here at HSL we have a lot of smart kinda-numerate people who have access to a lot of data. On a bad day, kinda-numerate includes myself, but in general I’m talking about scientists who have have done an introductory stats course, but not much else. When all you have is a t-test, suddenly everything looks like two groups of normally distributed numbers that you need to know how significantly different their means are.

While we have a pretty good cross-disciplinary setup here, the ease of calculating a mean here or a standard deviation there means that many scientists can’t resist a piece of the number crunching action. Then suddenly there’s an Excel monstrosity that nobody understands rearing its ugly head.

Management has enlightenedly decided to fund a stats clinic, so us number nerds can help out the rest of the lab without any paperwork overhead (which was the biggest reason to put off asking for help). They didn’t like my slogan, but hey, you can’t have everything.

I’m really interested to hear how other organisations deal with this issue. Let me know in the comments.

Tags: ,

The method in the mirror: reflection in R

17th July, 2011 Leave a comment

Reflection is a programming concept that sounds scarier than it is. There are three related concepts that fall under the umbrella of reflection, and I’ll be surprised if you haven’t come across most of these code ideas already, even if you didn’t know it was called reflection.

The first concept is examination of your variables. In R, this mostly means calling class, attributes, str and summary. S4 classes also have the function showMethods to, ahem, show their methods.

The second concept is accessing variables by name; in R this means calling get or getAnywhere, the latter being used for functions that aren’t exported from a package namespace. (Start by using get; if that doesn’t work, try getAnywhere.)

#without reflection
mean(1:5) 

#with reflection
get("mean")(1:5)
getAnywhere("mean")(1:5)

The main use of this is with functions that return names of functions, like ls. For example, to retrieve every local variable in list form, use

lapply(ls(), get)

Again, it’s a tiny bit more complicated for S4 classes, which have a variety of extra functions for inspecting them. There’s getMethod, getSlots and a bunch of other functions. Try apropos("^get") to find them.

The third concept is to evaluate code in string form. The mean example from above becomes

eval(parse(text = "mean(1:5)"))

A word of warning about this last concept. It is very powerful, but also one of the easiest ways to write completely incomprehensible buggy code. Don’t use it, except in the last resort.

So there you have it, reflection in three easy steps.

Tags: ,

Testing for valid variable names

4th July, 2011 Leave a comment

I have something a fondness for ridiculous variable names, so it’s useful to be able to check whether my latest concoction is legitimate. More so if it is automatically generated.

Not having an is_valid_variable_name function is one of those odd omissions from R, and the assign function doesn’t check validity.

To recap, there are a few rules on what makes a valid variable name.  From ?name

Names are limited to 10,000 bytes (and were to 256 bytes inversions of R before 2.13.0).

The logic for this is pretty easy to deal with, but before I come to that, a note on the structure of is* type functions. In scalary languages (C and it’s descendents), these functions seem to be standardised along the lines of

is_something <- function(x)
{
  if(!some_condition) return(FALSE)
  if(!some_other_condition) return(FALSE)
  #etc.
  return(TRUE)
}

The advantage of this is that as soon as a condition fails, the function returns, so the function can be fast. In a vectory languages like R, things aren’t quite as clean since different elements can fail on different conditions. The nearest equivalent function structure that I’ve come up with is something like:

is_something <- function(x)
{
  ok
  ok[!some_condition] <- FALSE
  ok[!some_other_condition] <- FALSE
  #etc.
  ok
}

So, back to our is_valid_variable_name function. The first condition is easy to implement.

is_valid_variable_name <- function(x) 
{
  ok
  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  #More logic still to come
}

Now it gets trickier. In ?make.names we have

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ‘”.2way”’ are not valid, and neither are the reserved words.

When you read this, your first thought should be “regular expressions will save the day“. The trouble is, regular expressions that are that complicated are hard to write and hard to understand. Which means that you need *lots* of testing to make sure that they are correct.

In the spirit of laziness I decided to see if someone else had done the legwork. It transpires that someone has (yey CRAN). The MSToolkit package contains a function validNames which tries to solve the problem with one big regex. Unfortunately (as of version 2.0) it doesn’t always work. Here’s the regex that that function uses.

"^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"

That translates as: start with (“^”) a dot (“\\.”) that is optional (“?”), followed by a letter (“[a-zA-Z]“), then zero or more (“*”) dots, letters or numbers (“[\\.0-9a-zA-Z]“), then finish (“$”).

The first that pops into my mind when I see this is “what do French R programmers do?”. That is, we can define variables with accented characters áçöíþ <- 1 that the regex a-zA-Z won’t pick up. there’s an easy fix here that nearly always works. We swap 0-9a-zA-Z for [:alnum:] and voila! Locale dependent letter and number matching. This isn’t quite perfect since, for example, in my UK English locale, I can define variables with greek letters µ but the “alpha” regex won’t match them.

grepl("[[:alpha:]]", "µ") # FALSE

Glossing over the small letter matching issues for now, there are bigger problems with the MSToolkit regex.

Underscores aren't permitted...

validNames("foo_bar")  #throws error

and neither are names consisting only of dots...

validNames("..")       #throws error

but many of the reserved words (see ?Reserved for the list) are:

validNames("if")       #TRUE

I don't want to discredit the authors of MSToolkit – writing complex regexes is a difficult task. What we need is an easier approach. Lots of smaller regexes for individual cases are easier to understand. One other tiny complication: the ellipsis argument, ..., and two dots followed by a number (which refers to the elements of the ellipsis) are valid variable names, but are reserved, so sometimes you want to think of them as valid, and sometimes you don't.

is_valid_variable_name <- function(x, allow_reserved = TRUE)  
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE     
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE   
  }

  #is it a reserved word?
  reserved_words <- c("if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_")
  ok[grepl(paste(reserved_words, collapse = "|"), x)]
  
  #are there any illegal characters?
  ok[!grepl("^[[:alnum:]_.]+$", x)] <- FALSE

  #does it start with underscore?
  ok[grepl("^_", x)] <- FALSE

  #does it start with dot then a number?
  ok[grepl("^\\.[[:digit:]]", x)] <- FALSE

  ok
}

So now we have lots of easier conditions to check. I was pretty pleased with myself after constructing this until I realised that the best way to solve this was to cheat. make.names, that I mentioned earlier, contains logic to check for valid variable names, so if a variable name is valid, then x will be the same as make.names(x). As a bonus, we can easily check for unique variable names.

is_valid_variable_name <- function(x, allow_reserved = TRUE, unique = FALSE) 
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE     
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE   
  }

  #are names valid (and maybe unique)
  ok[x != make.names(x, unique = unique)] <- FALSE

  ok
}

While this answer isn't quite as satisfactory because you can't see what's going on, it has the advantages that the locale-dependent letter problem vanishes, and if the specification for variable names changes, then make.names will hopefully be updated to match it. And that makes it good enough for me.

Tracking execution paths

18th June, 2011 Leave a comment

Earlier this week, I was trying to figure out the path of execution through a big chunk of code. Once you reach a certain size of codebase, tracking which function gets called when can be tricky.

My first thought for dealing with this was to add a message line at the start of each function that I wanted to track. (Note: message, not cat!)

f <- function(...)
{
  message("In 'f'")
  #the function contents...
}

f()

Of course, I don’t always want to track these functions, so to save commenting and uncommenting lines of code, each function needed a verbose argument, to make it optional.

f <- function(..., verbose = getOption("verbose"))
{
  if(verbose) message("In 'f'")        
  #the function contents...
}

f(verbose = TRUE)

This code was still a little painful, because I had to change the name of each the function inside each message. Ideally, the message should just know what function it was inside. To get this, some examination of the call stack by sys.call was required.

function_name <- function(which = -1L)
{
  sc <- sys.call(which); 
  as.character(substitute(sc))[1L]
}

entry_message <- function(verbose = getOption("verbose"))
{
  if(verbose) 
  {
    fname <- function_name(-2L)
    message("In", dQuote(fname))
  }
}

f <- function(..., verbose = getOption("verbose"))
{
  entry_message(verbose)        
  #the function contents...
}

f(verbose = TRUE)

This worked, and was fast and was relatively easy to set up, since it is the same code that is being added to each function, making tracking multiple functions easy, as shown in the more complicated (and convoluted) example below.

f <- function(i = 1L, verbose = getOption("verbose"))
{
  entry_message(verbose)        
  if(i > 5) return()
  g(i, verbose)
}

g <- function(i, verbose)
{
  entry_message(verbose)        
  if(i < 3L) h(i, verbose) else f(i + 1L, verbose)
}

h <- function(i, verbose)
{
  entry_message(verbose)        
  g(i + 1L, verbose)
}

f(verbose = TRUE)

It still bugged me though. The fact that analytical code was being polluted with logging code didn’t seem semantically right. What I really wanted was a way to track these functions without inserting any code into them at all. An hour or so of digging about in the manuals and I discovered that I didn’t want to track the functions. In R parlance, I wanted to trace them. And inevitably, there was a trace function to do just that. trace has two different implementations; which version you get depends upon whether more than one argument is passed.

#Two argument version:
trace(c("f", "g", "h"), {})
f(verbose = FALSE)

#One argument version only works for a single function
#so it needs to be wrapped in lapply
lapply(c("f", "g", "h"), trace)
f(verbose = FALSE)

If you want to trace lots of functions, it may be easier to specify them with a regular expression, via apropos.

library(ggplot2)
trace(apropos("geom_"), {})
p <- ggplot(diamonds, aes(carat, price)) + 
  geom_point() + 
  geom_smooth()

Tags: , ,

A clock utility, via console hackery

11th May, 2011 Leave a comment

A discussion on StackOverflow today shows an interesting use of special characters inside the cat function.

The most common special characters that you may have come across are the tab and newline characters, represented by \t and \n respectively. Try them for yourself.

cat("Red\tlorry\nYellow\tlorry\n")

cat also respects the backspace character, \b, and the carriage return character, \r, which means that you can delete things. Usually, this isn’t very useful, but it allows us to overwrite text in the console.

cat("ab\bc")      #\b removes the previous character
cat("abc\rde")    #\r removes everything to the start of the line

Here’s a simple clock utility, adapted from Mark, Zach and DWin’s code in the linked question, that uses this technique.

clock <- function(format = "%H:%M:%S", refresh = 1)
{
  repeat
  {
    cat("\r", format(Sys.time(), format), sep = "")
    flush.console()
    Sys.sleep(refresh)
  }
}
clock()
clock("%A %d %B %Y %I:%M:%OS3 %p", 1e-3)

Press escape to exit the clock utility. You can see a complete list of special characters over at asciitable.com.

EDIT: clock function now with customisable formatting.
ANOTHER EDIT: Refresh rate now updateable; extra example.

Tags: , ,

Friday Function: nclass

6th May, 2011 2 comments

When you draw a histogram, an important question is “how many bar should I draw?”. This should inspire an indignant response. You didn’t become a programmer to answer questions, did you? No. The whole point of programming is to let your computer do your thinking for you, giving you more time to watch videos of fluffy kittens.

Fortunately, R contains three functions to automate the answer, namely nclass.Sturges, nclass.scott and nclass.FD. (FD is short for Freedman-Diaconis; watch out for the fact that scott isn’t capitalised.)

The differences depend upon length and spread of data. For longer vectors, Scott and Freedman-Diaconis tend to give bigger answers.

short_normal <- rnorm(1e2) 
nclass.Sturges(short_normal)      #8
nclass.scott(short_normal)        #8
nclass.FD(short_normal)           #12

long_normal <- rnorm(1e5) 
nclass.Sturges(long_normal)       #18
nclass.scott(long_normal)         #111
nclass.FD(long_normal)            #144

For strongly skewed data, you are best to use some sort of transformation before you draw a histogram, but for the record, Freedman-Diaconis again gives bigger answers for highly skewed (and thus wider) vectors.

short_lognormal <- rlnorm(1e2) 
nclass.Sturges(short_lognormal)   #8
nclass.scott(short_lognormal)     #9
nclass.FD(short_lognormal)        #20

long_lognormal <- rlnorm(1e5) 
nclass.Sturges(long_lognormal)    #18
nclass.scott(long_lognormal)      #443
nclass.FD(long_lognormal)         #1134

My feeling is that since each of the three algorithms is rather dumb, it is safest to calculate all three, then pick the middle one.

nclass.all <- function(x, fun = median)
{
  fun(c(
    nclass.Sturges(x), 
    nclass.scott(x),
    nclass.FD(x)
  ))
}

log_islands 
hist(log_islands, breaks = nclass.all(log_islands))

I also wrote a MATLAB implementation of this a couple of years ago.

It is worth noting that ggplot2 doesn’t accept a number-of-bins argument to geom_histogram, because

In practice, you will need to use multiple bin widths to
discover all the signal in the data, and having bins with
meaningful widths (rather than some arbitrary fraction of the
range of the data) is more interpretable.

That’s fine if you are interactively exploring the data, but if you want a purely automated solution, then you need to make up a number of bins.

calc_bin_width <- function(x, ...)
{
  rangex <- range(x, na.rm = TRUE)
  (rangex[2] - rangex[1]) / nclass.all(x, ...)
}

p <- ggplot(movies, aes(x = votes)) +
  geom_histogram(binwidth = calc_bin_width(log10(movies$votes))) + 
  scale_x_log10()
p

Tags: , ,

Friday function triple bill: with vs. within vs. transform

29th April, 2011 7 comments

When you first learnt about data frames in R, I’m sure that, like me, you thought “This is a lot of hassle having to type the names of data frames over and over in order to access each column”.

library(MASS)
anorexia$wtDiff <- anorexia$Postwt - anorexia$Prewt #I have to type anorexia how many times?

Indeed, any time you see chunks of code repeated over and over, there’s an indication that they need rewriting. Thus the first time you discovered the attach function was a blissful moment. Ah, the hours you would save by not typing variable names! Alas, those hours were more than made up for by the hundreds of hours you spent debugging impenetrable buggy code that was a side effect of attach.

attach(anorexia)
anorexia$wtDiff <- Postwt - Prew   #Deliberate typo!
detach(anorexia)

In the above snippet of code, the typo causes execution to halt after the second line, so the call to detach never happens. Then, after you’ve fixed the typo and run the code again, anorexia is on your search path twice. This is problematic because when you detach it, there is still a copy of the data frame on the search path. Cue wailing and gnashing of teeth as you waste half an hour trying to find the bug.

Today we’re going to look at three functions that let you manipulate data frames, without the nasty side-effects of attachwith, within and transform.

For adding (or overwriting) a column to a data frame, like in the above example, any of the three functions is perfectly adequate; they just have slightly different syntaxes. with often has the most concise formulation, though there isn’t much in it.

anorexia$wtDiff <- with(anorexia, Postwt - Prewt)
anorexia <- within(anorexia, wtDiff2 <- Postwt - Prewt)
anorexia <- transform(anorexia, wtDiff3 = Postwt - Prewt)

For multiple changes to the data frame, all three functions can still be used, but now the syntax for with is more cumbersome. I tend to favour within or transform in these situations.

fahrenheit_to_celcius <- function(f) (f - 32) / 1.8
airquality[c("cTemp", "logOzone", "MonthName")] <- with(airquality, list(
  fahrenheit_to_celcius(Temp),
  log(Ozone),
  month.abb[Month]
)) 
airquality <- within(airquality, 
{
  cTemp2     <- fahrenheit_to_celcius(Temp)
  logOzone2  <- log(Ozone)
  MonthName2 <- month.abb[Month]
}) 
airquality <- transform(airquality, 
  cTemp3     = fahrenheit_to_celcius(Temp),
  logOzone3  = log(Ozone),
  MonthName3 = month.abb[Month]
)

The most important lesson to take away from this is that if you are manipulating data frames, then with, within and transform can be used almost interchangeably, and all of them should be used in preference to attach. For further refinement, I prefer with for single updates to data frames, and within or transform for multiple updates.

Follow

Get every new post delivered to your Inbox.

Join 43 other followers