Archive for May, 2011

Tufte article in Washington Monthly

16th May, 2011

The Washington Monthly magazine has a long article about graphics guru Edward Tufte. It mostly covers his work on presenting data, with a few snippets about his powerful friends (he has advised both the Bush and Obama administrations), and his current work on recovery.gov.

I still have no idea whether his name is pronounced “Tuft” or “Tufty” or “Tooft”.

A clock utility, via console hackery

11th May, 2011

A discussion on StackOverflow today shows an interesting use of special characters inside the cat function.

The most common special characters that you may have come across are the tab and newline characters, represented by \t and \n respectively. Try them for yourself.

cat("Red\tlorry\nYellow\tlorry\n")

cat also respects the backspace character, \b, and the carriage return character, \r, which means that you can delete things. Usually, this isn’t very useful, but it allows us to overwrite text in the console.

cat("ab\bc")      #\b removes the previous character
cat("abc\rde")    #\r removes everything to the start of the line
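
The same trick gives you a progress counter that redraws itself on a single line. This is only a sketch: show_progress and its arguments are names I have made up for illustration, and how \r is displayed can vary between consoles.

```r
# Hypothetical helper (not from the linked question): print a counter that
# overwrites itself on one console line.
show_progress <- function(n, pause = 0)
{
  for(i in seq_len(n))
  {
    # "\r" sends the cursor back to the start of the line,
    # so each message is drawn over the previous one
    cat("\r", sprintf("Processed %d of %d", i, n), sep = "")
    flush.console()
    Sys.sleep(pause)
  }
  cat("\n")      # finish with a newline so the prompt starts on a fresh line
  invisible(n)
}
show_progress(10, pause = 0.1)
```

Each message here is at least as long as the one before it, so nothing is left behind; with shrinking messages you would need to pad with spaces.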

Here’s a simple clock utility, adapted from Mark, Zach and DWin’s code in the linked question, that uses this technique.

clock <- function(format = "%H:%M:%S", refresh = 1)
{
  repeat
  {
    cat("\r", format(Sys.time(), format), sep = "")
    flush.console()
    Sys.sleep(refresh)
  }
}
clock()
clock("%A %d %B %Y %I:%M:%OS3 %p", 1e-3)

Press escape to exit the clock utility. You can see a complete list of special characters over at asciitable.com.

EDIT: clock function now with customisable formatting.
ANOTHER EDIT: Refresh rate now updateable; extra example.

Friday Function: nclass

6th May, 2011

When you draw a histogram, an important question is “how many bars should I draw?”. This should inspire an indignant response. You didn’t become a programmer to answer questions, did you? No. The whole point of programming is to let your computer do your thinking for you, giving you more time to watch videos of fluffy kittens.

Fortunately, R contains three functions to automate the answer, namely nclass.Sturges, nclass.scott and nclass.FD. (FD is short for Freedman-Diaconis; watch out for the fact that scott isn’t capitalised.)

The differences depend upon the length and spread of the data. For longer vectors, Scott and Freedman-Diaconis tend to give bigger answers.

short_normal <- rnorm(1e2) 
nclass.Sturges(short_normal)      #8
nclass.scott(short_normal)        #8
nclass.FD(short_normal)           #12
long_normal <- rnorm(1e5) 
nclass.Sturges(long_normal)       #18
nclass.scott(long_normal)         #111
nclass.FD(long_normal)            #144
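
To see where the gap comes from, here is a rough sketch of the arithmetic using the textbook versions of the three rules. Sturges only looks at the length of the vector and grows with its logarithm; the other two divide the range of the data by a bar width that shrinks with the cube root of the length. The variable names and the seed are mine, and R's own functions add some rounding and edge-case handling, so the hand-rolled numbers may be off by one.

```r
set.seed(1)                    # arbitrary seed, only so the run is repeatable
x <- rnorm(1e3)
n <- length(x)
spread <- diff(range(x))       # distance from the smallest to largest value

# Sturges: depends only on the length of the data
sturges <- ceiling(log2(n) + 1)

# Scott: bar width based on the standard deviation
scott_width <- 3.5 * sd(x) * n ^ (-1 / 3)
scott <- ceiling(spread / scott_width)

# Freedman-Diaconis: bar width based on the interquartile range
fd_width <- 2 * IQR(x) * n ^ (-1 / 3)
fd <- ceiling(spread / fd_width)

c(sturges = sturges, scott = scott, fd = fd)
c(nclass.Sturges(x), nclass.scott(x), nclass.FD(x))   # built-ins, to compare
```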

For strongly skewed data, it is best to apply some sort of transformation before you draw a histogram, but for the record, Freedman-Diaconis again gives bigger answers for highly skewed (and thus wider) vectors.

short_lognormal <- rlnorm(1e2) 
nclass.Sturges(short_lognormal)   #8
nclass.scott(short_lognormal)     #9
nclass.FD(short_lognormal)        #20
long_lognormal <- rlnorm(1e5) 
nclass.Sturges(long_lognormal)    #18
nclass.scott(long_lognormal)      #443
nclass.FD(long_lognormal)         #1134

My feeling is that since each of the three algorithms is rather dumb, it is safest to calculate all three, then pick the middle one.

nclass.all <- function(x, fun = median)
{
  fun(c(
    nclass.Sturges(x), 
    nclass.scott(x),
    nclass.FD(x)
  ))
}

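log_islands <- log(islands)   # islands is a built-in dataset; the log transform is assumed from the variable name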
hist(log_islands, breaks = nclass.all(log_islands))
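
The fun argument means you are not stuck with the middle value. A small sketch, repeating the function so that it runs on its own, and assuming a natural log of the built-in islands data:

```r
# nclass.all as defined above, repeated so this block is self-contained
nclass.all <- function(x, fun = median)
{
  fun(c(
    nclass.Sturges(x),
    nclass.scott(x),
    nclass.FD(x)
  ))
}

x <- log(islands)            # assumption: natural log of the islands data
nclass.all(x)                # the middle suggestion (the default)
nclass.all(x, fun = min)     # the fewest bars any rule suggests
nclass.all(x, fun = max)     # the most bars any rule suggests
```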

I also wrote a MATLAB implementation of this a couple of years ago.

It is worth noting that ggplot2 doesn’t accept a number-of-bins argument to geom_histogram, because:

“In practice, you will need to use multiple bin widths to discover all the signal in the data, and having bins with meaningful widths (rather than some arbitrary fraction of the range of the data) is more interpretable.”

That’s fine if you are interactively exploring the data, but if you want a purely automated solution, then you need to make up a number of bins.

calc_bin_width <- function(x, ...)
{
  rangex <- range(x, na.rm = TRUE)
  (rangex[2] - rangex[1]) / nclass.all(x, ...)
}

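library(ggplot2)   # newer ggplot2 releases ship movies in the ggplot2movies package
# The bin width is calculated on the log10 scale, to match scale_x_log10
p <- ggplot(movies, aes(x = votes)) +
  geom_histogram(binwidth = calc_bin_width(log10(movies$votes))) +
  scale_x_log10()
p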
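
As a sanity check that needs only base R, the bin width should divide the range of the data back into the chosen number of bins. This sketch repeats both functions so that it runs on its own, and uses the built-in islands data as a stand-in for the movie votes.

```r
# Both functions as defined above, repeated so this block is self-contained
nclass.all <- function(x, fun = median)
{
  fun(c(nclass.Sturges(x), nclass.scott(x), nclass.FD(x)))
}

calc_bin_width <- function(x, ...)
{
  rangex <- range(x, na.rm = TRUE)
  (rangex[2] - rangex[1]) / nclass.all(x, ...)
}

x <- log10(islands)              # stand-in data, already on the log10 scale
width <- calc_bin_width(x)
n_bins <- diff(range(x)) / width # should recover the number of bins
all.equal(n_bins, nclass.all(x))
```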