Archive

Author Archive

Benford’s Law and fraud in the Russian election

5th March, 2012 3 comments

Earlier today Ben Goldacre posted about using Benford’s Law to try and detect fraud in the Russian elections. Read that now, or the rest of this post won’t make sense. This is a loose R translation of Ben’s Stata code.

The data is held in a Google doc. While it is possible to directly retrieve the contents with R, for a single document it is easier to save it a CSV, and load it from your own machine.

russian <- read.csv("Russian observed results - FullData.csv")

There are loads of ways of manipulating data and plotting it in R, and while you can do everything in the base R distribution, I’m going to use a few packages to make it easier.

library(reshape)
library(stringr)
library(ggplot2)

A little transformation is needed. We take only the columns containing the counts and manipulate the data into a “long” format with only one value per row.

russian <- melt(
    russian[, c("Zhirinovsky", "Zyuganov", "Mironov", "Prokhorov", "Putin")], 
    variable_name = "candidate"
)

Now we add columns containing the first and last digits, extracted using regular expressions.

russian <- ddply(
    russian, 
    .(candidate), 
    transform, 
    first.digit = str_extract(value, "[123456789]"),
    last.digit  = str_extract(value, "[[:digit:]]$"))

The table function gives us the counts of each number, and we compare these against the counts predicted by Benford’s Law.

first_digit_counts <- as.vector(table(russian$first.digit))
first_digit_actual_vs_expected <- data.frame(
  digit            = 1:9,
  actual.count     = first_digit_counts,    
  actual.fraction  = first_digit_counts / nrow(russian),
  benford.fraction = log10(1 + 1 / (1:9))
)

The counts of the last digit can be obtained in a similar way.

last_digit_counts <- as.vector(table(russian$last.digit))
last_digit_actual_vs_expected <- data.frame(
    digit     = 0:9,
    count     = last_digit_counts,    
    fraction  = last_digit_counts / nrow(russian)
)
last_digit_actual_vs_expected$cumulative.fraction <- cumsum(last_digit_actual_vs_expected$fraction)

Here is the line graph…

a_vs_e <- melt(first_digit_actual_vs_expected[, c("digit", "actual.fraction", "benford.fraction")], id.var = "digit")
(fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
    geom_line() +
    scale_x_continuous(breaks = 1:9) +
    scale_y_continuous(formatter = "percent") +
    ylab("Counts with this first digit") +
    opts(legend.position = "none")
)

Fig 1. Actual percentages of first digits vs. those predicted by Benford's Law

and the histogram

(fig2_hist <- ggplot(russian, aes(value)) +
    geom_histogram(binwidth = 20)
)

Fig 2. Histogram of vote counts in the Russian election

Radical Statistics was radical

25th February, 2012 Leave a comment

Today I went to the Radical Statistics conference in London. RadStats was originally a sort of left wing revolutionary group for statisticians, but these days the emphasis is on exposing dubious statistics by companies and politicians.

Here’s a quick rundown of the day.

First up Roy Carr-Hill spoke about the problems with trying to collect demographic data and estimating soft measures of societal progress like wellbeing. (Household surveys exclude people not in households, like the homeless soldiers and old people in care homes; and English people claim to be 70% satisfied regardless of the question.)

Next was Val Saunders who started with a useful debunking of done methodological flaws in schizophrenia research, then blew it by detailing her own methodologically flaws research and making overly strong claims to have found the cause of that disease.

Aubrey Blunsohn and David Healy both talked about ways that the pharmaceutical industry fudges results. The list was impressively long, leading me to suspect that far to many people have spent far too long thinking of ways to game the system. The two main recommendations that resonated with me were to extend the trials register to phase 1 trials to avoid unfavourable studies being buried and for raw data to be made available for transparent analysis. Pipe dreams.

After lunch Prem Sikka pointed out that tax avoidance isn’t just shady companies trying to scam the system, but actually accountancy firms pay people to dream up new wheezes and sell them to those companies.

Ann Pettifor and final speaker Howard Reed had similar talks evangelising Keynesian stimulus (roughly, big government spending in times of recession) for the UK economy amongst some economic myth debunking. Thought provoking, though both speakers neglected to mention the limitations of such stimuli – you have to avoid spending in pork barrel nonsense (see Japan in the 90s, that buy-a-banger scheme in the UK in 2009) and you have to find a ways to turn of the taps w when recession is over.

The other speaker was Allyson Pollack who discussed debunking a dubious study by Zac Cooper claiming that patients being allowed to choose their surgeon improved success rated treating acute myocardial infarction. Such patients are generally unconscious while having their heart attack so out was inevitably nonsense.

Overall a great day.

GUI building in R: gWidgets vs Deducer

20th February, 2012 4 comments

I’ve been a user (and fan) of gWidgets for a couple of years now for GUI building in R. (See my introduction to it here.) However, it’s always good to check out the competition so I’ve been playing around with Deducer to see how they compare.

R can access a number of GUI building frameworks including tcltk, GTK, qt, and Java, not to mention HTML. gWidgets’ big selling point is that is provides a high-level wrapper to all the R wrappers for each framework, so you can write code in a toolkit independent way. Switching between tcltk and GTK and qt won’t often be that useful, but if you think you might want to move from a desktop based GUI to a web app, it makes the transition easier. By contrast, Deducer based upon the rJava, and provides access to the Java Swing framework. It’s a slightly lower level library (which means you have to write more lines of code to achieve the same thing), but since you get full access to Swing, it’s a little more flexible. Deducer also has some features to integrate your GUIs with JGR, so if you use that for running R, it’s perhaps the most natural choice.

To test the two frameworks, I wrote a small GUI for running the Kolmogorov-Smirnoff test (that one of the ones for checking whether or not a variable seems to have been sampled from a particular distribution). Take a look at the code below to see the comparison. (Regular reader may notice I’ve switched from my usual under_casing to camelCasing. Both the frameworks use this style, so I thought I’d follow suit for cleanliness.)

First, here are some common variables (labels and the like).

#Some sample data to test against
x1 <- rnorm(100)
x2 <- runif(100)

#Widget labels
labelX <- "Variable name for data: "
labelY <- "Distribution to compare to: "
labelAlternative <- "One or two sided test?: "
labelP <- "The p-value is: "

#Choices for comboboxes
choicesAlternative <- eval(formals(ks.test)$alternative)
distributions <- c(
    normal = pnorm, 
    exponential = pexp,
    F = pf,
    "log-normal" = plnorm,
    "Student's t" = pt,
    uniform = punif
)

This is the gWidgets GUI

createKsTestGwidgets <- function()
{
  library(gWidgetstcltk)
  options(guiToolkit = "tcltk")
  win <- gwindow("KS Test, gWidgets edition", visible = FALSE)
  
  frmX <- gframe("x", container = win)
  lblX <- glabel(labelX, container = frmX)
  txtX <- gedit(container = frmX)
  
  frmY <- gframe("y", container = win)
  lblY <- glabel(labelY, container = frmY)
  cmbY <- gcombobox(names(distributions), container = frmY)
  
  frmAlternative <- gframe("alternative", container = win)
  lblAlternative <- glabel(labelAlternative, container = frmAlternative)
  cmbAlternative <- gcombobox(choicesAlternative, container = frmAlternative)
  
  btnCalc <- gbutton("Calculate", container = win,
      handler = function(h, ...)
      {
        x <- get(svalue(txtX), mode = "numeric")
        y <- distributions[[svalue(cmbY)]]
        alternative <- svalue(cmbAlternative)
        ans <- ks.test(x, y, alternative = alternative)
        svalue(txtP) <- format(ans$p.value, digits = 3)
      }
  )
  frmResults <- gframe("results", container = win)
  lblP <- glabel(labelP, container = frmResults)
  txtP <- gedit(container = frmResults)
  visible(win) <- TRUE
  
}

createKsTestGwidgets()

…and here’s the Deducer equivalent.

createKsTestDeducer <- function()
{
  library(Deducer)
  win <- new(RDialog)
  win$setSize(300L, 500L)
  win$setTitle("KS TEST, Deducer edition")
  
  JLabel <- J("javax.swing.JLabel")
  lblX <- new(JLabel, labelX)
  addComponent(win, lblX, 1, 1000, 50, 1, rightType = "REL")
  txtX <- new(TextAreaWidget, "x")
  addComponent(win, txtX, 51, 1000, 150, 1, rightType = "REL")
  
  lblY <- new(JLabel, labelY)
  addComponent(win, lblY, 151, 1000, 200, 1, rightType = "REL")
  
  cmbY <- new(ComboBoxWidget, names(distributions))
  cmbY$setDefaultModel(names(distributions)[1])
  addComponent(win, cmbY, 201, 1000, 300, 1, rightType = "REL")
  
  lblAlternative <- new(JLabel, labelAlternative)
  addComponent(win, lblAlternative, 301, 1000, 400, 1, rightType = "REL")
  
  cmbAlternative <- new(ComboBoxWidget, choicesAlternative)
  cmbAlternative$setDefaultModel(choicesAlternative[1])
  addComponent(win, cmbAlternative, 401, 1000, 500, 1, rightType = "REL")
  
  JButton <- J("javax.swing.JButton")
  btnCalc <- new(JButton, "Calculate")
  addComponent(win, btnCalc, 501, 1000, 601, 1, rightType = "REL")
  ActionListener <- J("org.rosuda.deducer.widgets.event.RActionListener")
  listener <- new(ActionListener)
  calculationHandler <- function(cmd, ActionEvent)
  {
    x <- get(txtX$getText())
    y <- distributions[[cmbY$getModel()]]
    alternative <- cmbAlternative$getModel()
    ans <- ks.test(x, y, alternative = alternative)
    print(ans)
    txtP$setText(format(ans$p.value, digits = 3))
  }
  listener$setFunction(toJava(calculationHandler))
  btnCalc$addActionListener(listener)
  
  lblP <- new(JLabel, labelP)
  addComponent(win, lblP, 601, 1000, 650, 1, rightType = "REL")
  
  txtP <- new(TextAreaWidget, "results")
  addComponent(win, txtP, 651, 1000, 750, 1, rightType = "REL")
    
  win$run()
}

createKsTestDeducer()

Note that the Deducer example works perfectly under JGR, though I couldn’t get the button handler to fire when running it from eclipse. This is likely due to my inexperience with the toolkit rather than a fundamental problem with the framework. Many of the lines are more or less a one-to-one comparison, but Deducer requires you to explicitly specify positions of widgets, and is a little more verbose when you come to add event handling logic.

Either or these frameworks is suitable for the obvious use case of GUI building in R (rapid prototyping of front-ends for non technical users, and for teaching demos), so don’t sweat your decision too much.

Edit: Fixed variable name casing issues.

R hits 10000 questions on stackoverflow

17th February, 2012 Leave a comment

R's 10000th question on stackoverflow

A milestone, though not that exciting as questions go. Still, if you haven’t yet joined the cult of Stack Exchange, take a look here.

Tags: ,

Viewing the internals of MATLAB Matrices

31st January, 2012 Leave a comment

A cool undocumented trick I just learnt from The MathWorks’ Bob Gilmore. If you type

format debug

Then printing any vector reveals information about its internal representation. For example:

x = magic(3)

x =


Structure address = 6bc1ab0 
m = 3
n = 3
pr = d8dccf0 
pi = 0
     8     1     6
     3     5     7
     4     9     2

The structure address is the address in memory where the matrix is stored, m and n are the number of rows and columns respectively of the matrix, and pr and pi are pointers to the addresses of the matrices storing the real and imaginary components of the matrix.

One interesting thing to look at is the representation of scalar numbers.

 y = 1

y =


Structure address = 6bc31e0 
m = 1
n = 1
pr = d790b90 
pi = 0
     1

Yep: they are stored in exactly the same way as matrices: in the same way the “everything in R is a vector”, everything in MATLAB is a matrix. To finish up, here are some more examples for you to explore:

% higher dimensional arrays
rand(2, 3, 4)
% cell arrays (unfortunately not that revealing)
{1, magic(3)}
% sparse matrices (very interesting)
sparse(ones(3))

Exploring the functions in a package

26th January, 2012 4 comments

Sometimes it can be useful to list all the functions inside a package. This is done in the same way that you would list variables in your workspace. That is, using ls. The syntax is ls(pos = "package:packagename"), which is easy enough if you can remember it. Unfortunately, I never can, and have to type search() first to see what the format of that string is.

Today, that problem is solved with a tiny utility function to save remembering things, and to save typing.

lsp <- function(package, all.names = FALSE, pattern) 
{
  package <- deparse(substitute(package))
  ls(
      pos = paste("package", package, sep = ":"), 
      all.names = all.names, 
      pattern = pattern
  )
}

all.names and pattern behave in the same way as they do in regular ls. You use it like this:

lsp(base)
lsp(base, TRUE)
lsp(base, pattern = "^is")


EDIT: I’ve had a couple of questions about the use case, and there are some interesting comments on alternatives. My thinking behind this function was that I sometimes know I’ve seen a function in a package but can’t remember what it’s called. If you can hazard a guess at the name, then apropos is probably better, though it looks everywhere on the search path rather than in a particular package. Autocompletion is also useful for this, but you need to know the first few characters of what you are looking for. (Activate autocompletions by pressing TAB in R GUI or Rstudio or CTRL+space in eclipse. I can’t remember what the shortcut is in emacs, but you probably just mash CTRL+META until you have RSI.) Finally, the unknownR package is useful for finding new functions that you hadn’t heard of yet.

Adding metadata to variables

6th January, 2012 Leave a comment

There are only really two ways to preserve your statistical analyses. You either save the variables that you create, or you save the code that you used to create them. In general the latter is much preferred because at some point you’ll realise that your model was wrong, or your dataset has changed, and you need to re-run your analysis. If you only stored your variables then you are now stuck rewriting your code in order to create new versions, which is really not fun. On the other hand, if you saved your code, all your have to do is tweak it and run it.

Occasionally though, just keeping the code and rerunning an analysis isn’t practical. The most obvious case being when it takes a long time. If your model takes more than ten minutes to run, it can be really useful to save its variables as well as the source code.

The problem with saving variables is that when you come back and load them six months later, it isn’t always obvious what they are or where they came from. With code, we solve this by using comments to jog our memory, so it would be nice to have an equivalent for variables. In fact, in R, such a facility exists with the – you guessed it – comment function.

library(lattice)
comment(barley) <- "Immer's barley data, 1934.  The data from the Morris site may have the wrong years."
comment(barley)

The comment function simply stores the string as an attribute of the variable, with some special rules on printing. Other common attributes that you may be familiar with are names for vectors and lists, and dim and dimnames for matrices.

You can find the names of all the attributes of a variable with the attributes function, and get and set individual attributes with attr.

x <- c(apple = 1, banana = 2)
attr(x, "type") <- "fruit"
attributes(x)
attr(x, "names") #same as names(x)

Attributes are really great for storing contextual metadata about a variable. For starters, when you come back to your saved workspace after those six months you might want to know who created the variable and when. To get this facility, we need an enhanced version of assign.

get_user <- function()
{
  env <- if(.Platform$OS.type == "windows") "USERNAME" else "USER"
  unname(Sys.getenv(env))    
}  
  
assign_with_metadata <- function(x, value, ..., pos = parent.frame(), inherits = FALSE)
{
  attr(value, "creator") <- get_user()
  attr(value, "time_created") <- Sys.time()
  more_attr <- list(...)
  attr_names <- names(more_attr)
  for(i in seq_along(more_attr))
  {
    attr(value, attr_names[i]) <- more_attr[[i]]
  }
  assign(x, value, pos = pos, inherits = inherits)
}

assign_with_metadata("x", 1:3, monkey = "chimp")

Notice the ... that allows you to add arbitrary attributes to the variable.

While this is great, and solves the problem, typing assign_with_metadata is way too clunky. It would be much easier if we could just use <- to assign variables and get the metadata for free.

Actually, overriding <- itself is going to lead to slowness and likely errors. Since we don’t want to store metadata for every variable (just the important ones), it is better to define our own operators to do so.

`%<-%` <- function(x, value)
{
  xname <- deparse(substitute(x))
  pos <- parent.frame()
  assign_with_metadata(xname, value, pos = pos)
}

`%<<-%` <- function(x, value) 
{
  xname <- deparse(substitute(x))
  pos <- globalenv()
  assign_with_metadata(xname, value, pos = pos)
}

m %<-% "foo"    #local assignment with metadata
f <- function()
{
  n %<<-% "bar" #global assignment with metadata
}
f()

With these functions, if you want to save your variables for later, simply swap <- for %<-%.

A quick primer on split-apply-combine problems

16th December, 2011 5 comments

I’ve just answered my hundred billionth question on Stack Overflow that goes something like

I want to calculate some statistic for lots of different groups.

Although these questions provide a steady stream of easy points, its such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.

First off, you need to data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for and another column containing the group to which that value belongs. A good example is the InsectSprays dataset, built into R.

head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A

These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.

First, we split the count column by the spray column.

(count_by_spray <- with(InsectSprays, split(count, spray)))

Secondly, we apply the statistic to each element of the list. Lets use the mean here.

(mean_by_spray <- lapply(count_by_spray, mean))

Finally, (if possible) we recombine the list as a vector.

unlist(mean_by_spray)

This procedure is such a common thing that there are many functions to speed up the process. sapply and vapply do the last two steps together.

sapply(count_by_spray, mean)
vapply(count_by_spray, mean, numeric(1))

We can do even better than that however. tapply, aggregate and by all provide a one-function solution to these S-A-C problems.

with(InsectSprays, tapply(count, spray, mean))
with(InsectSprays, by(count, spray, mean))
aggregate(count ~ spray, InsectSprays, mean)

The plyr package also provides several solutions, with a choice of output format. ddply takes a data frame and returned another data frame, which is what you’ll want most of the time. dlply takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.

ddply(InsectSprays, .(spray), summarise, mean.count = mean(count))
dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))

You can read much more on this type of problem and the plyr solution in The Split-Apply-Combine Strategy for Data Analysis, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.

One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the ave function (which provides mean as the default function).

with(InsectSprays, ave(count, spray))

MATLAB’s stand out new feature

6th October, 2011 Leave a comment

It’s been a while since my last MATLAB post, not because I don’t love the language, but more because I do most of my blogging from home, where I have no license, and because (mostly thanks to R-bloggers) I get ten times as many page views for the R posts. (TODO: Create MATLAB-bloggers service.)

Having returned from holiday (it was lovely, thanks for asking) I’ve been trying out the latest release of MATLAB – R2011b. So far, the standout new feature is the automatic variable renaming. If you change the name of a variable at the point where it was declared, then pressing Shift+Enter lets MATLAB rename all other instances. IDEs for statically-typed languages have had this feature for years, but to see it in a dynamically-typed language is very impressive.

MATLAB's variable auto-renaming in action

A Great European Bailout Infographic

8th September, 2011 1 comment

Whenever there’s a financial crisis, I tend to assume that it’s Dirk Eddelbuettel‘s fault, though apparently the EMU debt crisis is more complicated than that. JP Morgan have just released a white paper about the problem, including a Lego infographic of who is asking who for money. Created, apparently, by Peter Cembalest, aged nine. Impressive stuff.

European bailout graph

Found via The Register.

Follow

Get every new post delivered to your Inbox.

Join 43 other followers