Benford’s Law and fraud in the Russian election

Earlier today Ben Goldacre posted about using Benford’s Law to try and detect fraud in the Russian elections. Read that now, or the rest of this post won’t make sense. This is a loose R translation of Ben’s Stata code.

The data is held in a Google doc. While it is possible to retrieve the contents directly with R, for a single document it is easier to save it as a CSV and load it from your own machine.

russian <- read.csv("Russian observed results - FullData.csv")

There are loads of ways of manipulating data and plotting it in R, and while you can do everything in the base R distribution, I’m going to use a few packages to make it easier.

library(plyr)     # ddply
library(reshape)  # melt
library(stringr)  # str_extract
library(ggplot2)

A little transformation is needed. We take only the columns containing the counts and manipulate the data into a “long” format with only one value per row.

russian <- melt(
    russian[, c("Zhirinovsky", "Zyuganov", "Mironov", "Prokhorov", "Putin")], 
    variable_name = "candidate"
)
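
To see what the “long” format looks like, here is a minimal sketch on made-up numbers (not the election data). It uses base R's stack() as a stand-in for melt() so that it runs without any packages; the names wide and long are just for illustration.

```r
# Made-up counts for three polling stations (illustration only)
wide <- data.frame(
  Putin    = c(310, 452, 128),
  Zyuganov = c(95, 140, 77)
)

# stack() is base R's simple wide-to-long reshaper; it stands in here
# for reshape::melt() so the example has no package dependencies
long <- stack(wide)
names(long) <- c("value", "candidate")

# One row per (station, candidate) pair: 3 stations x 2 candidates
long
```

Each row of the result holds a single count plus the name of the candidate it belongs to, which is the shape that the later steps expect.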

Now we add columns containing the first and last digits, extracted using regular expressions.

russian <- ddply(
    russian, 
    .(candidate), 
    transform, 
    first.digit = str_extract(as.character(value), "[123456789]"),  # first non-zero digit
    last.digit  = str_extract(as.character(value), "[[:digit:]]$")  # digit at the end
)
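
If the two regular expressions look cryptic, this small sketch shows what they pick out on a handful of made-up counts. It uses base R only; extract_first_match is a hypothetical helper written for this example, intended to behave like str_extract for these simple patterns.

```r
# Hypothetical helper: first regex match per element, NA when nothing matches
extract_first_match <- function(x, pattern) {
  hit <- regexpr(pattern, x)            # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)    # regmatches drops the non-matches
  out
}

# Made-up vote counts, not taken from the election data
toy_counts <- as.character(c(1204, 87, 530, 9, 0, 4410))

first_digit <- extract_first_match(toy_counts, "[123456789]")   # first non-zero digit
last_digit  <- extract_first_match(toy_counts, "[[:digit:]]$")  # digit at the end of the string

data.frame(count = toy_counts, first_digit, last_digit)
```

Note that a count of zero has no non-zero digit, so its first digit comes out as NA, while its last digit is 0.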

The table function gives us the counts of each number, and we compare these against the counts predicted by Benford’s Law.

# A factor with fixed levels keeps a count of 0 for any digit that never occurs
first_digit_counts <- as.vector(table(factor(russian$first.digit, levels = 1:9)))
first_digit_actual_vs_expected <- data.frame(
  digit            = 1:9,
  actual.count     = first_digit_counts,
  actual.fraction  = first_digit_counts / nrow(russian),
  benford.fraction = log10(1 + 1 / (1:9))
)
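
As a sanity check on the Benford formula, the sketch below draws simulated log-normal numbers spanning several orders of magnitude (made-up data, not election results) and compares their first-digit frequencies with the prediction. The variable names are mine, and the chi-squared goodness-of-fit test is only one possible way to summarise the fit; it is not part of Ben's analysis.

```r
set.seed(42)  # make the simulation reproducible

# Simulated positive whole numbers covering many orders of magnitude
# (hypothetical data for illustration only)
simulated <- floor(rlnorm(10000, meanlog = 8, sdlog = 3))
simulated <- simulated[simulated > 0]   # zero has no first digit

# Leading digit = first character of the number written out
first <- substr(as.character(simulated), 1, 1)
observed <- table(factor(first, levels = 1:9))

# Benford's Law: P(first digit = d) = log10(1 + 1/d)
benford <- log10(1 + 1 / (1:9))

# Observed vs. predicted fractions, side by side
round(rbind(observed = as.vector(observed) / sum(observed), benford), 3)

# Goodness-of-fit test of the observed counts against Benford's fractions
chisq.test(observed, p = benford)
```

The nine Benford fractions sum to one, so they can be passed straight to chisq.test as the expected probabilities.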

The counts of the last digit can be obtained in a similar way.

# As above, fixed levels keep a count of 0 for any digit that never occurs
last_digit_counts <- as.vector(table(factor(russian$last.digit, levels = 0:9)))
last_digit_actual_vs_expected <- data.frame(
    digit     = 0:9,
    count     = last_digit_counts,
    fraction  = last_digit_counts / nrow(russian)
)
last_digit_actual_vs_expected$cumulative.fraction <- cumsum(last_digit_actual_vs_expected$fraction)

Here is the line graph…

a_vs_e <- melt(first_digit_actual_vs_expected[, c("digit", "actual.fraction", "benford.fraction")], id.vars = "digit")
(fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
    geom_line() +
    scale_x_continuous(breaks = 1:9) +
    scale_y_continuous(labels = scales::percent) +  # was formatter = "percent" in older ggplot2
    ylab("Counts with this first digit") +
    theme(legend.position = "none")                 # was opts() in older ggplot2
)

Fig 1. Actual percentages of first digits vs. those predicted by Benford's Law

and the histogram

(fig2_hist <- ggplot(russian, aes(value)) +
    geom_histogram(binwidth = 20)
)

Fig 2. Histogram of vote counts in the Russian election

  1. Piero
    7th March, 2012 at 8:38 am

    I think the data don’t follow Benford’s law because the numbers generated by this election don’t span several orders of magnitude. Data aggregated by region do follow Benford’s law. I tested the hypothesis with Italian data aggregated by section and by municipality, and the results are similar to the Russian ones.

    Piero

    • 7th March, 2012 at 11:08 am

      Agreed, and Ben’s original article says as much. It might be interesting to rerun the analysis with numbers converted to base 5, say, in order to increase the number of orders of magnitude.

      Of course, we shouldn’t really expect Benford’s Law to find anything untoward, since sitting down and making up numbers for election results is less likely than ballot stuffing, or intimidating people who would have voted for the opposition, or gerrymandering, or many other ways of rigging an election.

  2. Patrick (I don't really use this email)
    21st March, 2013 at 7:13 am

    The syntax for ggplot2 has changed. Now opts should be replaced by theme, and also the following line’s syntax has changed:

    scale_y_continuous(formatter = "percent")

    I’m not sure how the new syntax works though. I expected the following to print percentages on the y axis, but it didn’t. I’m sure you’ll know how to fix this!

    library("scales") # needed? for scale_y_continuous(labels = percent)
    (fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
    geom_line() +
    scale_x_continuous(breaks = 1:9) +
    scale_y_continuous(labels = percent) + #shouldn't this work?
    xlab("") +
    ylab("Counts with this first digit") +
    theme(legend.position = "none")
    )

    • Patrick
      21st March, 2013 at 7:14 am

      Oh but it does work, with the new syntax.
