Benford’s Law and fraud in the Russian election
Earlier today Ben Goldacre posted about using Benford’s Law to try and detect fraud in the Russian elections. Read that now, or the rest of this post won’t make sense. This is a loose R translation of Ben’s Stata code.
The data is held in a Google doc. While it is possible to directly retrieve the contents with R, for a single document it is easier to save it a CSV, and load it from your own machine.
russian <- read.csv("Russian observed results - FullData.csv")
There are loads of ways of manipulating data and plotting it in R, and while you can do everything in the base R distribution, I’m going to use a few packages to make it easier.
library(reshape) library(stringr) library(ggplot2)
A little transformation is needed. We take only the columns containing the counts and manipulate the data into a “long” format with only one value per row.
russian <- melt(
russian[, c("Zhirinovsky", "Zyuganov", "Mironov", "Prokhorov", "Putin")],
variable_name = "candidate"
)
Now we add columns containing the first and last digits, extracted using regular expressions.
russian <- ddply(
russian,
.(candidate),
transform,
first.digit = str_extract(value, "[123456789]"),
last.digit = str_extract(value, "[[:digit:]]$"))
The table function gives us the counts of each number, and we compare these against the counts predicted by Benford’s Law.
first_digit_counts <- as.vector(table(russian$first.digit)) first_digit_actual_vs_expected <- data.frame( digit = 1:9, actual.count = first_digit_counts, actual.fraction = first_digit_counts / nrow(russian), benford.fraction = log10(1 + 1 / (1:9)) )
The counts of the last digit can be obtained in a similar way.
last_digit_counts <- as.vector(table(russian$last.digit))
last_digit_actual_vs_expected <- data.frame(
digit = 0:9,
count = last_digit_counts,
fraction = last_digit_counts / nrow(russian)
)
last_digit_actual_vs_expected$cumulative.fraction <- cumsum(last_digit_actual_vs_expected$fraction)
Here is the line graph…
a_vs_e <- melt(first_digit_actual_vs_expected[, c("digit", "actual.fraction", "benford.fraction")], id.var = "digit")
(fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
geom_line() +
scale_x_continuous(breaks = 1:9) +
scale_y_continuous(formatter = "percent") +
ylab("Counts with this first digit") +
opts(legend.position = "none")
)
and the histogram
(fig2_hist <- ggplot(russian, aes(value)) +
geom_histogram(binwidth = 20)
)



I think that the data don’t follow the distribution of Benford’s law because the numbers generated by this election don’t span several orders of magnitude. Data aggregated for regions follow Benford’s law. I tested the hypothesis with Italian data aggregated for sections and for municipalities and results are similar to Russian’s.
Piero
Agreed, and Ben’s original article says as much. It might be interesting to rerun the analysis with numbers converted to base 5, say, in order to increase the number of orders of magnitude.
Of course, we shouldn’t really expect Benford’s Law to find anything untoward, since sitting down and making up numbers for election results is less likely than ballot stuffing, or intimidating people who would have voted for the opposition, or gerrymandering, or many other ways of rigging an election.
The syntax for ggplot2 has changed. Now opts should be replaced by theme, and also the following line’s syntax has changed:
scale_y_continuous(formatter = “percent”)
I’m not sure how the new syntax works though. I expected the following to print percentages on the y axis,but it didn’t. I’m sure you’ll know how to fix this!
library(“scales”) # needed? for scale_y_continuous(labels = percent)
(fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
geom_line() +
scale_x_continuous(breaks = 1:9) +
scale_y_continuous(labels = percent) + #shouldn't this work?
xlab("") +
ylab("Counts with this first digit") +
theme(legend.position = "none")
)
Oh but it does work, with the new syntax.