Archive
A brainfuck interpreter for R
The deadline for my book on R is fast approaching, so naturally I’m in full procrastination mode. So much so that I’ve spent this evening creating a brainfuck interpreter for R. brainfuck is a very simple programming language: you get an array of 30000 bytes, an index, and just 8 eight commands. You move the index left or right along the array with < and >; increase or decrease the value at the current position with + and -; read and write characters using . and ,; and start and end loops with [ and ].
There seem to be two approaches to creating a brainfuck interpreter: directly execute the commands, or generate code in a sensible language and execute that. I’ve opted for the latter approach because it’s easier, at least in R. Generating R code and then calling eval is probably a little slower than directly executing commands, but that’s the least of your worries with brainfuck. Even writing a trivial page-long program will take you many million times longer than it takes to execute.
The fact that you have to mix data variables (that 30000 element raw vector and an index) with commands means that an object oriented approach is useful. The whole interpreter is stored in a single reference class, of type brainfuck. Rather than me showing you all the code here, I suggest that you take a look at it (or clone it) from its repository on bitbucket. (I’ll submit to CRAN soon.)
Here’s a Hello World example taken from Wikipedia. To use the brainfuck package, you just create/import your brainfuck program as a character vector (non-command characters are ignored, so you can comment your code). Call fuckbrain once to create the interpreter variable, then call its interpret method on each program that you want to run.
library(brainfuck)
hello_world <- "+++++ +++++ initialize counter (cell #0) to 10
[ use loop to set the next four cells to 70/100/30/10
> +++++ ++ add 7 to cell #1
> +++++ +++++ add 10 to cell #2
> +++ add 3 to cell #3
> + add 1 to cell #4
<<<< - decrement counter (cell #0)
]
> ++ . print 'H'
> + . print 'e'
+++++ ++ . print 'l'
. print 'l'
+++ . print 'o'
> ++ . print ' '
<< +++++ +++++ +++++ . print 'W'
> . print 'o'
+++ . print 'r'
----- - . print 'l'
----- --- . print 'd'
> + . print '!'
> . print '\n'"
bfi <- fuckbrain()
bfi$interpret()
Have my old job!
My old job at the Health & Safety Laboratory is being advertised, and at a higher pay grade to boot. (Though it is still civil service pay, and thus not going to make you rich.)
You’ll need to have solid mathematical modelling skills, particularly solving systems of ODEs, and be proficient at writing scientific code, preferably R or MATLAB or acslX. From chats with a few people at the lab, management are especially keen to get someone who can bring in money so grant writing and blagging skills are important too.
It’s a smashing place to work and the people are lovely. Also, you get flexitime and loads of holiday. If you are looking for a maths job in North West* England then I can heartily recommend applying.
*Buxton is sometimes North West England (when we get BBC local news) and sometimes in the East Midlands (like when we vote in European elections).
Make your data famous!
I’m writing a book on R for O’Reilly, and I need interesting datasets for the examples. Any data that you provide will get you a mention in the book and in the publicity material, so it’s a great opportunity to publicise your work or your organisation.
Datasets from any area or industry are suitable; the only constraint is that it can be analysed with a few pages of R code to provide a result that a general reader might go “ooh”. There’s a chapter on data cleaning, so even dirty data is suitable!
All the data will be provided in an R package to accompany the book, so you need to be willing to make it publically available. I can help you anonymise the data, or strip out commercially sensitive parts if you require.
If you can provide anything, or you know someone who might be able to, then drop me an email at richierocks AT gmail DOT com. Thanks.
EDIT: There are some (quite) frequently asked questions already! Here are the answers; you can use your Jeopardy! skills to guess the questions.
1. The book is called “Learning R”, and it’s a fairly gentle introduction to the language, covering both how you program in R, and how you analyse data.
2. If you provide data, then yes, you can have an PDF of the pre-release version to make sure I haven’t done something silly with your dataset.
How long does it take to get pregnant?
My girlfriend’s biological clock is ticking, and so we’ve started trying to spawn. Since I’m impatient, that has naturally lead to questions like “how long will it take?”. If I were to believe everything on TV, the answer would be easy: have unprotected sex once and pregnancy is guaranteed.
A more cynical me suggests that this isn’t the case. Unfortunately, it is surpisingly difficult to find out the monthly chance of getting pregnant (technical jargon: the “monthly fecundity rate”, or MFR), given that you are having regular sex in the days leading up to ovulation. Everyone agrees that age has a big effect, with women’s peak fertility occuring somewhere around the age of 25. Beyond that point, the internet is filled with near-useless summary statistics like the chance of conceiving after one year. For example, the usually reliable NHS site says
Women become less fertile as they get older. For women aged 35, about 94 out of every 100
who have regular unprotected sex will get pregnant after three years of trying. However, for
women aged 38, only 77 out of every 100 will do so.
I found a couple of reasonably sciency links(George and Kamath, Socal Fertility) that suggest that the MFR is about 25% for a women aged 25, and 10% at age 35. The Scoal link also gives rates of 15% at age 30, 5% at age 40 and less than 1% at age 45. If the woman is too fat, too thin, a smoker, or has hormone problems, or is stressed, then the rate needs reducing.
Given the MFR, the probability of getting pregnant after a given number of months can be calculated with a negative binomial distribution.
months <- 0:60
p_preg_per_month <- c("25" = 0.25, "30" = 0.15, "35" = 0.1, "40" = 0.05, "45" = 0.01)
p_success <- unlist(lapply(
p_preg_per_month,
function(p) pnbinom(months, 1, p)
))
Now we just create a data frame suitable for passing to ggplot2 …
mfr_group <- paste( "MFR =", format(p_preg_per_month, digits = 2), "at age", names(p_preg_per_month) ) mfr_group <- factor(mfr_group, levels = mfr_group) preg_data <- data.frame( months = rep.int(months, length(mfr_group)) , mfr_group = rep(mfr_group, each = length(months)), p_success = p_success )
and draw the plot.
library(ggplot2)
(p <- ggplot(preg_data, aes(months, p_success, colour = mfr_group)) +
geom_point() +
scale_x_continuous(breaks = seq.int(0, 60, 12)) +
scale_y_continuous(breaks = seq.int(0, 1, 0.1), limits = c(0, 1)) +
scale_colour_discrete("Monthly fecundity rate") +
xlab("Months") +
ylab("Probability of conception") +
opts(panel.grid.major = theme_line(colour = "grey60"))
)
So almost half of the (healthy) 25 year olds get pregnant in the first monthtwo months, and after two years (the point when doctors start considering you to have fertility problems) more than 90% of 35 year olds should conceive. By contrast, just over 20% of 45 year old women will. In fact, even this statistic is over-optimistic: at this age, fertility is rapidly decreasing, and a 1% MFR at age 45 will mean a much lower MFR at age 47 and the negative binomial model breaks down.
Of course, from a male point of view, conception is an embarrassingly parallel problem: you can dramatically reduce the time to conceive a child by sleeping with lots of women at once. (DISCLAIMER: Janette, if you’re reading this, I’m not practising or advocating this technique!)
Be assertive!
assertive, my new package for writing robust code, is now on CRAN. It consists of lots of is functions for checking variables, and corresponding assert functions that throw an error if the condition doesn’t hold. For example, is_a_number checks that the input is numeric and scalar.
is_a_number(1) #TRUE
is_a_number("a") #FALSE
is_a_number(1:10) #FALSE
In the last two cases, the return value of FALSE has an attribute “cause” that indicates the cause of failure. When “a” is the input, the cause is “"a" is not of type 'numeric'.“, whereas for 1:10, the cause is “1:10 does not have length one.“. You can get or set the cause attribute with the cause function.
m <- lm(uptake ~ 1, CO2) ok <- is_empty_model(m) if(!ok) cause(ok)
The assert functions call an is function, and if the result is FALSE, they throw an error; otherwise they do nothing.
assert_is_a_number(1) #OK
assert_is_a_number("a") #Throws an error
There are also some has functions, primarily for checking the presence of attributes.
has_names(c(foo = 1, bar = 4, baz = 9)) has_dims(matrix(1:12, nrow = 3))
Some functions apply to properties of vectors. In this case, the assert functions can check that all the values conform to the condition, or any of the values conform.
x <- -2:2 is_positive(x) #The last two are TRUE assert_any_are_positive(x) #OK assert_all_are_positive(x) #Error
“Why would you want to use these functions?”, you may be asking. The dynamic typing and extreme flexibility of R means that it is very easy to have variables that are the wrong format. This is particularly true when you are dealing with user input. So while you know that the sales totals passed to your function should be a vector of non-negative numbers, or that the regular expression should be a single string rather than a character vector, your user may not. You need to check for these invalid conditions, and return an error message that the user can understand. assertive makes it easy to do all this.
Since this is the first public release of assertive, it hasn’t been widely tested. I’ve written a moderately comprehensive unit-test suite, but there are likely to be a few minor bugs here and there. In particular, I suspect there may be one or two typos in the documentation. Please give the package a try, and let me know if you find any errors, or if you want any other functions adding.
Benford’s Law and fraud in the Russian election
Earlier today Ben Goldacre posted about using Benford’s Law to try and detect fraud in the Russian elections. Read that now, or the rest of this post won’t make sense. This is a loose R translation of Ben’s Stata code.
The data is held in a Google doc. While it is possible to directly retrieve the contents with R, for a single document it is easier to save it a CSV, and load it from your own machine.
russian <- read.csv("Russian observed results - FullData.csv")
There are loads of ways of manipulating data and plotting it in R, and while you can do everything in the base R distribution, I’m going to use a few packages to make it easier.
library(reshape) library(stringr) library(ggplot2)
A little transformation is needed. We take only the columns containing the counts and manipulate the data into a “long” format with only one value per row.
russian <- melt(
russian[, c("Zhirinovsky", "Zyuganov", "Mironov", "Prokhorov", "Putin")],
variable_name = "candidate"
)
Now we add columns containing the first and last digits, extracted using regular expressions.
russian <- ddply(
russian,
.(candidate),
transform,
first.digit = str_extract(value, "[123456789]"),
last.digit = str_extract(value, "[[:digit:]]$"))
The table function gives us the counts of each number, and we compare these against the counts predicted by Benford’s Law.
first_digit_counts <- as.vector(table(russian$first.digit)) first_digit_actual_vs_expected <- data.frame( digit = 1:9, actual.count = first_digit_counts, actual.fraction = first_digit_counts / nrow(russian), benford.fraction = log10(1 + 1 / (1:9)) )
The counts of the last digit can be obtained in a similar way.
last_digit_counts <- as.vector(table(russian$last.digit))
last_digit_actual_vs_expected <- data.frame(
digit = 0:9,
count = last_digit_counts,
fraction = last_digit_counts / nrow(russian)
)
last_digit_actual_vs_expected$cumulative.fraction <- cumsum(last_digit_actual_vs_expected$fraction)
Here is the line graph…
a_vs_e <- melt(first_digit_actual_vs_expected[, c("digit", "actual.fraction", "benford.fraction")], id.var = "digit")
(fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
geom_line() +
scale_x_continuous(breaks = 1:9) +
scale_y_continuous(formatter = "percent") +
ylab("Counts with this first digit") +
opts(legend.position = "none")
)
and the histogram
(fig2_hist <- ggplot(russian, aes(value)) +
geom_histogram(binwidth = 20)
)
R hits 10000 questions on stackoverflow
A milestone, though not that exciting as questions go. Still, if you haven’t yet joined the cult of Stack Exchange, take a look here.
Exploring the functions in a package
Sometimes it can be useful to list all the functions inside a package. This is done in the same way that you would list variables in your workspace. That is, using ls. The syntax is ls(pos = "package:packagename"), which is easy enough if you can remember it. Unfortunately, I never can, and have to type search() first to see what the format of that string is.
Today, that problem is solved with a tiny utility function to save remembering things, and to save typing.
lsp <- function(package, all.names = FALSE, pattern)
{
package <- deparse(substitute(package))
ls(
pos = paste("package", package, sep = ":"),
all.names = all.names,
pattern = pattern
)
}
all.names and pattern behave in the same way as they do in regular ls. You use it like this:
lsp(base) lsp(base, TRUE) lsp(base, pattern = "^is")
EDIT: I’ve had a couple of questions about the use case, and there are some interesting comments on alternatives. My thinking behind this function was that I sometimes know I’ve seen a function in a package but can’t remember what it’s called. If you can hazard a guess at the name, then apropos is probably better, though it looks everywhere on the search path rather than in a particular package. Autocompletion is also useful for this, but you need to know the first few characters of what you are looking for. (Activate autocompletions by pressing TAB in R GUI or Rstudio or CTRL+space in eclipse. I can’t remember what the shortcut is in emacs, but you probably just mash CTRL+META until you have RSI.) Finally, the unknownR package is useful for finding new functions that you hadn’t heard of yet.





