Archive for April, 2011

Friday function triple bill: with vs. within vs. transform

29th April, 2011 9 comments

When you first learnt about data frames in R, I’m sure that, like me, you thought “This is a lot of hassle having to type the names of data frames over and over in order to access each column”.

anorexia$wtDiff <- anorexia$Postwt - anorexia$Prewt #I have to type anorexia how many times?

Indeed, any time you see chunks of code repeated over and over, there’s an indication that they need rewriting. Thus the first time you discovered the attach function was a blissful moment. Ah, the hours you would save by not typing variable names! Alas, those hours were more than made up for by the hundreds of hours you spent debugging impenetrable buggy code that was a side effect of attach.

attach(anorexia)
anorexia$wtDiff <- Postwt - Prew   #Deliberate typo!
detach(anorexia)

In the above snippet of code, the typo causes execution to halt after the second line, so the call to detach never happens. Then, after you’ve fixed the typo and run the code again, anorexia is on your search path twice. This is problematic because when you detach it, there is still a copy of the data frame on the search path. Cue wailing and gnashing of teeth as you waste half an hour trying to find the bug.
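The duplicated search path entry is easy to reproduce. The following is a minimal sketch that uses the built-in airquality data frame instead of anorexia, so that it runs without loading any package; the variable names copies_after_two_attaches and still_attached are purely illustrative.

```r
# Simulate running attach() twice because an earlier detach() was skipped.
attach(airquality)
attach(airquality)  # R prints a message about masked objects here

# Count how many copies of the data frame are on the search path.
copies_after_two_attaches <- sum(search() == "airquality")

# A single detach() removes only one of the copies.
detach(airquality)
still_attached <- "airquality" %in% search()

# Clean up the remaining copy.
detach(airquality)
```

After the first detach, the columns are still visible by their bare names, which is exactly the kind of stale state that makes these bugs hard to spot.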

Today we’re going to look at three functions that let you manipulate data frames without the nasty side-effects of attach: with, within and transform.

For adding (or overwriting) a column to a data frame, like in the above example, any of the three functions is perfectly adequate; they just have slightly different syntaxes. with often has the most concise formulation, though there isn’t much in it.

anorexia$wtDiff <- with(anorexia, Postwt - Prewt)
anorexia <- within(anorexia, wtDiff2 <- Postwt - Prewt)
anorexia <- transform(anorexia, wtDiff3 = Postwt - Prewt)

For multiple changes to the data frame, all three functions can still be used, but now the syntax for with is more cumbersome. I tend to favour within or transform in these situations.

fahrenheit_to_celcius <- function(f) (f - 32) / 1.8
#month.abb (R's built-in abbreviated month names) is assumed as the lookup vector
airquality[c("cTemp", "logOzone", "MonthName")] <- with(airquality, list(
  fahrenheit_to_celcius(Temp),
  log(Ozone),
  month.abb[Month]
))
airquality <- within(airquality, 
{
  cTemp2     <- fahrenheit_to_celcius(Temp)
  logOzone2  <- log(Ozone)
  MonthName2 <- month.abb[Month]
})
airquality <- transform(airquality, 
  cTemp3     = fahrenheit_to_celcius(Temp),
  logOzone3  = log(Ozone),
  MonthName3 = month.abb[Month]
)

The most important lesson to take away from this is that if you are manipulating data frames, then with, within and transform can be used almost interchangeably, and all of them should be used in preference to attach. For further refinement, I prefer with for single updates to data frames, and within or transform for multiple updates.
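One place where the "almost" matters is when a new column depends on another new column. The sketch below uses a toy data frame (df, y and z are illustrative names, not from the examples above). Inside within, later statements can see variables created by earlier ones; transform evaluates its arguments against the columns of the data frame it was given, so the same effect needs two chained calls.

```r
df <- data.frame(x = 1:3)

# within(): statements run in order, so z can use the freshly created y.
df_within <- within(df, {
  y <- x * 2
  z <- y + 1
})

# transform(): y does not exist in df yet, so create it in an inner call first.
df_transform <- transform(transform(df, y = x * 2), z = y + 1)
```

Both results contain the same values, though the order of the new columns may differ between the two approaches.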

(Almost) Friday Function: alarm

21st April, 2011 2 comments

Last week I decided to start a weekly column detailing an interesting function each Friday, entirely forgetting that I would be on holiday, without internet access (shock horror!), tomorrow. So here’s your column a little early.

The alarm function is something of a novelty, in that all it does is to make an annoying noise when you call it. The only vaguely sensible time that I can think of that you might want to do this is when an error is thrown. Setting this up means overriding the default error handling behaviour, which is surprisingly easy.

options(error = alarm) 
stop("!!!")  # to test the behaviour

For best results, make sure you’ve unplugged your headphones and turn the volume up loud to annoy friends, family and colleagues while debugging.

(To restore the default behaviour, use options(error = NULL).)
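If changing a global option feels too heavy-handed, the noise can be limited to a single piece of code. This is only a sketch of one possible alternative: with_alarm is a hypothetical helper (not part of R) that uses tryCatch to sound the alarm and then re-raise the original error.

```r
# Hypothetical helper: evaluate an expression, beep on error, then re-raise.
with_alarm <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      alarm()   # writes the bell character to the console
      stop(e)   # re-signal the original error
    }
  )
}

quiet_result <- with_alarm(sqrt(16))   # no error, so no noise
```

Calling with_alarm(stop("!!!")) should beep and then fail with the usual error message, leaving the session's error option untouched.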

supercalifragilisticexpialidocious = 1

21st April, 2011 2 comments

I notice that the latest version of R has upped the maximum length of variable names from 256 characters to a whopping 10 000! (See ?name.) It makes the 63 character limit in MATLAB look rather pitiful by comparison. Come on MathWorks! Let’s have the ability to be stupidly verbose in our variable naming!
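For the curious, the limit can be probed directly. This is a rough sketch that assumes a version of R with the 10 000 limit; all of the names in it (name_at_limit, long_names_env and so on) are made up for the demonstration, and the variables are kept in a scratch environment to avoid cluttering the workspace.

```r
# Build names that sit exactly on, and just over, the limit.
name_at_limit   <- strrep("a", 10000)
name_over_limit <- strrep("a", 10001)

# assign() creates a variable from a character string.
long_names_env <- new.env()
assign(name_at_limit, 1, envir = long_names_env)
at_limit_ok <- exists(name_at_limit, envir = long_names_env, inherits = FALSE)

# One character more is expected to be rejected; record what happens.
over_limit_result <- tryCatch(
  {
    assign(name_over_limit, 1, envir = long_names_env)
    "assigned"
  },
  error = function(e) "error"
)
```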

Non-standard assignment with getSymbols

21st April, 2011 3 comments

I recently came across a rather interesting investment blog, Timely Portfolio. I have a certain soft spot for that sort of thing, because using my data analysis skills to make a fortune is casually on my to-do list.

This blog makes regular use of a function getSymbols in the quantmod package. The power and simplicity of the function are fantastic: with one short line of code, you can retrieve historical data on any stock, bond or index that you fancy. It does have one oddity though. In R, we are all used to assigning values to variables with <-.

x <- mean(1:5)

Not for getSymbols this behaviour though. It uses a bizarre assignment procedure whereby the return value is assigned to a variable with the same name as the Symbols parameter, into an environment of your choice (the global environment by default). For example, getSymbols("DGS10",src="FRED") creates a variable named DGS10.
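To see the mechanism without needing quantmod or an internet connection, here is a toy sketch of this style of assignment. fetch_by_name is a hypothetical stand-in, not the real getSymbols, and it stores a placeholder string in place of downloaded data; the only point is that the variable is created by assign rather than by <-.

```r
# Hypothetical stand-in for getSymbols: it downloads nothing and simply
# stores a placeholder under the requested name in the chosen environment.
fetch_by_name <- function(symbol, env = globalenv()) {
  value <- paste("placeholder data for", symbol)
  assign(symbol, value, envir = env)  # the variable name comes from the string
  invisible(symbol)
}

# Use a scratch environment here to keep the global workspace clean.
demo_env <- new.env()
fetch_by_name("DGS10", env = demo_env)
created <- exists("DGS10", envir = demo_env, inherits = FALSE)
```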

When retrieving many symbols, this can get a little clunky. Here’s a snippet from a recent post.

getSymbols("DGS10",src="FRED") #load 10yTreasury
getSymbols("DFII10",src="FRED") #load 10yTIP for real return
getSymbols("DTWEXB",src="FRED") #load US dollar
getSymbols("SP500",src="FRED") #load SP500

I see lots of code repetition, which means that this is a prime opportunity for some refactoring. These four lines can be condensed by passing a vector to getSymbols (EDIT CREDIT: thanks Owe!).

symbol_names <- c("DGS10", "DFII10", "DTWEXB", "SP500")
#lapply(symbol_names, getSymbols, src = "FRED")  #see Owe's comment
getSymbols(symbol_names, src = "FRED")

Unfortunately, the non-standard assignment means that instead of having a nice list of each of our datasets, we have four separate variables. To fix this, we must create our own environment to store the results, then convert that to a list.

symbol_env <- new.env()
#lapply(symbol_names, getSymbols, src="FRED", env = symbol_env) 
getSymbols(symbol_names, src = "FRED", env = symbol_env)
list_of_symbols <- as.list(symbol_env)

Understanding environments is quite an advanced topic and a full explanation is beyond the scope of this post. In this case, however, the idea is very simple. We need somewhere out of the way to store all the variables that getSymbols creates. This storage place is the environment symbol_env, which can be thought of as a list with special variable scoping rules. Since environments and lists are such similar constructs, we can convert from one to the other with as.list. (list2env works in the other direction.)
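The round trip between environments and lists can be tried without downloading anything. In this small sketch, scratch_env and its contents are invented for the demonstration.

```r
# Put a couple of variables somewhere out of the way.
scratch_env <- new.env()
assign("a", 1:3, envir = scratch_env)
assign("b", "hello", envir = scratch_env)

# Environment -> list. Element order is not guaranteed, so access by name.
scratch_list <- as.list(scratch_env)

# List -> environment: each list element becomes a variable again.
round_trip_env <- list2env(scratch_list)
```

Because the order of the elements in the list is not guaranteed, it is safer to pick datasets out by name (for example list_of_symbols$DGS10) than by position.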

Friday Function: setInternet2

15th April, 2011 2 comments

Corporate IT networks are a pain for programmers. Ideally, when programming, you want the freedom to download, install and run any software that you want. Unfortunately, in the interests of security, many programmers find themselves a little restricted at the office. (I’m sure that many network admins will protest that the situation works both ways – programmers are also a pain for corporate IT networks.)

With the default installation of R, you may find that connecting from R to the internet doesn’t work. This is a shame, since there are many useful features of R that require internet access, not least downloading packages and scraping data from webpages with RCurl.

For Windows users, there is a solution. By typing setInternet2(TRUE), R connects via internet2.dll, which Internet Explorer uses. From a network point of view this makes R appear to be the same as Internet Explorer, so it can sneak through.

In order to have this functionality every time you run R, add that line of code to your file in the R.home("etc") directory. If you have control over the installation of R, choosing a custom installation gives you the option to connect via internet2.dll by default.