Friday function triple bill: with vs. within vs. transform
When you first learnt about data frames in R, I’m sure that, like me, you thought “This is a lot of hassle having to type the names of data frames over and over in order to access each column”.
library(MASS)
anorexia$wtDiff <- anorexia$Postwt - anorexia$Prewt # I have to type anorexia how many times?
Indeed, any time you see chunks of code repeated over and over, it’s a sign that the code needs rewriting. So the first time you discovered the attach function was a blissful moment. Ah, the hours you would save by not typing data frame names! Alas, those hours were more than cancelled out by the hundreds of hours you spent debugging impenetrable code whose bugs were a side effect of attach.
attach(anorexia)
anorexia$wtDiff <- Postwt - Prew # Deliberate typo!
detach(anorexia)
In the above snippet of code, the typo causes execution to halt at the second line, so the call to detach never happens. Then, after you’ve fixed the typo and run the code again, anorexia is on your search path twice. This is problematic because when you detach it, there is still a copy of the data frame on the search path. Cue wailing and gnashing of teeth as you waste half an hour trying to find the bug.
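If you want to see the problem for yourself, here is a minimal sketch. It assumes the MASS package is installed, and it simulates the "fix the typo and rerun" step by simply calling attach twice; the variable names copies_attached and still_attached are mine, purely for illustration.

```r
library(MASS)  # provides the anorexia data frame

attach(anorexia)
attach(anorexia)  # stands in for rerunning the script after the error

# Count how many copies of the data frame are on the search path
copies_attached <- sum(search() == "anorexia")

detach(anorexia)  # removes only the first copy it finds

# Check whether a copy is still lurking on the search path
still_attached <- "anorexia" %in% search()

detach(anorexia)  # clean up the leftover copy
```

The second attach prints a message about masked objects, which is easy to miss in a long script.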
Today we’re going to look at three functions that let you manipulate data frames without the nasty side effects of attach: with, within and transform.
For adding (or overwriting) a column in a data frame, as in the above example, any of the three functions is perfectly adequate; they just have slightly different syntaxes. with often has the most concise formulation, though there isn’t much in it.
anorexia$wtDiff <- with(anorexia, Postwt - Prewt)
anorexia <- within(anorexia, wtDiff2 <- Postwt - Prewt)
anorexia <- transform(anorexia, wtDiff3 = Postwt - Prewt)
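The syntax differences come down to what each function hands back: with returns the value of the expression, while within and transform return a modified copy of the data frame. A small sketch of that, using a made-up two-row data frame (weights and the other names here are hypothetical, not part of the anorexia dataset):

```r
# Hypothetical toy data frame, for illustration only
weights <- data.frame(Prewt = c(80, 85), Postwt = c(82, 84))

# with() evaluates the expression and returns its value, here a numeric vector
diff_vector <- with(weights, Postwt - Prewt)

# within() and transform() return a copy of the data frame with the new column
via_within <- within(weights, wtDiff <- Postwt - Prewt)
via_transform <- transform(weights, wtDiff = Postwt - Prewt)

# The original data frame is left untouched
names(weights)
```

That is why the with version assigns to a single column, whereas the other two overwrite the whole data frame.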
For multiple changes to the data frame, all three functions can still be used, but now the syntax for with is more cumbersome. I tend to favour within or transform in these situations.
fahrenheit_to_celsius <- function(f) (f - 32) / 1.8

# with: build a list of new columns, then assign them by name
airquality[c("cTemp", "logOzone", "MonthName")] <- with(airquality, list(
  fahrenheit_to_celsius(Temp),
  log(Ozone),
  month.abb[Month]
))

# within: ordinary assignments inside a code block
airquality <- within(airquality, {
  cTemp2 <- fahrenheit_to_celsius(Temp)
  logOzone2 <- log(Ozone)
  MonthName2 <- month.abb[Month]
})

# transform: one named argument per new column
airquality <- transform(airquality,
  cTemp3 = fahrenheit_to_celsius(Temp),
  logOzone3 = log(Ozone),
  MonthName3 = month.abb[Month]
)
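One practical difference shows up when a new column depends on another new column. As far as I know, within runs its block in order, so later lines can use earlier results, whereas transform evaluates every argument against the columns that already exist. The sketch below uses a hypothetical toy data frame (temps and its column names are made up) and sidesteps the issue for transform by calling it twice.

```r
# Hypothetical toy data frame, for illustration only
temps <- data.frame(f = c(32, 212))

# within() runs the block top to bottom, so kelvin can use celsius
chained <- within(temps, {
  celsius <- (f - 32) / 1.8
  kelvin <- celsius + 273.15
})

# transform() sees only existing columns, so the dependent
# column is added in a second call
stepwise <- transform(temps, celsius = (f - 32) / 1.8)
stepwise <- transform(stepwise, kelvin = celsius + 273.15)
```

Both routes give the same values; which one reads better is largely a matter of taste.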
The most important lesson to take away from this is that if you are manipulating data frames, then with, within and transform can be used almost interchangeably, and all of them should be used in preference to attach. For further refinement, I prefer with for single updates to data frames, and within or transform for multiple updates.
Good post, I went through the same process but had forgotten about with and transform. For me the fix was just to give data frames short names (d, dat, dp, ds, et cetera).
Great post! I spent ages the other day trying to work out why a dataframe that I had used attach() with didn’t want to go away. I ended up using with() instead, but didn’t understand why I had the problems with attach(). Now it makes sense!
Great post! Thanks for the tip.
I never liked attach. I don’t like polluting the global namespace. The $ notation was often too verbose. This is great.
Thanks!
why not just use 1 or 2 letters variable names for data.frames?
a for anorexia
aq for airquality
the column names also help to understand what these variables stand for
That’s a good question, and in fact there is a very strong argument against very short variable names. They make the code less readable. Calling a data frame “anorexia” gives you context about what the data are about. A data frame called “a” could be about anything – you need additional knowledge to understand the code, which is bad. I come from a maths background, where everything is just x’s and y’s, so it took me a while to see the virtue in longer names. The main advantage comes when you are looking at code six months later, and can’t remember what things meant.
I see your point richierock, but OTOH there’s no loss of clarity if your code is clearly documented with comments, so it is known that it is about eating disorders. Then a data frame “a” for anorexia, and “b” for bulimia ought to be just as meaningful if they are first used with a comment that spells out the full name.
Truth is though, I often make minor variations on a data frame and use arbitrary third character combinations. So in actual practice things are not so clear.
Write two versions of your code, one with readable variable names and one with abbreviated names plus explanations in the comments.
Pass your code to some colleagues and ask them which is easier to comprehend.