Friday function triple bill: with vs. within vs. transform

When you first learnt about data frames in R, I’m sure that, like me, you thought “This is a lot of hassle having to type the names of data frames over and over in order to access each column”.

library(MASS)
anorexia$wtDiff <- anorexia$Postwt - anorexia$Prewt #I have to type anorexia how many times?

Indeed, any time you see chunks of code repeated over and over, there’s an indication that they need rewriting. Thus the first time you discovered the attach function was a blissful moment. Ah, the hours you would save by not typing variable names! Alas, those hours were more than made up for by the hundreds of hours you spent debugging impenetrable buggy code that was a side effect of attach.

attach(anorexia)
anorexia$wtDiff <- Postwt - Prew   #Deliberate typo!
detach(anorexia)

In the above snippet of code, the typo causes execution to halt after the second line, so the call to detach never happens. Then, after you’ve fixed the typo and run the code again, anorexia is on your search path twice. This is problematic because when you detach it, there is still a copy of the data frame on the search path. Cue wailing and gnashing of teeth as you waste half an hour trying to find the bug.
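To see the problem for yourself, here is a minimal sketch that replays both runs and counts how many times anorexia appears on the search path. The helper count_attached, the use of try (so the typo doesn't halt the script) and warn.conflicts = FALSE (to hide the masking notice on the second attach) are illustrative additions, not part of the snippet above.

```r
library(MASS)  # provides the anorexia data set

# Hypothetical helper: how many entries on the search path have this name?
count_attached <- function(name) sum(search() == name)

# First run: the typo throws an error, so in a real script detach() is never reached
attach(anorexia)
try(anorexia$wtDiff <- Postwt - Prew, silent = TRUE)  # 'Prew' does not exist
after_failed_run <- count_attached("anorexia")        # 1: still attached

# Second run, typo fixed: attach() adds a second copy, detach() removes only one
attach(anorexia, warn.conflicts = FALSE)
anorexia$wtDiff <- Postwt - Prewt
detach(anorexia)
after_second_run <- count_attached("anorexia")        # 1: a stale copy remains

# Clean up the leftover copy
detach(anorexia)
after_cleanup <- count_attached("anorexia")           # 0
```

Calling search() directly at any point shows the same thing by eye, which is a quick first check whenever column names seem to resolve to the wrong values.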

Today we’re going to look at three functions that let you manipulate data frames without the nasty side effects of attach: with, within and transform.

For adding (or overwriting) a column to a data frame, like in the above example, any of the three functions is perfectly adequate; they just have slightly different syntaxes. with often has the most concise formulation, though there isn’t much in it.

anorexia$wtDiff <- with(anorexia, Postwt - Prewt)
anorexia <- within(anorexia, wtDiff2 <- Postwt - Prewt)
anorexia <- transform(anorexia, wtDiff3 = Postwt - Prewt)
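
The difference in syntax comes from what each function returns: with hands back the value of the expression (here a plain vector), whereas within and transform hand back a modified copy of the whole data frame. The following sketch checks this on a copy of the data; the names anorexia_copy, via_with, via_within and via_transform are made up for the illustration.

```r
library(MASS)  # provides the anorexia data set

anorexia_copy <- anorexia  # work on a copy so the original stays untouched

via_with      <- with(anorexia_copy, Postwt - Prewt)               # numeric vector
via_within    <- within(anorexia_copy, wtDiff <- Postwt - Prewt)   # data frame
via_transform <- transform(anorexia_copy, wtDiff = Postwt - Prewt) # data frame

# What kind of object did each call return?
is.numeric(via_with)
is.data.frame(via_within)
is.data.frame(via_transform)

# The computed weight differences agree across all three approaches
all.equal(via_with, via_within$wtDiff)
all.equal(via_with, via_transform$wtDiff)
```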

For multiple changes to the data frame, all three functions can still be used, but now the syntax for with is more cumbersome. I tend to favour within or transform in these situations.

fahrenheit_to_celcius <- function(f) (f - 32) / 1.8
airquality[c("cTemp", "logOzone", "MonthName")] <- with(airquality, list(
  fahrenheit_to_celcius(Temp),
  log(Ozone),
  month.abb[Month]
)) 
airquality <- within(airquality, 
{
  cTemp2     <- fahrenheit_to_celcius(Temp)
  logOzone2  <- log(Ozone)
  MonthName2 <- month.abb[Month]
}) 
airquality <- transform(airquality, 
  cTemp3     = fahrenheit_to_celcius(Temp),
  logOzone3  = log(Ozone),
  MonthName3 = month.abb[Month]
)
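
One difference worth knowing about when making several changes at once: inside within, a later statement can use a column created by an earlier one, whereas transform evaluates all of its arguments against the original data frame, so a new column is not visible to its neighbours. Here is a small sketch of that behaviour; the column names celsius and kelvin and the aq_* variables are invented for the example, and it assumes there is no variable called celsius in your workspace (if there were, transform would silently use it).

```r
aq <- airquality  # work on a copy of the built-in data set

# within(): kelvin can be built from the freshly created celsius column
aq_within <- within(aq, {
  celsius <- (Temp - 32) / 1.8
  kelvin  <- celsius + 273.15
})

# transform(): celsius does not exist yet when kelvin is evaluated,
# so this call fails; try() captures the error instead of halting
aq_transform <- try(
  transform(aq, celsius = (Temp - 32) / 1.8, kelvin = celsius + 273.15),
  silent = TRUE
)
inherits(aq_transform, "try-error")

# Chaining two transform() calls gives the same result as within()
aq_chained <- transform(
  transform(aq, celsius = (Temp - 32) / 1.8),
  kelvin = celsius + 273.15
)
all.equal(aq_within$kelvin, aq_chained$kelvin)
```

So for updates that depend on each other, within (or a chain of transform calls) is the safer choice.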

The most important lesson to take away from this is that if you are manipulating data frames, then with, within and transform can be used almost interchangeably, and all of them should be used in preference to attach. For further refinement, I prefer with for single updates to data frames, and within or transform for multiple updates.

  1. fd
    29th April, 2011 at 23:18 pm

    Good post, I went through the same process but had forgotten about with and transform. For me the fix was just to give dataframes short names- (d,dat,dp,ds, et cetera).

  2. 30th April, 2011 at 0:51 am

    Great post! I spent ages the other day trying to work out why a dataframe that I had used attach() with didn’t want to go away. I ended up using with() instead, but didn’t understand why I had the problems with attach(). Now it makes sense!

  3. 30th April, 2011 at 20:23 pm

    Great post! Thanks for the tip.

  4. Antony
    1st May, 2011 at 5:56 am

    I never liked attach. I don’t like polluting the global namespace. The $ notation was often too verbose. This is great.

  5. Han
    2nd May, 2011 at 7:42 am

    Thanks!

  6. shorty
    4th May, 2011 at 11:27 am

    why not just use 1 or 2 letters variable names for data.frames?
    a for anorexia
    aq for airquality
    the column names also help to understand what these variables stand for

    • 4th May, 2011 at 16:45 pm

      That’s a good question, and in fact there is a very strong argument against very short variable names: they make the code less readable. Calling a data frame “anorexia” gives you context about what the data are about. A data frame called “a” could be about anything – you need additional knowledge to understand the code, which is bad. I come from a maths background, where everything is just x’s and y’s, so it took me a while to see the virtue in longer names. The main advantage comes when you are looking at code six months later, and can’t remember what things meant.

      • Harold Baize
        25th July, 2013 at 0:07 am

        I see your point richierock, but OTOH there’s no loss of clarity if your code is clearly documented with comments, so it is known that it is about eating disorders. Then a data frame “a” for anorexia, and “b” for bulimia ought to be just as meaningful if they are first used with a comment that spells out the full name.
        Truth is though, I often make minor variations on a data frame and use arbitrary third character combinations. So in actual practice things are not so clear.

        • Richie
          25th July, 2013 at 8:52 am

          Write two versions of your code, one with readable variable names and one with abbreviated names plus explanations in the comments.

          Pass your code to some colleagues and ask them which is easier to comprehend.
