Archive for July, 2011

The Stats Clinic

27th July, 2011

Here at HSL we have a lot of smart kinda-numerate people who have access to a lot of data. On a bad day, kinda-numerate includes myself, but in general I’m talking about scientists who have done an introductory stats course, but not much else. When all you have is a t-test, suddenly everything looks like two groups of normally distributed numbers whose means need testing for a significant difference.

While we have a pretty good cross-disciplinary setup here, the ease of calculating a mean here or a standard deviation there means that many scientists can’t resist a piece of the number crunching action. Then suddenly there’s an Excel monstrosity that nobody understands rearing its ugly head.

Management has enlightenedly decided to fund a stats clinic, so us number nerds can help out the rest of the lab without any paperwork overhead (which was the biggest reason to put off asking for help). They didn’t like my slogan, but hey, you can’t have everything.

I’m really interested to hear how other organisations deal with this issue. Let me know in the comments.


The method in the mirror: reflection in R

17th July, 2011

Reflection is a programming concept that sounds scarier than it is. There are three related concepts that fall under the umbrella of reflection, and I’ll be surprised if you haven’t come across most of these code ideas already, even if you didn’t know it was called reflection.

The first concept is examination of your variables. In R, this mostly means calling class, attributes, str and summary. S4 classes also have the function showMethods to, ahem, show their methods.
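As a rough illustration of those inspection functions, here is a minimal sketch. The data frame dfr is made-up example data, not something from this post.

```r
# A small data frame to inspect (made-up example data)
dfr <- data.frame(x = 1:3, y = c("a", "b", "c"))

class(dfr)               # what kind of object is it?
names(attributes(dfr))   # which attributes does it carry?
str(dfr)                 # compact display of the structure
summary(dfr)             # per-column summary
```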

The second concept is accessing variables by name; in R this means calling get or getAnywhere, the latter being used for functions that aren’t exported from a package namespace. (Start by using get; if that doesn’t work, try getAnywhere.)

#without reflection
mean(1:5) 

#with reflection
get("mean")(1:5)
#getAnywhere returns a list-like object; the matching objects are in its objs element
getAnywhere("mean")$objs[[1]](1:5)

The main use of this is with functions that return names of variables, like ls. For example, to retrieve every variable in your workspace in list form, at the top level you can use

lapply(ls(), get)
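Inside a function, get called via lapply may not look in the environment you expect, so a variant with an explicit environment can be safer. The sketch below uses mget for this; locals_as_list is a hypothetical helper invented for the illustration.

```r
# Hypothetical helper: collect a function's local variables as a named list
locals_as_list <- function()
{
  a <- 1
  b <- "two"
  # ls() is called directly in the function body, so it lists the locals
  # (vars itself is not included, because it does not exist yet)
  vars <- ls()
  # state explicitly which environment the variables should be fetched from
  mget(vars, envir = environment())
}

locals_as_list()
```

Unlike lapply(ls(), get), the result of mget is named, which makes it easier to see which value belongs to which variable.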

Again, it’s a tiny bit more complicated for S4 classes, which have a variety of extra functions for inspecting them. There’s getMethod, getSlots and a bunch of other functions. Try apropos("^get") to find them.
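Here is a small sketch of a few of those S4 inspection functions. The Person class and the greet generic are hypothetical, defined only so that there is something to inspect.

```r
library(methods)

# A hypothetical S4 class and generic, purely for illustration
setClass("Person", representation(name = "character", age = "numeric"))
setGeneric("greet", function(x, ...) standardGeneric("greet"))
setMethod("greet", "Person", function(x, ...) paste("Hello,", x@name))

getSlots("Person")                # slot names and their classes
existsMethod("greet", "Person")   # is there a method for this signature?
getMethod("greet", "Person")      # retrieve the method definition
showMethods("greet")              # list the methods of the generic
```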

The third concept is to evaluate code in string form. The mean example from above becomes

eval(parse(text = "mean(1:5)"))

A word of warning about this last concept. It is very powerful, but also one of the easiest ways to write completely incomprehensible buggy code. Don’t use it, except as a last resort.
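When the only thing held in a string is the name of a function, you can usually avoid evaluating strings altogether. A minimal sketch, where fn_name is just an illustrative variable:

```r
fn_name <- "mean"   # illustrative: the function name arrives as a string

# look the function up by name, then call it as usual
match.fun(fn_name)(1:5)

# or build the call from the name and a list of arguments
do.call(fn_name, list(1:5))
```

Both lines give the same answer as mean(1:5), and neither needs to parse any code.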

So there you have it, reflection in three easy steps.


Testing for valid variable names

4th July, 2011

I have something of a fondness for ridiculous variable names, so it’s useful to be able to check whether my latest concoction is legitimate. More so if it is automatically generated.

Not having an is_valid_variable_name function is one of those odd omissions from R, and the assign function doesn’t check validity.

To recap, there are a few rules on what makes a valid variable name.  From ?name

Names are limited to 10,000 bytes (and were to 256 bytes in versions of R before 2.13.0).

The logic for this is pretty easy to deal with, but before I come to that, a note on the structure of is* type functions. In scalary languages (C and its descendants), these functions seem to be standardised along the lines of

is_something <- function(x)
{
  if(!some_condition) return(FALSE)
  if(!some_other_condition) return(FALSE)
  #etc.
  return(TRUE)
}

The advantage of this is that as soon as a condition fails, the function returns, so the function can be fast. In vectory languages like R, things aren’t quite as clean since different elements can fail on different conditions. The nearest equivalent function structure that I’ve come up with is something like:

is_something <- function(x)
{
  ok <- rep.int(TRUE, length(x))
  ok[!some_condition] <- FALSE
  ok[!some_other_condition] <- FALSE
  #etc.
  ok
}
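To make the template concrete, here is a sketch with real conditions filled in. The function is_small_whole_number is a hypothetical example, not part of the variable name checker.

```r
# Hypothetical example: is each element a whole number between 1 and 10?
is_small_whole_number <- function(x)
{
  # start by assuming every element passes
  ok <- rep.int(TRUE, length(x))
  # then knock out the elements that fail each condition in turn
  ok[is.na(x)] <- FALSE
  ok[!is.na(x) & x != round(x)] <- FALSE
  ok[!is.na(x) & (x < 1 | x > 10)] <- FALSE
  ok
}

is_small_whole_number(c(3, 2.5, 11, NA))
# TRUE FALSE FALSE FALSE
```

Each element can fail for a different reason, but the result is still a single logical vector of the same length as the input.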

So, back to our is_valid_variable_name function. The first condition is easy to implement.

is_valid_variable_name <- function(x) 
{
  ok <- rep.int(TRUE, length(x))
  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE
  #More logic still to come
  ok
}

Now it gets trickier. In ?make.names we have

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.

When you read this, your first thought should be “regular expressions will save the day”. The trouble is, regular expressions that are that complicated are hard to write and hard to understand. Which means that you need *lots* of testing to make sure that they are correct.

In the spirit of laziness I decided to see if someone else had done the legwork. It transpires that someone has (yey CRAN). The MSToolkit package contains a function validNames which tries to solve the problem with one big regex. Unfortunately (as of version 2.0) it doesn’t always work. Here’s the regex that that function uses.

"^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"

That translates as: start with (“^”) a dot (“\\.”) that is optional (“?”), followed by a letter (“[a-zA-Z]”), then zero or more (“*”) dots, letters or numbers (“[\\.0-9a-zA-Z]”), then finish (“$”).

The first thing that pops into my mind when I see this is “what do French R programmers do?”. That is, we can define variables with accented characters áçöíþ <- 1 that the regex a-zA-Z won’t pick up. There’s an easy fix here that nearly always works. We swap 0-9a-zA-Z for [:alnum:] and voila! Locale dependent letter and number matching. This isn’t quite perfect since, for example, in my UK English locale, I can define variables with Greek letters like µ, but the “alpha” regex won’t match them.

grepl("[[:alpha:]]", "µ") # FALSE

Glossing over the small letter matching issues for now, there are bigger problems with the MSToolkit regex.

Underscores aren't permitted...

validNames("foo_bar")  #throws error

and neither are names consisting only of dots...

validNames("..")       #throws error

but many of the reserved words (see ?Reserved for the list) are:

validNames("if")       #TRUE
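The three results above can be reproduced without installing MSToolkit by applying the quoted regex directly with grepl. This is only a sketch: it assumes validNames does nothing beyond matching that pattern, and grepl returns FALSE where validNames throws an error.

```r
# the regex quoted from MSToolkit's validNames
rx <- "^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"

candidates <- c("foo_bar", "..", "if")
setNames(grepl(rx, candidates), candidates)
# foo_bar      ..      if
#   FALSE   FALSE    TRUE
```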

I don't want to discredit the authors of MSToolkit – writing complex regexes is a difficult task. What we need is an easier approach. Lots of smaller regexes for individual cases are easier to understand. One other tiny complication: the ellipsis argument, ..., and two dots followed by a number (which refers to the elements of the ellipsis) are valid variable names, but are reserved, so sometimes you want to think of them as valid, and sometimes you don't.

is_valid_variable_name <- function(x, allow_reserved = TRUE)  
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE     
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE   
  }

  #is it a reserved word?
  #(compare whole names; a regex search would also reject names like "information")
  reserved_words <- c("if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_")
  ok[x %in% reserved_words] <- FALSE
  
  #are there any illegal characters?
  ok[!grepl("^[[:alnum:]_.]+$", x)] <- FALSE

  #does it start with underscore?
  ok[grepl("^_", x)] <- FALSE

  #does it start with dot then a number?
  ok[grepl("^\\.[[:digit:]]", x)] <- FALSE

  ok
}

So now we have lots of easier conditions to check. I was pretty pleased with myself after constructing this until I realised that the best way to solve this was to cheat. make.names, that I mentioned earlier, contains logic to check for valid variable names, so if a variable name is valid, then x will be the same as make.names(x). As a bonus, we can easily check for unique variable names.

is_valid_variable_name <- function(x, allow_reserved = TRUE, unique = FALSE) 
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE     
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE   
  }

  #are names valid (and maybe unique)
  ok[x != make.names(x, unique = unique)] <- FALSE

  ok
}

While this answer isn't quite as satisfactory because you can't see what's going on, it has the advantages that the locale-dependent letter problem vanishes, and if the specification for variable names changes, then make.names will hopefully be updated to match it. And that makes it good enough for me.
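To see the comparison trick on its own, here is a short sketch that calls make.names directly on a few made-up candidate names.

```r
# made-up candidate names, some valid and some not
candidates <- c("x", ".x", "foo_bar", "..", ".2way", "if", "a b")

# make.names repairs invalid names, so valid names come back unchanged
fixed <- make.names(candidates)
data.frame(candidates, fixed, valid = candidates == fixed)

# with unique = TRUE, repeated names are altered too,
# so duplicates show up as mismatches
dupes <- c("a", "a")
dupes == make.names(dupes, unique = TRUE)
# TRUE FALSE
```

The first four candidates come back unchanged, while ".2way", "if" and "a b" are repaired and so flagged as invalid.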
