## The Stats Clinic

27th July, 2011

Here at HSL we have a lot of smart kinda-numerate people who have access to a lot of data. On a bad day, kinda-numerate includes myself, but in general I’m talking about scientists who have done an introductory stats course, but not much else. When all you have is a t-test, suddenly everything looks like two groups of normally distributed numbers whose means need testing for a significant difference.

While we have a pretty good cross-disciplinary setup here, the ease of calculating a mean here or a standard deviation there means that many scientists can’t resist a piece of the number crunching action. Then suddenly there’s an Excel monstrosity that nobody understands rearing its ugly head.

Management has enlightenedly decided to fund a stats clinic, so us number nerds can help out the rest of the lab without any paperwork overhead (which was the biggest reason to put off asking for help). They didn’t like my slogan, but hey, you can’t have everything.

I’m really interested to hear how other organisations deal with this issue. Let me know in the comments.

## The method in the mirror: reflection in R

Reflection is a programming concept that sounds scarier than it is. There are three related concepts that fall under the umbrella of reflection, and I’ll be surprised if you haven’t come across most of these code ideas already, even if you didn’t know it was called reflection.

The first concept is examination of your variables. In R, this mostly means calling `class`, `attributes`, `str` and `summary`. S4 classes also have the function `showMethods` to, ahem, show their methods.
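As a quick illustration of this first concept, here is a minimal sketch. The data frame `measurements` and its contents are made up for the example.

```r
# A small data frame to inspect; the name and values are invented for illustration
measurements <- data.frame(id = 1:3, weight = c(61.5, 72.0, 68.3))

class(measurements)       # the class of the object: "data.frame"
attributes(measurements)  # its names, class and row.names attributes
str(measurements)         # a compact display of the internal structure
summary(measurements)     # per-column summary statistics
```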

The second concept is accessing variables by name; in R this means calling `get` or `getAnywhere`, the latter being used for functions that aren’t exported from a package namespace. (Start by using `get`; if that doesn’t work, try `getAnywhere`.)

```r
#without reflection
mean(1:5)

#with reflection
get("mean")(1:5)
#getAnywhere returns a list of matches; the functions themselves are in $objs
getAnywhere("mean")$objs[[1]](1:5)
```

The main use of this is with functions that return the names of variables, like `ls`. For example, to retrieve every variable in your workspace in list form, use

```r
lapply(ls(), get)
```
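One caveat: the list returned by `lapply(ls(), get)` is unnamed, and inside a function `get` may not search the frame you had in mind. A possible alternative is base R’s `mget`, which returns a named list. The function name `list_locals` below is made up for illustration.

```r
list_locals <- function()
{
  a <- 1
  b <- "two"
  # Look up every name found by ls() in this function's own environment
  mget(ls(), envir = environment())
}

list_locals()  # a named list with elements a and b
```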

Again, it’s a tiny bit more complicated for S4 classes, which have a variety of extra functions for inspecting them. There’s `getMethod`, `getSlots` and a bunch of other functions. Try `apropos("^get")` to find them.
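For a feel of how the S4 inspection functions behave, here is a small sketch. The class `Sample` and the generic `describe` are hypothetical, defined only so that there is something to inspect.

```r
library(methods)

# A made-up S4 class, generic and method, purely for illustration
setClass("Sample", representation(id = "character", mass = "numeric"))
setGeneric("describe", function(x, ...) standardGeneric("describe"))
setMethod("describe", "Sample", function(x, ...) paste("Sample", x@id))

getSlots("Sample")                  # slot names and the class of each slot
existsMethod("describe", "Sample")  # is a method defined for this class?

# Retrieve the method itself, then call it like any other function
describe_sample <- getMethod("describe", "Sample")
describe_sample(new("Sample", id = "A1", mass = 2.5))
```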

The third concept is to evaluate code in string form. The mean example from above becomes

```r
eval(parse(text = "mean(1:5)"))
```

A word of warning about this last concept. It is very powerful, but also one of the easiest ways to write completely incomprehensible buggy code. Don’t use it, except as a last resort.
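When the only dynamic part is the name of the function, you can often avoid evaluating strings altogether. A minimal sketch using base R’s `match.fun` and `do.call` (the variable `function_name` is just for illustration):

```r
function_name <- "mean"

# Look the function up by name, then call it as usual
match.fun(function_name)(1:5)

# Or build the call from a function name and a list of arguments
do.call(function_name, list(1:5))
```

Both calls give the same result as `mean(1:5)`, without any code being parsed from text.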

So there you have it, reflection in three easy steps.

## Testing for valid variable names

I have something of a fondness for ridiculous variable names, so it’s useful to be able to check whether my latest concoction is legitimate. More so if it is automatically generated.

Not having an `is_valid_variable_name` function is one of those odd omissions from R, and the `assign` function doesn’t check validity.

To recap, there are a few rules on what makes a valid variable name. From `?name`:

> Names are limited to 10,000 bytes (and were to 256 bytes in versions of R before 2.13.0).

The logic for this is pretty easy to deal with, but before I come to that, a note on the structure of `is*` type functions. In scalary languages (C and its descendants), these functions seem to be standardised along the lines of

```r
is_something <- function(x)
{
  if(!some_condition) return(FALSE)
  if(!some_other_condition) return(FALSE)
  #etc.
  return(TRUE)
}
```

The advantage of this is that as soon as a condition fails, the function returns, so the function can be fast. In vectory languages like R, things aren’t quite as clean since different elements can fail on different conditions. The nearest equivalent function structure that I’ve come up with is something like:

```r
is_something <- function(x)
{
  #start by assuming that every element passes
  ok <- rep.int(TRUE, length(x))
  ok[!some_condition] <- FALSE
  ok[!some_other_condition] <- FALSE
  #etc.
  ok
}
```
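To make the skeleton concrete, here is a toy instance of the same pattern. The function `is_short_lower_case` and its `max_length` argument are invented for this example.

```r
is_short_lower_case <- function(x, max_length = 5L)
{
  # Start by assuming every element passes
  ok <- rep.int(TRUE, length(x))
  # Knock out the elements that fail each condition in turn
  ok[nchar(x) > max_length] <- FALSE
  ok[x != tolower(x)] <- FALSE
  ok
}

is_short_lower_case(c("abc", "Abc", "abcdefgh"))  # TRUE FALSE FALSE
```

Each element can fail for its own reason, and every condition is evaluated over the whole vector rather than returning early.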

So, back to our `is_valid_variable_name` function. The first condition is easy to implement.

```r
is_valid_variable_name <- function(x)
{
  ok <- rep.int(TRUE, length(x))
  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE
  #More logic still to come
  ok
}
```

Now it gets trickier. In `?make.names` we have:

> A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as `".2way"` are not valid, and neither are the reserved words.

When you read this, your first thought should be “regular expressions will save the day”. The trouble is, regular expressions that are that complicated are hard to write and hard to understand. Which means that you need *lots* of testing to make sure that they are correct.

In the spirit of laziness I decided to see if someone else had done the legwork. It transpires that someone has (yey CRAN). The `MSToolkit` package contains a function `validNames` which tries to solve the problem with one big regex. Unfortunately (as of version 2.0) it doesn’t always work. Here’s the regex that that function uses.

```r
"^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"
```

That translates as: start with (“^”) a dot (“\\.”) that is optional (“?”), followed by a letter (“[a-zA-Z]”), then zero or more (“*”) dots, letters or numbers (“[\\.0-9a-zA-Z]”), then finish (“$”).
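To check that reading of the pattern, you can try it directly with base R’s `grepl`, without going through MSToolkit. The variable names `mstoolkit_pattern` and `candidates` are just for illustration.

```r
# The pattern quoted above, applied with base R's grepl
mstoolkit_pattern <- "^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"
candidates <- c("x", ".x", "x.1", "1x", ".1x")

# Names must start with a letter, or with a dot followed by a letter
grepl(mstoolkit_pattern, candidates)  # TRUE TRUE TRUE FALSE FALSE
```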

The first thing that pops into my mind when I see this is “what do French R programmers do?”. That is, we can define variables with accented characters, such as `áçöíþ <- 1`, that the regex `a-zA-Z` won’t pick up. There’s an easy fix here that nearly always works. We swap `0-9a-zA-Z` for `[:alnum:]` and voilà! Locale-dependent letter and number matching. This isn’t quite perfect since, for example, in my UK English locale, I can define variables with Greek letters like `µ`, but the “alpha” regex won’t match them.

```r
grepl("[[:alpha:]]", "µ") # FALSE
```
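For reference, here is one way the swapped-in pattern might look. This is my own sketch of the substitution described above, not code from MSToolkit, and `locale_aware_pattern` is a made-up name.

```r
# Same structure as the MSToolkit pattern, but using POSIX character classes
locale_aware_pattern <- "^[.]?[[:alpha:]][[:alnum:].]*$"

grepl(locale_aware_pattern, c("x1", ".x", "1x"))  # TRUE TRUE FALSE

# Whether accented names match depends on your locale and platform
grepl(locale_aware_pattern, "\u00e9t\u00e9")
```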

Glossing over the small letter matching issues for now, there are bigger problems with the MSToolkit regex.

Underscores aren’t permitted…

```r
validNames("foo_bar")  #throws error
```

and neither are names consisting only of dots…

```r
validNames("..")       #throws error
```

but many of the reserved words (see `?Reserved` for the list) are:

```r
validNames("if")       #TRUE
```

I don’t want to discredit the authors of MSToolkit – writing complex regexes is a difficult task. What we need is an easier approach. Lots of smaller regexes for individual cases are easier to understand. One other tiny complication: the ellipsis argument, `...`, and two dots followed by a number (which refers to the elements of the ellipsis) are valid variable names, but are reserved, so sometimes you want to think of them as valid, and sometimes you don’t.

```r
is_valid_variable_name <- function(x, allow_reserved = TRUE)
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE
  }

  #is it a reserved word? (exact matches only, so "iffy" is still fine)
  reserved_words <- c("if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_")
  ok[x %in% reserved_words] <- FALSE

  #are there any illegal characters?
  ok[!grepl("^[[:alnum:]_.]+$", x)] <- FALSE

  #does it start with an underscore?
  ok[grepl("^_", x)] <- FALSE

  #does it start with a dot followed by a number?
  ok[grepl("^\\.[[:digit:]]", x)] <- FALSE

  ok
}
```

So now we have lots of easier conditions to check. I was pretty pleased with myself after constructing this until I realised that the best way to solve this was to cheat. `make.names`, which I mentioned earlier, contains logic to check for valid variable names, so if a variable name is valid, then `x` will be the same as `make.names(x)`. As a bonus, we can easily check for `unique` variable names.

```r
is_valid_variable_name <- function(x, allow_reserved = TRUE, unique = FALSE)
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE
  }

  #are names valid (and maybe unique)
  ok[x != make.names(x, unique = unique)] <- FALSE

  ok
}
```
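To see why comparing `x` with `make.names(x)` does the job, it helps to look at what `make.names` does to a few names. The vectors `candidates` and `fixed` are just for illustration.

```r
candidates <- c("foo_bar", ".2way", "if", "a b")
fixed <- make.names(candidates)

fixed                # "foo_bar" "X.2way"  "if."     "a.b"
candidates == fixed  # TRUE FALSE FALSE FALSE

# With unique = TRUE, duplicates are altered too, so they fail the comparison
make.names(c("a", "a"), unique = TRUE)  # "a"   "a.1"
```

Valid names come back unchanged, while anything invalid or reserved is modified, so a simple inequality test picks out the failures.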

While this answer isn’t quite as satisfactory because you can’t see what’s going on, it has the advantages that the locale-dependent letter problem vanishes, and if the specification for variable names changes, then `make.names` will hopefully be updated to match it. And that makes it good enough for me.