Posts Tagged ‘regex’

Regular expressions for everyone else

25th September, 2014

Regular expressions are an amazing tool for working with character data, but they are also painful to read and write.  Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.

Consequently, I’ve created the regex package for human-readable regular expression generation.  It’s currently only on GitHub (CRAN version arriving as soon as you give me feedback), so you can get it with:

library(devtools)
install_github("regex", "richierocks")

Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:

I need ls, with a pattern that matches punctuation.  So I open the ?regex help page and look for the character class for punctuation.  My first attempt is then:

ls(baseenv(), pattern = "[:punct:]")

Ok, wait, the class has to be wrapped in square brackets itself.

ls(baseenv(), pattern = "[[:punct:]]")

Better, but that’s matching S3 classes and some functions too.  I want to match only where there’s punctuation at the start.  What’s the anchor for the start?  Back to reading ?regex.  Sod it, there’s too much text here; it’s probably a dollar sign.

ls(baseenv(), pattern = "$[[:punct:]]")

Hmm, nope.  Must be a caret.

ls(baseenv(), pattern = "^[[:punct:]]")

Hurrah!  Still, it took me 5 minutes for a simple example.  For something more complicated like matching email addresses or telephone numbers or particular time formats, building regular expressions this way can become time consuming and frustrating.  Here’s the equivalent syntax using regex.

ls(baseenv(), pattern = START %c% punct())

START is just a constant that returns a caret.  The %c% operator is a wrapper to paste0, and punct is a function returning a group of punctuation.  You can pass it arguments to match multiple punctuation characters.  For example, punct(3, 5) matches between 3 and 5 punctuation characters.
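To make that description concrete, here is a minimal sketch of how such building blocks could be written in base R.  These are hypothetical stand-ins for illustration only, not the package's actual source, and the argument names lo and hi are my own invention.

```r
# Hypothetical stand-ins, NOT the regex package's real implementation.

# A constant holding the start-of-string anchor
START <- "^"

# An operator that glues two pattern fragments together with paste0
`%c%` <- function(x, y) paste0(x, y)

# A punctuation character class, optionally repeated lo to hi times
# (lo and hi are assumed argument names)
punct <- function(lo, hi) {
  cls <- "[[:punct:]]"
  if (missing(lo)) return(cls)
  if (missing(hi)) hi <- lo
  paste0(cls, "{", lo, ",", hi, "}")
}

START %c% punct()   # "^[[:punct:]]"
punct(3, 5)         # "[[:punct:]]{3,5}"
```

The point of the design is that every piece is just a string, so the result can be passed straight to ls, grepl and friends.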

You also get lower-level functions.  punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).

As a more complicated example, you can match an email address like:

one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c%
  "@" %c%
  one_or_more(group(ASCII_ALNUM %c% ".-")) %c%
  DOT %c%
  ascii_alpha(2, 4)

This reads: match one or more letters, numbers, dots, underscores, percents, plusses or hyphens.  Then match an ‘at’ symbol.  Then match one or more letters, numbers, dots, or hyphens.  Then match a dot.  Then match two to four letters.
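For comparison, the same description can be transcribed by hand into a plain regular expression.  The sketch below assumes that ASCII_ALNUM stands for a-zA-Z0-9 and that ascii_alpha matches a-zA-Z.  The pattern is my own transcription rather than output from the package, and the test strings are made up.

```r
# Hand-written transcription of the description above (an assumption,
# not output generated by the regex package).
email_pattern <- paste0(
  "[a-zA-Z0-9._%+-]+",  # letters, numbers, dots, underscores, percents, plusses, hyphens
  "@",                  # an 'at' symbol
  "[a-zA-Z0-9.-]+",     # letters, numbers, dots or hyphens
  "\\.",                # a literal dot
  "[a-zA-Z]{2,4}"       # two to four letters
)

# Made-up test strings
candidates <- c("someone@example.com", "no.at.symbol.example.com", "user@localhost")
is_email <- grepl(email_pattern, candidates)
is_email   # TRUE FALSE FALSE
```

Like the original, this pattern has no start or end anchors, so it finds an address anywhere inside a longer string.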

There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation.  I’m already rather excited about how much easier regular expressions have become for me to use.

Testing for valid variable names

4th July, 2011

I have something of a fondness for ridiculous variable names, so it’s useful to be able to check whether my latest concoction is legitimate. More so if it is automatically generated.

Not having an is_valid_variable_name function is one of those odd omissions from R, and the assign function doesn’t check validity.

To recap, there are a few rules on what makes a valid variable name.  From ?name

Names are limited to 10,000 bytes (and were to 256 bytes in versions of R before 2.13.0).

The logic for this is pretty easy to deal with, but before I come to that, a note on the structure of is* type functions. In scalary languages (C and its descendants), these functions seem to be standardised along the lines of

is_something <- function(x)
{
  if(!some_condition) return(FALSE)
  if(!some_other_condition) return(FALSE)
  #etc.
  return(TRUE)
}

The advantage of this is that as soon as a condition fails, the function returns, so the function can be fast. In a vectory language like R, things aren’t quite as clean since different elements can fail on different conditions. The nearest equivalent function structure that I’ve come up with is something like:

is_something <- function(x)
{
  ok <- rep.int(TRUE, length(x))
  ok[!some_condition] <- FALSE
  ok[!some_other_condition] <- FALSE
  #etc.
  ok
}
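As a concrete instance of that template, here is a small sketch.  The function name and its conditions are invented purely for illustration; the only point is that each element of x can fail on a different condition.

```r
# Illustrative example only: is_positive_even is a made-up function.
is_positive_even <- function(x) {
  # start by assuming every element passes
  ok <- rep.int(TRUE, length(x))
  # missing values fail
  ok[is.na(x)] <- FALSE
  # non-positive values fail
  ok[!is.na(x) & x <= 0] <- FALSE
  # odd values fail
  ok[!is.na(x) & x %% 2 != 0] <- FALSE
  ok
}

is_positive_even(c(2, -4, 3, NA, 10))   # TRUE FALSE FALSE FALSE TRUE
```

Unlike the early-return version, every condition is evaluated for every element, which is the price paid for handling the whole vector at once.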

So, back to our is_valid_variable_name function. The first condition is easy to implement.

is_valid_variable_name <- function(x) 
{
  ok <- rep.int(TRUE, length(x))
  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE
  #More logic still to come
}

Now it gets trickier. In ?make.names we have

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.

When you read this, your first thought should be “regular expressions will save the day”. The trouble is, regular expressions that are that complicated are hard to write and hard to understand. Which means that you need *lots* of testing to make sure that they are correct.

In the spirit of laziness I decided to see if someone else had done the legwork. It transpires that someone has (yey CRAN). The MSToolkit package contains a function validNames which tries to solve the problem with one big regex. Unfortunately (as of version 2.0) it doesn’t always work. Here’s the regex that that function uses.

"^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"

That translates as: start with (“^”) a dot (“\\.”) that is optional (“?”), followed by a letter (“[a-zA-Z]”), then zero or more (“*”) dots, letters or numbers (“[\\.0-9a-zA-Z]”), then finish (“$”).

The first thing that pops into my mind when I see this is “what do French R programmers do?”. That is, we can define variables with accented characters áçöíþ <- 1 that the regex a-zA-Z won’t pick up. There’s an easy fix here that nearly always works. We swap 0-9a-zA-Z for [:alnum:] and voila! Locale-dependent letter and number matching. This isn’t quite perfect since, for example, in my UK English locale, I can define variables with Greek letters µ but the “alpha” regex won’t match them.

grepl("[[:alpha:]]", "µ") # FALSE

Glossing over the small letter matching issues for now, there are bigger problems with the MSToolkit regex.

Underscores aren’t permitted…

validNames("foo_bar")  #throws error

and neither are names consisting only of dots…

validNames("..")       #throws error

but many of the reserved words (see ?Reserved for the list) are:

validNames("if")       #TRUE
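These three cases can be reproduced without installing MSToolkit by applying the quoted pattern directly with grepl.  This is only a sketch of the regex's behaviour: where validNames throws an error, grepl simply returns FALSE.

```r
# The pattern quoted above, applied with base R only
pattern <- "^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"
candidates <- c("foo", ".foo", "foo_bar", "..", "if")

matches <- grepl(pattern, candidates)
names(matches) <- candidates
matches
#     foo    .foo foo_bar      ..      if
#    TRUE    TRUE   FALSE   FALSE    TRUE
```

The underscore and the dots-only name are rejected even though both are legal in R, while the reserved word if slips through.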

I don’t want to discredit the authors of MSToolkit – writing complex regexes is a difficult task. What we need is an easier approach. Lots of smaller regexes for individual cases are easier to understand. One other tiny complication: the ellipsis argument, ..., and two dots followed by a number (which refers to the elements of the ellipsis) are valid variable names, but are reserved, so sometimes you want to think of them as valid, and sometimes you don’t.

is_valid_variable_name <- function(x, allow_reserved = TRUE)  
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE     
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE   
  }

  #is it a reserved word?
  reserved_words <- c("if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_")
  #exact matches only, so names that merely contain a reserved word stay valid
  ok[x %in% reserved_words] <- FALSE
  
  #are there any illegal characters?
  ok[!grepl("^[[:alnum:]_.]+$", x)] <- FALSE

  #does it start with underscore?
  ok[grepl("^_", x)] <- FALSE

  #does it start with dot then a number?
  ok[grepl("^\\.[[:digit:]]", x)] <- FALSE

  ok
}
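One benefit of splitting the logic up like this is that each small pattern can be checked in isolation.  A quick sketch, using test strings that I have made up:

```r
# Two dots followed by one or more digits (elements of the ellipsis)
is_dot_dot_number <- grepl("^\\.{2}[[:digit:]]+$", c("..1", "..12", "...", "..a"))
is_dot_dot_number       # TRUE TRUE FALSE FALSE

# Anything other than letters, numbers, underscores and dots is illegal
has_illegal_chars <- !grepl("^[[:alnum:]_.]+$", c("foo_bar", "foo.bar", "foo bar", "foo-bar"))
has_illegal_chars       # FALSE FALSE TRUE TRUE

# A leading dot followed by a digit
starts_dot_digit <- grepl("^\\.[[:digit:]]", c(".2way", ".way", "..2"))
starts_dot_digit        # TRUE FALSE FALSE
```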

So now we have lots of easier conditions to check. I was pretty pleased with myself after constructing this until I realised that the best way to solve this was to cheat. make.names, which I mentioned earlier, contains logic to check for valid variable names, so if a variable name is valid, then x will be the same as make.names(x). As a bonus, we can easily check for unique variable names.

is_valid_variable_name <- function(x, allow_reserved = TRUE, unique = FALSE) 
{
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved)
  {
    ok[x == "..."] <- FALSE     
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE   
  }

  #are names valid (and maybe unique)
  ok[x != make.names(x, unique = unique)] <- FALSE

  ok
}
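The comparison at the heart of this version is easy to try on its own.  In this sketch the example names are made up; make.names repairs anything invalid, so a name is valid exactly when it comes back unchanged.

```r
x <- c("foo", "foo_bar", ".2way", "if", "a b")

# make.names alters the invalid entries and leaves valid ones alone
repaired <- make.names(x)
is_valid <- x == repaired
is_valid    # TRUE TRUE FALSE FALSE FALSE

# With unique = TRUE, repeated names are altered as well
dupes <- c("a", "a")
is_valid_and_unique <- dupes == make.names(dupes, unique = TRUE)
is_valid_and_unique   # TRUE FALSE
```

Note that with unique = TRUE only the second and later occurrences of a duplicated name are flagged, since the first one is left as it was.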

While this answer isn’t quite as satisfactory because you can’t see what’s going on, it has the advantages that the locale-dependent letter problem vanishes, and if the specification for variable names changes, then make.names will hopefully be updated to match it. And that makes it good enough for me.
