Regular expressions for everyone else

25th September, 2014

Regular expressions are an amazing tool for working with character data, but they are also painful to read and write.  Even after years of working with them, I struggle to remember the syntax for negative lookahead, or which way round the start and end anchor symbols go.

Consequently, I’ve created the regex package for generating human-readable regular expressions.  It’s currently only on GitHub (a CRAN version will arrive as soon as you give me feedback), so you can get it with:

# install_github() comes from the devtools package
library(devtools)
install_github("richierocks/regex")

Before, if I wanted to find the names of all the operators in the base package, my workflow would be something like:

I need ls, with a pattern that matches punctuation.  So I open the ?regex help page and look for the character class for punctuation.  My first attempt is then:

ls(baseenv(), pattern = "[:punct:]")

Ok, wait, the class has to be wrapped in square brackets itself.

ls(baseenv(), pattern = "[[:punct:]]")

Better, but that’s matching S3 methods and some other functions too.  I want to match only where there’s punctuation at the start.  What’s the anchor for the start?  Back to reading ?regex.  Sod it, there’s too much text here; it’s probably a dollar sign.

ls(baseenv(), pattern = "$[[:punct:]]")

Hmm, nope.  Must be a caret.

ls(baseenv(), pattern = "^[[:punct:]]")
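To see why each attempt behaved the way it did, here is a small base R sketch that runs the four patterns through grepl.  The vector x is a handful of made-up names chosen for illustration, not the real contents of the base package.

```r
# Illustrative names only; not the actual output of ls(baseenv()).
x <- c("+", "%in%", "print.default", "sum", "a:b")

# Attempt 1: without the outer brackets this is just the set of
# literal characters ':', 'p', 'u', 'n', 'c' and 't'.
first <- grepl("[:punct:]", x)
first   # FALSE TRUE TRUE TRUE TRUE

# Attempt 2: the class wrapped in a bracket list matches punctuation anywhere.
second <- grepl("[[:punct:]]", x)
second  # TRUE TRUE TRUE FALSE TRUE

# Attempt 3: '$' anchors the end of the string, so nothing can follow it.
third <- grepl("$[[:punct:]]", x)
third   # FALSE FALSE FALSE FALSE FALSE

# Attempt 4: '^' anchors the start, so only names beginning with punctuation match.
fourth <- grepl("^[[:punct:]]", x)
x[fourth]  # "+" "%in%"
```

Note that the first attempt does not fail with an error; it quietly matches the wrong thing, which is part of what makes this kind of mistake slow to spot.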

Hurrah!  Still, it took me five minutes for a simple example.  For something more complicated, like matching email addresses, telephone numbers or particular time formats, building regular expressions this way can become time-consuming and frustrating.  Here’s the equivalent syntax using regex.

ls(baseenv(), pattern = START %c% punct())

START is just a constant that returns a caret.  The %c% operator is a wrapper around paste0, and punct is a function returning a punctuation character class.  You can pass it arguments to match multiple punctuation characters.  For example, punct(3, 5) matches between 3 and 5 punctuation characters.

You also get lower-level functions.  punct(3, 5) is a convenience wrapper for repeated(group(PUNCT), 3, 5).
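If you are curious how such building blocks could fit together, here is a minimal sketch in base R.  These definitions are hypothetical re-implementations written from the description above, purely for illustration; the package’s real source may well differ (in particular, the value of PUNCT and the handling of missing arguments are assumptions).

```r
# Hypothetical stand-ins for the package's constants and functions.
START <- "^"            # anchor for the start of the string
PUNCT <- "[:punct:]"    # assumed: the bare POSIX class name

# Concatenation operator: a thin wrapper around paste0.
`%c%` <- function(x, y) paste0(x, y)

# Wrap a set of characters in a bracket list.
group <- function(x) paste0("[", x, "]")

# Append a {lo,hi} quantifier to a pattern.
repeated <- function(x, lo, hi) paste0(x, "{", lo, ",", hi, "}")

# Convenience wrapper: a single punctuation character, or lo to hi of them.
punct <- function(lo, hi) {
  if (missing(lo)) {
    return(group(PUNCT))
  }
  repeated(group(PUNCT), lo, hi)
}

START %c% punct()   # "^[[:punct:]]"
punct(3, 5)         # "[[:punct:]]{3,5}"
```

The point of the sketch is that everything is ordinary string building: each function returns a character string, so the result can be passed straight to ls, grepl or any other function that takes a pattern.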

As a more complicated example, you can match an email address like:

one_or_more(group(ASCII_ALNUM %c% "._%+-")) %c%
  "@" %c%
  one_or_more(group(ASCII_ALNUM %c% ".-")) %c%
  DOT %c%
  ascii_alpha(2, 4)

This reads as: “Match one or more letters, numbers, dots, underscores, percents, plusses or hyphens.  Then match an ‘at’ symbol.  Then match one or more letters, numbers, dots, or hyphens.  Then match a dot.  Then match two to four letters.”
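For comparison, here is roughly what that description looks like written by hand as a raw regular expression in base R.  This is a sketch that assumes ASCII_ALNUM corresponds to a-zA-Z0-9 and ascii_alpha to a-zA-Z; the string the package actually generates may be spelled differently.  The test addresses are made up.

```r
# Hand-written equivalent of the description above (assumed, not package output).
email_pattern <- paste0(
  "[a-zA-Z0-9._%+-]+",   # one or more letters, numbers, dots, underscores, percents, plusses or hyphens
  "@",                   # an 'at' symbol
  "[a-zA-Z0-9.-]+",      # one or more letters, numbers, dots or hyphens
  "\\.",                 # a literal dot
  "[a-zA-Z]{2,4}"        # two to four letters
)

# Made-up inputs for illustration.
candidates <- c("someone@example.com", "not an email", "user@localhost")
is_email <- grepl(email_pattern, candidates)
is_email  # TRUE FALSE FALSE
```

The pattern is not anchored, so it will also find an address embedded in a longer string; add a start and end anchor if you want to validate the whole string.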

There are also functions for tokenising, capturing, and lookahead/lookbehind, and an operator for alternation.  I’m already rather excited about how much easier regular expressions have become for me to use.