
The Joy of Regular Expressions November 21, 2007

Posted by Phill in General J2EE.

You may be aware of the following quote by Jamie Zawinski:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Regular expressions, or regexes (or whatever you call them), do have that reputation. I think it’s ill-deserved: they can be an immensely valuable tool. Input validation, such as checking the format of an email address, would be far more difficult without them.

Let me give you an example of how a well-used regex can make validation easier.

Suppose you want to validate a username in a web application. You want to make sure that:

  1. it’s between 5 and 10 characters long;
  2. it starts with a letter, and after that can contain letters, numbers and the characters “.”, “-” and “_” (period, hyphen and underscore);
  3. it must not have two periods, hyphens or underscores in a row (e.g., “some..user” is not allowed).

All this validation can be accomplished by one regex: ^(?!.*[-_.]{2})[a-zA-Z][\w.-]{4,9}$
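To see the regex at work, here is a minimal Java sketch. The class name `UsernameValidator`, the method `isValid` and the sample usernames are made up for illustration; only the pattern itself comes from this post. Note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

final class UsernameValidator {
    // The regex from the post, compiled once and reused.
    // "\\w" is the Java string-literal spelling of the regex token \w.
    private static final Pattern USERNAME =
            Pattern.compile("^(?!.*[-_.]{2})[a-zA-Z][\\w.-]{4,9}$");

    // Hypothetical helper: true if the candidate satisfies all three rules.
    static boolean isValid(String candidate) {
        return candidate != null && USERNAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"phill", "some.user", "some..user", "1phill", "abcd", "abcdefghijk"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because `matches()` already requires the whole input to match, the `^` and `$` anchors are redundant here, but keeping them means the same pattern also works unchanged with `find()`.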

Let’s just break this down to see what’s happening.

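(?!.*[-_.]{2}) performs a “zero-width negative lookahead” (more on that later): starting from the beginning of the username, it scans ahead without consuming any characters and fails the match if two hyphens, periods or underscores appear side by side anywhere (including mixed pairs such as “.-”), satisfying rule 3.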
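The “zero-width” part is easier to believe when you run the lookahead on its own. The following is a small Java sketch under my own naming (`LookaheadDemo`, `passesRule3` and `matchLength` are hypothetical); it uses only the lookahead fragment, anchored at the start of the string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Only the lookahead from the post, anchored at the start of the input.
    private static final Pattern NO_DOUBLE_PUNCTUATION =
            Pattern.compile("^(?!.*[-_.]{2})");

    // True if no two characters from the set "-", "_", "." are adjacent.
    static boolean passesRule3(String s) {
        // find() rather than matches(): a successful match here is empty.
        return NO_DOUBLE_PUNCTUATION.matcher(s).find();
    }

    // Length of the matched text, or -1 if the lookahead rejected the input.
    static int matchLength(String s) {
        Matcher m = NO_DOUBLE_PUNCTUATION.matcher(s);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(passesRule3("some.user"));  // true
        System.out.println(passesRule3("some..user")); // false
        System.out.println(matchLength("some.user"));  // 0
    }
}
```

The match length of 0 is the point: the lookahead inspects the whole string but consumes nothing, so the rest of the regex still starts matching at the first character.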

[a-zA-Z] ensures that the username starts with a letter, satisfying the first part of rule 2.

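[\w.-]{4,9} ensures that the initial letter is followed by 4 to 9 further characters, each of which is a letter, digit or underscore (all covered by \w), a period or a hyphen. That satisfies the second part of rule 2 and, counting the initial letter, gives the total length of 5 to 10 required by rule 1.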

Now, the reason I’m writing this post today is that yesterday we had to perform some very similar validation, and I’d never even heard of a “zero-width negative lookahead”. Believe me, without zero-width assertions it would be almost impossible to check all three rules in a single regex. If you haven’t read up on them yet, I would definitely encourage you to have a read through the section on zero-width assertions on the Regex Tutorial.
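For comparison, here is roughly what the same validation looks like in Java without a lookahead. This is only a sketch of one possible alternative, not what we actually wrote, and the names (`TwoStepUsernameCheck`, `SHAPE`, `DOUBLE_PUNCTUATION`) are invented for the example. Rule 3 has to move out of the main pattern into a second, separate check.

```java
import java.util.regex.Pattern;

final class TwoStepUsernameCheck {
    // Rules 1 and 2: a letter, then 4 to 9 letters, digits, "_", "." or "-".
    private static final Pattern SHAPE =
            Pattern.compile("^[a-zA-Z][\\w.-]{4,9}$");

    // Rule 3, inverted: finds two adjacent characters from "-", "_", ".".
    private static final Pattern DOUBLE_PUNCTUATION =
            Pattern.compile("[-_.]{2}");

    static boolean isValid(String candidate) {
        if (candidate == null) {
            return false;
        }
        return SHAPE.matcher(candidate).matches()
                && !DOUBLE_PUNCTUATION.matcher(candidate).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("some.user"));  // true
        System.out.println(isValid("some..user")); // false
    }
}
```

It works, but the rules now live in two patterns plus some glue code, which is awkward anywhere a single pattern string is expected, such as a declarative validation config.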
