Thursday 26 August 2010

Regular Expressions

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
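
As a minimal sketch of the comparison above, the pattern can be tried with Python's standard re module. Python and the file names are assumptions made for illustration; the post itself does not name a language.

```python
import re

# Hypothetical file names, chosen only for illustration.
names = ["notes.txt", "notes.txt.bak", "readme.md", "txt"]

# .* is any run of characters, \. is a literal dot, $ anchors the end.
pattern = re.compile(r".*\.txt$")

text_files = [name for name in names if pattern.search(name)]
print(text_files)
```

Only names that end in .txt survive the filter, which is what the wildcard *.txt would select.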

Some Definitions

We are going to be using the terms literal, metacharacter, escape sequence, target string and search expression in this overview. Here is a definition of our terms:

literal A literal is any character we use in a search or matching expression. For example, to find ind in windows, ind is a literal string: each character plays a part in the search, and it is literally the string we want to find.

metacharacter A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression, for example, the character ^ (circumflex or caret) is a metacharacter.

escape sequence An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a regular expression an escape sequence involves placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. For example, if we want to find ^ind in w^indow then we use the search expression \^ind. If we want to find \\file in the string c:\\file then we need the search expression \\\\file, because each literal \ we want to find must itself be preceded by an escaping \.

target string This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.

search expression This term describes the expression that we will be using to search our target string, that is, the pattern we use to find what we want.
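
The escape sequence examples above can be checked with a short Python sketch. The choice of Python is an assumption; raw strings (r"...") are used so that Python's own string escaping does not get mixed up with the regex escaping.

```python
import re

# \^ makes the caret a literal instead of a metacharacter.
caret_match = re.search(r"\^ind", "w^indow")

# The target contains two literal backslashes, so the search
# expression needs four: one escaping backslash per literal one.
target = r"c:\\file"
backslash_match = re.search(r"\\\\file", target)

# re.escape builds the escaped form of a literal string for us.
escaped = re.escape("^ind")

print(caret_match.group())
print(backslash_match.group())
print(escaped)
```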

Brackets, Ranges and Negation

Bracket expressions introduce our first metacharacters, in this case the square brackets, which allow us to define a list of characters to test for rather than a single literal character. These lists can be grouped into what are known as Character Classes, typically comprising well-known groups such as all digits.

[ ] Match anything inside the square brackets for one character position once and only once, for example, [12] means match the target to either 1 or 2 while [0123456789] means match to any character in the range 0 to 9.

- The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].

You can define more than one range inside a list e.g. [0-9A-C] means check for 0 to 9 and A to C (but not a to c).

NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.

^ The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.

NOTE: Spaces, or in this case the lack of them, between ranges are very important. A space inside the brackets is itself a literal, so [0-9 A-C] would also match a space.
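
The bracket rules above can be sketched in Python with re.findall, which returns every single-character match. The sample strings are made up for illustration.

```python
import re

# Two ranges in one list: 0 to 9 and A to C, but not a to c.
mixed = re.findall(r"[0-9A-C]", "aA1cC9dD")

# A dash placed first in the list is a literal dash.
with_dash = re.findall(r"[-0-9]", "tel: 555-01")

# A leading ^ negates the list: anything except F or f.
not_f = re.findall(r"[^Ff]", "Fluffy")

# A space inside the brackets is a literal and will be matched too.
with_space = re.findall(r"[0-9 A-C]", "1 A")

print(mixed, with_dash, not_f, with_space)
```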

Positioning (or Anchors)

^ The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string, for example, in a target string that begins with Mozilla and contains Windows further along, ^Win will not find Windows but ^Moz will find Mozilla.

$ The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.

. The . (period) means any single character in this position, for example, ton. will find tons and tonneau but not wanton, because in wanton nothing follows ton.
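
A small Python sketch of the anchors and the period follows. The target string is a hypothetical one, assumed here only because it begins with Mozilla and contains Windows later on.

```python
import re

# Hypothetical target string, assumed for illustration.
target = "Mozilla/4.0 (compatible; Windows NT)"

starts_with_win = re.search(r"^Win", target)   # Windows is not at the start
starts_with_moz = re.search(r"^Moz", target)   # Mozilla is at the start

fox_at_end = re.search(r"fox$", "silver fox")
fox_in_middle = re.search(r"fox$", "the fox jumped over the moon")

# The period needs exactly one character after ton.
words = ["tons", "tonneau", "wanton"]
dot_words = [w for w in words if re.search(r"ton.", w)]

print(starts_with_win, starts_with_moz.group())
print(fox_at_end.group(), fox_in_middle)
print(dot_words)
```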

Iteration 'metacharacters'

The following is a set of iteration metacharacters (a.k.a. quantifiers) that can control the number of times a character or string is found in our searches.

? The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color and colour.

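* The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree and tread and trough. It finds trough because the * applies only to the e, so tr followed by no e at all still counts as a match.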

+ The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree and tread but not trough.
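
To see what each quantifier actually consumes, here is a minimal Python sketch using the same words. The extra word colouur is an assumption added to show that ? allows at most one u.

```python
import re

words = ["tree", "tread", "trough"]

# tre* : "tr" followed by zero or more "e" characters.
star_matches = [re.search(r"tre*", w).group() for w in words]

# tre+ : "tr" followed by at least one "e".
plus_words = [w for w in words if re.search(r"tre+", w)]

# colou?r : the "u" may appear zero times or once, never twice.
spellings = ["color", "colour", "colouur"]
optional_words = [w for w in spellings if re.fullmatch(r"colou?r", w)]

print(star_matches)
print(plus_words)
print(optional_words)
```

The first list shows the matched text rather than the word, which makes it clear that trough matches tre* only on its first two letters.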

More 'metacharacters'

The following is a set of additional metacharacters that provide added power to our searches:

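() The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together.

"MSIE.((5\.[5-9])|([6-9]))" matches MSIE 5.5 (or greater) OR MSIE 6+. The outer parentheses are required: without them the | splits the whole expression in two, and the right-hand side [6-9] would match any digit from 6 to 9 anywhere in the target string.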

| The | (vertical bar or pipe) is called alternation in techspeak and means find either the left-hand OR the right-hand expression, for example, gr(a|e)y will find 'gray' or 'grey'.
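
Grouping and alternation interact, because | has the lowest precedence of all the metacharacters. The Python sketch below compares an MSIE pattern with and without an outer group; the sample strings are hypothetical and chosen only to expose the difference.

```python
import re

# Alternation limited to one character by the parentheses.
grey_words = [w for w in ["gray", "grey", "gruy"] if re.search(r"gr(a|e)y", w)]

# Without an outer group, | splits the whole expression in two:
# either "MSIE.5.[5-9]" or just a single digit from 6 to 9.
loose = re.compile(r"MSIE.(5\.[5-9])|([6-9])")

# With an outer group, the alternatives both follow "MSIE.".
strict = re.compile(r"MSIE.((5\.[5-9])|([6-9]))")

unrelated = "Firefox 9"            # hypothetical, contains no MSIE at all
loose_hit = loose.search(unrelated) is not None
strict_hit = strict.search(unrelated) is not None

strict_results = {
    s: strict.search(s) is not None
    for s in ["MSIE 5.0", "MSIE 5.5", "MSIE 7.0"]
}

print(grey_words, loose_hit, strict_hit, strict_results)
```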

Common Extensions and Abbreviations

Character Class Abbreviations

\d Match any character in the range 0 - 9
\D Match any character NOT in the range 0 - 9
\s Match any whitespace character (space, tab, newline etc.).
\S Match any character that is NOT whitespace.
\w Match any character in the range 0 - 9, A - Z and a - z, plus the underscore (_) in most regex flavours
\W Match any character NOT matched by \w
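
A quick Python sketch of the abbreviations, using a made-up sample string. Note one assumption about the flavour: in Python 3, \d and \w also match non-ASCII digits and letters by default, so the re.ASCII flag is passed here to keep them to the ranges listed above.

```python
import re

# Hypothetical sample containing a space, a tab, a comma and an underscore.
sample = "Room 42,\tfloor_3"

digits = re.findall(r"\d", sample, flags=re.ASCII)
whitespace = re.findall(r"\s", sample, flags=re.ASCII)
non_space_runs = re.findall(r"\S+", sample, flags=re.ASCII)
word_runs = re.findall(r"\w+", sample, flags=re.ASCII)

print(digits)
print(whitespace)
print(non_space_runs)
print(word_runs)
```

The comma stays attached to 42 under \S+ but is dropped under \w+, while the underscore in floor_3 counts as a word character.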

Positional Abbreviations

\b Word boundary. Matches the position at the beginning (\bxx) and/or end (xx\b) of a word without consuming any characters, thus \bton\b will find ton but not tons, but \bton will find tons.
\B Not word boundary. Matches any position that is NOT at the beginning (\Bxx) and/or end (xx\B) of a word, thus \Bton\B will find wantons but not tons, but ton\B will find both wantons and tons.
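
The boundary examples above can be verified with a few lines of Python, again as a sketch rather than anything the post prescribes.

```python
import re

words = ["ton", "tons", "wantons"]

# ton as a whole word only.
whole_word = [w for w in words if re.search(r"\bton\b", w)]

# ton at the start of a word.
word_start = [w for w in words if re.search(r"\bton", w)]

# ton strictly inside a word, touching neither edge.
inside_word = [w for w in words if re.search(r"\Bton\B", w)]

# ton that is not at the end of a word.
not_at_end = [w for w in words if re.search(r"ton\B", w)]

print(whole_word, word_start, inside_word, not_at_end)
```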

See Regular Expressions - User guide for more information.

http://www.regular-expressions.info/
