

Most Linux users avoid regular expressions (or "regex"), and unless you're a programmer you don't really need to know much about them. But a little knowledge can be useful as regular expressions are often presented as an option to enhance searches. You'll find them under More Options in OpenOffice.org's Find & Replace, for example.

Think of regular expressions as a kind of mini-programming language. Rather than go into details -- you'll find a great resource here -- I'll just list some examples you might find useful in your searches.


Metacharacters

A regex matches every character you enter literally, except for the following:

   .   ?   |   ^   $   *   +   [   ]   (   )   \

They're known as "metacharacters" and are part of the regex "language".

The fullstop (".") will match any character except for line breaks:
  Searching on c.t will find cat, cot, cut, c2t, c#t, etc.
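
If you'd like to experiment outside a search dialog, a scripting language works well. The sketch below assumes Python's re module, which is my choice for illustration rather than anything the article depends on, and the sample text is made up.

```python
import re

# Made-up sample text for illustration.
text = "cat cot cut c2t c#t ct"

# "." stands for exactly one character, so "ct" (nothing in between) is skipped.
dot_matches = re.findall(r"c.t", text)
print(dot_matches)  # ['cat', 'cot', 'cut', 'c2t', 'c#t']
```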
 

A question mark ("?") makes the preceding character in the search optional:
   Searching on colou?r will match colour and color.


The vertical bar ("|") separates alternatives:
   Searching on gray|grey will match both gray and grey.
   Searching on abc|def|xyz will match abc, def or xyz.
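
The optional character and the alternatives can both be checked in a few lines. This is a rough sketch assuming Python's re module; the test strings (including the misspellings colr and groy) are invented for the example.

```python
import re

# "u?" makes the u optional, so both spellings match; "colr" does not.
optional_u = re.findall(r"colou?r", "colour color colr")

# "|" accepts either alternative; "groy" is neither.
either = re.findall(r"gray|grey", "gray grey groy")

# Any number of alternatives can be chained.
three_way = re.findall(r"abc|def|xyz", "xyz abc zzz def")

print(optional_u)  # ['colour', 'color']
print(either)      # ['gray', 'grey']
print(three_way)   # ['xyz', 'abc', 'def']
```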

 
The caret ("^") matches the start of a string -- which is to say, after any line break:
   Searching on ^blah will only match the first blah in a new line beginning blah, blah, blah

 
The dollar sign ("$") matches the end of a string -- which is to say, before any line break:
   Searching on blah$ will only match the last blah in a line ending in blah, blah, blah
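
Here is a small sketch of both anchors, assuming Python's re module. One difference to be aware of: in Python, "^" and "$" only treat line breaks as boundaries when the re.MULTILINE flag is set, so the behaviour described above corresponds to the flagged version. The sample strings are made up.

```python
import re

line = "blah, blah, blah"

# "^" pins the match to the start of the string, "$" to the end.
first = re.search(r"^blah", line)
last = re.search(r"blah$", line)
print(first.span())  # (0, 4)
print(last.span())   # (12, 16)

# With re.MULTILINE, the anchors also apply at every line break.
lines = "blah one\ntwo blah\nblah three"
line_starts = re.findall(r"^blah", lines, flags=re.MULTILINE)
line_ends = re.findall(r"blah$", lines, flags=re.MULTILINE)
print(len(line_starts), len(line_ends))  # 2 1
```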

 
The asterisk ("*") matches the preceding character zero or more times:
   Searching on ab*c will match ac, abc, abbc, abbbc, etc.

 
The plus sign ("+") matches the preceding character one or more times:
   Searching on ab+c will match abc, abbc, abbbc, etc. but not ac.
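
The only difference between the asterisk and the plus sign is whether zero occurrences are acceptable. A quick sketch, assuming Python's re module and a hypothetical list of candidate strings:

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc"]

# "b*" allows zero or more b's; "b+" insists on at least one.
star = [s for s in candidates if re.fullmatch(r"ab*c", s)]
plus = [s for s in candidates if re.fullmatch(r"ab+c", s)]

print(star)  # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)  # ['abc', 'abbc', 'abbbc']
```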

  
Square brackets ("[" and "]") find single character matches:
   Searching on gr[ae]y will match gray or grey but not graey or gry.
   Searching on in[du] will match the ind in Windows and the inu in Linux.

   To find multiple characters, repeat the square brackets:
     Searching on [1-9][0-9] will match all double-digit numbers from 10 to 99.

   Add a hyphen to indicate a range:
     Searching on z[1-3] will match z1, z2 and z3 but not z4.

   Multiple ranges are allowed:
     Searching on z[1-3a-c] will match z1, z2, z3, za, zb and zc but not z4, zd or zA.
     Searching on z[1-3a-cA-C] will match all of the above plus zA, zB and zC.

   A caret ("^") inside square brackets reverses the sense of the search:
     Searching on z[^1-3] will match z4, z0 and zB but not z1, z2 or z3.
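
The square-bracket examples can be tried out as follows. This is only a sketch: Python's re module is assumed, and the test strings are made up to cover each case mentioned above.

```python
import re

# One character from the set: a or e.
single = re.findall(r"gr[ae]y", "gray grey graey gry")

# Two ranges in one set: 1-3 or a-c (lower case only).
ranges = re.findall(r"z[1-3a-c]", "z1 z2 z3 za zb zc z4 zd zA")

# A leading caret negates the set: anything except 1, 2 or 3.
negated = re.findall(r"z[^1-3]", "z4 z0 zB z1 z2 z3")

# Two sets in a row match two characters.
two_digit = re.findall(r"[1-9][0-9]", "7 10 42 99")

print(single)     # ['gray', 'grey']
print(ranges)     # ['z1', 'z2', 'z3', 'za', 'zb', 'zc']
print(negated)    # ['z4', 'z0', 'zB']
print(two_digit)  # ['10', '42', '99']
```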

 
Brackets ("(" and ")") group a series of pattern elements into a single element:
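   Searching on (.pet) will find rpet in carpet and apet in parapet. It will only match in petal if some character, such as a space, comes before it.
   Searching on (g..)|(m..) in the string program name will match gra and m n, plus the final me only if another character (such as a space) follows it.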
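
To see exactly which characters each group picks up, it helps to print the matches. A minimal sketch, assuming Python's re module (in which m.lastindex reports the number of the group that matched):

```python
import re

text = "program name"

# Each "." consumes exactly one character, spaces included.
found = [(m.group(0), m.lastindex) for m in re.finditer(r"(g..)|(m..)", text)]
print(found)  # [('gra', 1), ('m n', 2)]

# ".pet" needs one character in front of "pet".
pets = [re.search(r"(.pet)", w) for w in ["carpet", "parapet", "petal"]]
pet_hits = [m.group(0) if m else None for m in pets]
print(pet_hits)  # ['rpet', 'apet', None]
```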


The backslash ("\") allows you to search for any of the metacharacters:
   Searching on $2 in the string $2.50 will find nothing because $ is a metacharacter.
   Searching on \$2 in the string $2.50 will find $2.

   Searching on \\ in the string C:\filename will find \.
   Searching on \\\\ in the string C:\\filename will find \\.

   Searching on 1\+ in the string 1+2=3 will find 1+.
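
The escaping examples above look like this in practice. The sketch assumes Python's re module; raw strings (the r"..." prefix) are used so that Python itself doesn't interpret the backslashes before the regex engine sees them.

```python
import re

price = "$2.50"

unescaped = re.search(r"$2", price)         # "$" here means end of string
escaped = re.search(r"\$2", price).group()  # "\$" is a literal dollar sign
print(unescaped)  # None
print(escaped)    # $2

# A literal backslash and a literal plus sign.
backslash = re.search(r"\\", r"C:\filename").group()
plus = re.search(r"1\+", "1+2=3").group()

# re.escape() inserts the backslashes automatically.
auto = re.escape("1+")
print(auto)  # 1\+
```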



Character Classes

The backslash ("\") is also associated with special characters:
     \d matches any digit. Equivalent to [0-9].
     \D matches any non-digit. Equivalent to [^0-9].

     \w matches any alphanumeric character plus the underscore ("_").  Equivalent to [A-Za-z0-9_].
     \W matches any character that is neither alphanumeric nor an underscore. Equivalent to [^A-Za-z0-9_].

     \s matches whitespace characters -- including spaces, tabs and line breaks.  Equivalent to [ \f\n\r\t\v].
     \S matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].

     \t matches tab characters.

     \r matches carriage returns.
     \n matches line feeds.
         Note that Windows uses \r\n to terminate lines. Linux just uses \n.

    \f matches form feeds.
    \v matches vertical tabs.

    \b matches a word boundary:
        er\b will only match the last er in "observer ".
        \bword\b will find "word" in " word ", " word," and "-word." but not crossword or wordy.

    \B matches a non-word boundary:
        er\B will only match the first er in "observer ".

    \A matches the start of a string.
       \A. will match the a in abc.

     \Z matches the end of the string.
       .\Z matches f in abcdef
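
A few of these classes in action, as a sketch assuming Python's re module. In Python 3, \d and \w also match non-ASCII digits and letters by default, so the re.ASCII flag is used here to get the [0-9] and [A-Za-z0-9_] equivalents listed above. The sample text is made up.

```python
import re

sample = "Order 66, item_7!"  # made-up sample text

digits = re.findall(r"\d", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)
spaces = re.findall(r"\s", sample)
print(digits)  # ['6', '6', '7']
print(words)   # ['Order', '66', 'item_7']

# "\b" sits between a word character and a non-word character.
whole_words = re.findall(r"\bword\b", "a word, crossword, wordy, -word.")
print(len(whole_words))  # 2

# Positions of the two "er" pairs in "observer ".
last_er = re.search(r"er\b", "observer ").start()
first_er = re.search(r"er\B", "observer ").start()
print(first_er, last_er)  # 3 6

start_char = re.search(r"\A.", "abc").group()
end_char = re.search(r".\Z", "abcdef").group()
print(start_char, end_char)  # a f
```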



Quantifiers

Curly braces ("{" and "}") specify the number of times the preceding character must be repeated:
   Searching on o{2} will match the oo's in good, food and Wooooooo! but not oboe.
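
Here's a quick sketch of the curly-brace quantifier, assuming Python's re module. It also tries the {min,max} range form, which the numeric examples in the conclusion rely on. The long "Wooooooo!" is built with seven o's so the count is unambiguous.

```python
import re

woo = "W" + "o" * 7 + "!"
words = ["good", "food", woo, "oboe"]

# "o{2}" needs two o's side by side; matches don't overlap.
pairs = {w: re.findall(r"o{2}", w) for w in words}
print(pairs["good"])  # ['oo']
print(pairs[woo])     # ['oo', 'oo', 'oo']
print(pairs["oboe"])  # []

# "{2,4}" means at least two and at most four repetitions.
ranged = re.findall(r"o{2,4}", woo)
print(ranged)  # ['oooo', 'ooo']
```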



Conclusion
By now you'll probably appreciate how powerful regexes can be. By combining metacharacters, you can perform some pretty sophisticated matches. Searching for:

\b[1-9][0-9]{3}\b
will find all numbers between 1000 and 9999.

\b[1-9][0-9]{2,4}\b will find all numbers between 100 and 99999 (the {2,4} means two to four repetitions).

(\<(/?[^\>]+)\>) will find most simple HTML tags.

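(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6}) will find simple email addresses such as user@example.com, though it will miss or truncate addresses with digits, hyphens or extra dots in the domain name.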

((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}) will find 8-15 character strings with at least one upper case letter, one lower case letter, and one digit. Each (?=...) is a "lookahead" that checks a condition without using up any characters. In short, it'll identify potential passwords!
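
Some of these combined patterns can be tested as below. This is a sketch under assumptions: Python's re module, made-up test data, and fullmatch() for the password pattern so that each candidate is checked as a whole rather than searched for inside a longer line.

```python
import re

# Four-digit numbers only; 999, 10000 and 0123 are rejected.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 4321 9999 10000 0123")
print(four_digits)  # ['1000', '4321', '9999']

# The email pattern from the article, on a simple address.
emails = re.findall(r"(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6})", "mail bob@example.com today")
print(emails)  # ['bob@example.com']

# The password pattern: digit, lower case, upper case, 8-15 characters.
password = re.compile(r"((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})")
candidates = ["Passw0rdX", "password1", "Ab1"]
accepted = [c for c in candidates if password.fullmatch(c)]
print(accepted)  # ['Passw0rdX']
```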






Comments

Thanks Adrian, I've changed that to read, "... will match both gray and grey."

Good article. Just a small typo though: The line that describes the use of the vertical bar "|" should read:

Searching on gray|grey will match gray OR grey.

Cheers
