Hidden Linux: Regular expressions, by example


Most Linux users avoid regular expressions (or "regex"), and unless you're a programmer you don't really need to know much about them. But a little knowledge can be useful as regular expressions are often presented as an option to enhance searches. You'll find them under More Options in OpenOffice.org's Find & Replace, for example.
Think of regular expressions as a kind of mini-programming language. Rather than go into details -- you'll find a great resource here -- I'll just list some examples you might find useful in your searches.
Metacharacters
A regex will match any character you enter literally, except for the following:
. ? | ^ $ * + [ ] ( ) \
They're known as "metacharacters" and are part of the regex "language".
The fullstop (".") will match any character except for line breaks. Searching on b.g will match bag, beg and bug.
A question mark ("?") makes the preceding character in the search optional. Searching on colou?r will match color and colour.
The vertical bar ("|") separates alternatives:
Searching on gray|grey will match both gray and grey.
Searching on abc|def|xyz will match abc, def or xyz.
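To try the first three metacharacters outside of a Find & Replace dialog, here is a minimal sketch using Python's built-in re module. The sample strings are made up for illustration, and other regex tools may differ in small details.

```python
import re

# "." matches any single character except a line break,
# so "b\ng" (b, newline, g) is skipped.
dot_matches = re.findall(r"b.g", "bag beg b\ng bug")

# "?" makes the preceding character optional.
optional_matches = re.findall(r"colou?r", "color and colour")

# "|" separates alternatives; any one of them may match.
alternative_matches = re.findall(r"abc|def|xyz", "xyz, abc, then def")

print(dot_matches)
print(optional_matches)
print(alternative_matches)
```

re.findall returns every non-overlapping match from left to right, which makes it handy for checking what a pattern really picks up.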
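The caret ("^") matches the start of a string or, when the search works line by line, the position after any line break. Searching on ^Lin will match the Lin in Linux but not the one in GNU/Linux.
The dollar sign ("$") matches the end of a string or, when the search works line by line, the position before any line break. Searching on ux$ will match the ux in Linux but not the one in Linuxes.
The asterisk ("*") matches the preceding character zero or more times. Searching on go*d will match gd, god and good.
The plus sign ("+") matches the preceding character one or more times. Searching on go+d will match god and good but not gd.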
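Whether the caret and dollar sign stop at every line break depends on the tool. As a rough sketch in Python's re module (my choice for illustration, not something the article prescribes), line-by-line anchoring has to be switched on with the re.MULTILINE flag. The same snippet also contrasts the asterisk with the plus sign.

```python
import re

text = "first line\nsecond line"

# Without flags, "^" matches only at the very start of the string.
start_default = re.findall(r"^\w+", text)

# With re.MULTILINE, "^" also matches after each line break
# and "$" also matches before each line break.
start_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

# "*" allows zero or more of the preceding character,
# "+" requires at least one.
star_matches = re.findall(r"go*d", "gd god good")
plus_matches = re.findall(r"go+d", "gd god good")

print(start_default, start_multiline, end_multiline)
print(star_matches, plus_matches)
```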
Square brackets ("[" and "]") match a single character from the set they enclose:
Searching on in[du] will match the ind in Windows and the inu in Linux.
To find multiple characters, repeat the square brackets:
Searching on [1-9][0-9] will match all double-digit numbers from 10 to 99.
Add a hyphen to indicate a range:
Searching on z[1-3] will match z1, z2 and z3 but not z4.
Multiple ranges are allowed:
Searching on z[1-3a-c] will match z1, z2, z3, za, zb and zc but not z4, zd or zA.
Searching on z[1-3a-cA-C] will match all of the above plus zA, zB and zC.
A caret ("^") inside square brackets reverses the sense of the search:
Searching on z[^1-3] will match z4, z0 and zB but not z1, z2 or z3.
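The bracket examples above can be checked with a short Python sketch. The candidate list and the matching helper are hypothetical names introduced only for this illustration.

```python
import re

candidates = ["z1", "z2", "z3", "z4", "z0", "za", "zc", "zd", "zA", "zB"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

digits_only = matching(r"z[1-3]")
two_ranges = matching(r"z[1-3a-c]")
three_ranges = matching(r"z[1-3a-cA-C]")
negated = matching(r"z[^1-3]")

# One bracket pair matches exactly one character, so "in[du]"
# picks out three-letter pieces of longer words.
pieces = re.findall(r"in[du]", "Windows and Linux")

# Repeating the brackets matches one character per pair.
two_digit = re.findall(r"[1-9][0-9]", "7 10 42 99")

print(digits_only, two_ranges, three_ranges, negated, pieces, two_digit)
```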
Brackets ("(" and ")") group a series of pattern elements into a single element:
Searching on (g..)|(m..) in the string "program name " (with a trailing space) will find matches in "gra", "m n" and "me ".
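Here is a small Python sketch of that grouped search. It assumes the searched string ends with a space, because "me" needs a third character for m.. to match.

```python
import re

# The trailing space gives "me" a third character to match.
text = "program name "

found = []
for m in re.finditer(r"(g..)|(m..)", text):
    # m.lastindex reports which bracketed group produced the match:
    # 1 for (g..), 2 for (m..).
    found.append((m.group(0), m.lastindex))

print(found)
```

finditer is used rather than findall because findall returns one tuple per match when a pattern contains several groups, which is harder to read.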
The backslash ("\") lets you search for any of the metacharacters as a literal character:
Searching on \$2 in the string $2.50 will find $2.
Searching on \\ in the string C:\filename will find \.
Searching on \\\\ in the string C:\\filename will find \\.
Searching on 1\+ in the string 1+2=3 will find 1+.
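The escaping examples can be reproduced in Python as sketched below. Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them. re.escape is a standard-library helper that adds the backslashes for you; using it here is my suggestion, not part of the original examples.

```python
import re

dollar = re.search(r"\$2", "$2.50").group(0)
plus = re.search(r"1\+", "1+2=3").group(0)

# r"C:\filename" holds one literal backslash; r"C:\\filename" holds two.
single_backslash = re.search(r"\\", r"C:\filename").group(0)
double_backslash = re.search(r"\\\\", r"C:\\filename").group(0)

# re.escape builds a pattern that matches its argument literally.
escaped = re.escape("1+2=3")
literal_match = re.fullmatch(escaped, "1+2=3") is not None

print(dollar, plus, single_backslash, double_backslash, literal_match)
```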
Character Classes
The backslash ("\") also introduces shorthand character classes and special characters:
\d matches any digit. Equivalent to [0-9].
\D matches any non-digit. Equivalent to [^0-9].
\w matches any alphanumeric character plus the underscore ("_"). Equivalent to [A-Za-z0-9_].
\W matches any character that is neither alphanumeric nor an underscore. Equivalent to [^A-Za-z0-9_].
\s matches whitespace characters -- including spaces, tabs and line breaks. Equivalent to [ \f\n\r\t\v].
\S matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t matches tab characters.
\r matches carriage returns.
\n matches line feeds.
Note that Windows uses \r\n to terminate lines. Linux just uses \n.
\f matches form feeds.
\v matches vertical tabs.
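A quick Python sketch of these classes follows, using a made-up sample string. One caveat worth labelling: in Python, \w, \d and \s are Unicode-aware for ordinary strings, so the bracketed equivalents listed above hold exactly only when the re.ASCII flag is passed.

```python
import re

sample = "Room_42\tis\nfree"

digits = re.findall(r"\d", sample)
non_digit_runs = re.findall(r"\D+", sample)
words = re.findall(r"\w+", sample)
whitespace = re.findall(r"\s", sample)
non_space_runs = re.findall(r"\S+", sample)

# Windows ends lines with \r\n and Linux with \n;
# making \r optional handles both.
lines = re.split(r"\r?\n", "one\r\ntwo\nthree")

# Unicode-aware by default; re.ASCII restricts \w to [A-Za-z0-9_].
unicode_word = re.fullmatch(r"\w+", "café") is not None
ascii_word = re.fullmatch(r"\w+", "café", flags=re.ASCII) is not None

print(digits, non_digit_runs, words, whitespace, non_space_runs)
print(lines, unicode_word, ascii_word)
```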
\b matches a word boundary:
er\b will only match the last er in "observer ".
\bword\b will find "word" in " word ", " word," and "-word." but not crossword or wordy.
\B matches a non-word boundary:
er\B will only match the first er in "observer ".
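The word-boundary examples can be verified with a short Python sketch. It records the position of each match so you can see which er was found; the variable names are only illustrative.

```python
import re

text = "observer "  # positions: o=0 b=1 s=2 e=3 r=4 v=5 e=6 r=7 space=8

# \b requires a word boundary after "er": only the last "er" qualifies.
at_boundary = [m.start() for m in re.finditer(r"er\b", text)]

# \B requires that there is no boundary: only the first "er" qualifies.
inside_word = [m.start() for m in re.finditer(r"er\B", text)]

# \bword\b finds "word" only when it stands alone.
samples = [" word ", " word,", "-word.", "crossword", "wordy"]
whole_word = [bool(re.search(r"\bword\b", s)) for s in samples]

print(at_boundary, inside_word, whole_word)
```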
\A matches the start of a string.
\A. will match the a in abc.
\Z matches the end of the string.
.\Z will match the f in abcdef.
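The difference between \A and \Z and the caret and dollar sign shows up once a string contains a line break. The sketch below assumes Python's re module, where \A and \Z ignore the re.MULTILINE flag; some other regex flavours treat \Z slightly differently around a final line break.

```python
import re

text = "abc\ndef"

# \A and \Z anchor to the whole string, even in multi-line mode.
first_char = re.findall(r"\A.", text, flags=re.MULTILINE)
last_char = re.findall(r".\Z", text, flags=re.MULTILINE)

# "^" and "$" also anchor at every line break in multi-line mode.
line_starts = re.findall(r"^.", text, flags=re.MULTILINE)
line_ends = re.findall(r".$", text, flags=re.MULTILINE)

print(first_char, last_char, line_starts, line_ends)
```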
Quantifiers
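Curly braces ("{" and "}") specify the number of times the preceding character is to be matched. Searching on a{3} will match exactly three a's, a{2,4} will match between two and four, and a{2,} will match two or more.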
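As a minimal Python sketch of the three curly-brace forms, the patterns below are wrapped in \b so that each run of a's has to match as a whole. The sample text is made up for illustration.

```python
import re

text = "a aa aaa aaaa"

# {n} matches exactly n repetitions.
exactly_two = re.findall(r"\ba{2}\b", text)

# {n,m} matches between n and m repetitions.
two_to_three = re.findall(r"\ba{2,3}\b", text)

# {n,} matches n or more repetitions.
two_or_more = re.findall(r"\ba{2,}\b", text)

print(exactly_two, two_to_three, two_or_more)
```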
Conclusion
By now you'll probably appreciate how powerful regexes can be. By combining metacharacters, you can perform some pretty sophisticated matches. Searching for:
\b[1-9][0-9]{3}\b will find all numbers between 1000 and 9999.
\b[1-9][0-9]{2,4}\b will find all numbers between 100 and 99999.
(\<(/?[^\>]+)\>) will find HTML tags, or more precisely anything enclosed between < and >.
(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6}) will find many simple email addresses, though not every valid one.
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}) will find all 8-15 character strings with at least one upper case letter, one lower case letter, and one digit. In short, it'll identify all potential passwords!
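It's worth testing combined patterns like these against sample text before relying on them. The Python sketch below does that with made-up input; the addresses and passwords are invented for illustration, and re.fullmatch is my choice for checking a whole candidate password at once.

```python
import re

four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 4321 9999 10000 0123")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

# finditer with group(0) returns the whole tag rather than the inner group.
html = "<p>Hello <b>world</b></p>"
tags = [m.group(0) for m in re.finditer(r"(\<(/?[^\>]+)\>)", html)]

# The email pattern copes with a simple address but cuts off
# one that has a subdomain.
email = r"(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6})"
simple = re.search(email, "write to user@example.com today").group(0)
truncated = re.search(email, "write to user@mail.example.org").group(0)

# The lookaheads check for a digit, a lower case letter and an
# upper case letter before .{8,15} consumes the candidate.
password = r"((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})"
checks = [bool(re.fullmatch(password, p))
          for p in ["Passw0rd1", "password1", "Sh0rt"]]

print(four_digit, three_to_five, tags, simple, truncated, checks)
```

The truncated result shows why it pays to treat any one-line email pattern as a rough filter rather than a validator.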



Comments
Thanks Adrian, I've changed that to read, "... will match both gray and grey."
Posted by: Geoff | July 15, 2009 4:10 PM
Good article. Just a small typo though: The line that describes the use of the vertical bar "|" should read:
Searching on gray|grey will match gray OR grey.
Cheers
Posted by: Adrian | July 15, 2009 2:16 PM