Zintis Regular Expression Cheat

1 Testing your regexp patterns

Two sites exist that make testing your regular expressions easy:

2 Patterns

2.1 Boolean 'or'

The vertical bar separates alternatives: gray|grey matches either gray or grey.

2.1.1 You can use the 'or' in grep like this:

grep 'gray\|grey' *.txt

If you want to find all the lines that have BOTH strings on the same line, in either order, try this:

grep 'string1.*string2\|string2.*string1' *.py
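The same two searches can be tried in Python's re module. This is only a sketch: the sample lines and the names `lines`, `either` and `both` are made up for illustration. Note that Python uses an unescaped | for alternation, unlike the \| used with basic grep above.

```python
import re

# Hypothetical sample lines, not taken from any real file.
lines = [
    "the sky is gray",
    "grey clouds ahead",
    "string1 comes before string2",
    "string2 comes before string1",
    "only string1 here",
]

# Alternation: either spelling is accepted.
either = [ln for ln in lines if re.search(r"gray|grey", ln)]

# Both strings on one line, in either order.
both = [ln for ln in lines
        if re.search(r"string1.*string2|string2.*string1", ln)]

print(either)
print(both)
```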

2.2 Grouping

Parentheses are used to scope operators and set their precedence: gr(a|e)y is the same as gray|grey.
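A small Python sketch of why the parentheses matter. The word list and variable names are invented for the example; re.fullmatch is used so that the whole word has to match.

```python
import re

# Hypothetical test words.
words = ["gray", "grey", "gra", "ey"]

# The group limits the | to the single letter in the middle.
grouped = [w for w in words if re.fullmatch(r"gr(a|e)y", w)]

# Spelling out both alternatives gives the same result.
spelled_out = [w for w in words if re.fullmatch(r"gray|grey", w)]

# Without parentheses the | splits the whole pattern into "gra" or "ey".
unscoped = [w for w in words if re.fullmatch(r"gra|ey", w)]

print(grouped, spelled_out, unscoped)
```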

2.3 Quantification

A quantifier specifies how often the preceding element is allowed to occur

? zero or one
* zero or more
+ one or more
{n} preceding item is matched exactly n times
{min,} preceding item is matched min or more times
{,max} preceding item is matched not more than max times
{max,} is sometimes shown with the "not more than max" meaning. This, I believe, is a typo. It should be {,max} as above.
{min,max} preceding item is matched at least min times but not more than max times
. wildcard matches any single character

2.4 Quantification examples:

Char	Meaning	Example	Resultant Match
.	wildcard char	a.b	axb, ayb, azb, aWb, …
*	zero or more	ab*c	ac, abc, abbc, abbbc, …
.*	wildcard star	a.*b	"a" and "b" at some later point
?	zero or one	colou?r	color and colour
*?	non-greedy star	a.*?b	"a" up to the nearest following "b"
+	one or more	ab+c	abc, abbc, abbbc, …
+?	non-greedy plus	ab+?c	same as ab+c, but non-greedy
{n}	exactly n times	ab{6}c	abbbbbbc
{n,}	n or more times	ab{6,}c	abbbbbbc, abbbbbbbbbbc, …
{,n}	n or fewer times	ab{,6}c	ac, abc, abbbc, …
{n,m}	between n and m times	ab{2,4}c	abbc, abbbc, or abbbbc
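The quantifiers in the table can be checked with a few lines of Python. This is a sketch only: the helper `matches` and the candidate strings are made up here, and re.fullmatch is used so each candidate must match from start to end.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# Hypothetical candidates with 0, 1, 2, 4 and 6 b's.
candidates = ["ac", "abc", "abbc", "abbbbc", "abbbbbbc"]

star = matches(r"ab*c", candidates)        # zero or more b
plus = matches(r"ab+c", candidates)        # one or more b
optional = matches(r"ab?c", candidates)    # zero or one b
exact = matches(r"ab{6}c", candidates)     # exactly six b
ranged = matches(r"ab{2,4}c", candidates)  # two to four b

# Greedy versus non-greedy on the same text.
text = "a1b2b"
greedy = re.search(r"a.*b", text).group()   # runs to the last b
lazy = re.search(r"a.*?b", text).group()    # stops at the first b

print(star, plus, optional, exact, ranged, greedy, lazy)
```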

2.5 Metacharacters

Char	Meaning	Example	Resultant Match
^	start of line	^Hello	'Hello' (only at the beginning of the line)
$	end of line	xray$	'xray' (only at the END of the line)
.	wildcard	a.c	matches any single character, e.g. abc,
			unless inside [], where it is a literal 'dot'
			e.g. [a.c] ONLY matches 'a', '.', or 'c'
[ ]	single char	[abc]	matches a single character in the listed
			`set`, i.e. ONLY 'a', 'b', or 'c'
[a-f]	single char in range	[q-t]	matches ONLY 'q', 'r', 's' or 't'
[a-z]	lower case		matches a single lower case letter
[A-Z]	upper case		matches a single upper case letter
( )	sub-expression	(cat)	'cat' can be recalled later; this is
			called an `extraction`, using $1 for the
			first one, $2 for the 2nd, etc.
( )	also a group	(red)	the letters must appear contiguously and in order:
			red matches
			ordered matches
			order does NOT match
?	modifier		modifies *, +, or ? to match as few
			times as possible

2.6 Examples

Regexp	Matches	Does NOT Match
.at	cat, bat, hat, mat, rat
[hc]at	cat, hat	bat, mat, rat
[^b]at	cat, hat, mat, rat	bat
^bat	bat (ONLY at the beginning of the line)	bat (anywhere else)
bat$	bat (ONLY at the end of the line)	bat (anywhere else)
cat|dog	'cat' or 'dog'	bird
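A quick way to confirm rows like these is to run them in Python. The sketch below uses invented test strings ("bat cave", "cricket bat" and so on) and makes no claim beyond these few cases.

```python
import re

words = ["cat", "bat", "hat", "mat", "rat"]

any_at = [w for w in words if re.fullmatch(r".at", w)]
hc_at = [w for w in words if re.fullmatch(r"[hc]at", w)]
not_b_at = [w for w in words if re.fullmatch(r"[^b]at", w)]

# Anchors: ^ ties the match to the start, $ to the end.
bat_at_start = bool(re.search(r"^bat", "bat cave"))
bat_not_at_start = bool(re.search(r"^bat", "a bat cave"))
bat_at_end = bool(re.search(r"bat$", "cricket bat"))

# Alternation between two whole words.
pets = [w for w in ["cat", "dog", "bird"] if re.fullmatch(r"cat|dog", w)]

print(any_at, hc_at, not_b_at, bat_at_start, bat_not_at_start, bat_at_end, pets)
```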

2.7 Character classes

[\x00-\x7f] ASCII characters
[A-Za-z0-9] Alphanumeric characters
[A-Za-z0-9_] Alphanumeric characters plus '_'
[^A-Za-z0-9_] non-word characters
[A-Za-z] Alphabetic characters
[ \t] space and tab
(?<=\W)(?=\w)|(?<=\w)(?=\W) word boundaries
(?<=\W)(?=\W)|(?<=\w)(?=\w) non-word boundaries

there are more
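As a rough illustration, the bracket classes above can be tried with re.findall in Python. The sample text and variable names are assumptions made for this sketch.

```python
import re

# Hypothetical sample text containing a tab and some punctuation.
text = "user_1 logged in\tat 09:15"

words = re.findall(r"[A-Za-z0-9_]+", text)      # alphanumerics plus '_'
non_word = re.findall(r"[^A-Za-z0-9_]+", text)  # everything else
letters_only = re.findall(r"[A-Za-z]+", text)   # alphabetic runs
blanks = re.findall(r"[ \t]", text)             # each space or tab

print(words)
print(non_word)
print(letters_only)
print(blanks)
```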

3 Common Regular Expressions

Match a [ by escaping it with a \ so matching [5] would be \[5\]

ipv4 addresses:

([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})
but you do not need the round brackets around [0-9]{1,3}
[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
this can be rewritten as:

([0-9]{1,3}\.){3}[0-9]{1,3}


If the address should stand alone as a whole word (for example, with spaces before and after it), add word boundaries and make the group non-capturing:
\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b

But that would match an invalid address like 787.888.910.666, so we can be more accurate by limiting each octet to 0-255:
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

alias addr="ip addr | egrep '([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})'"
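To compare the loose and the stricter IPv4 patterns, here is a minimal Python sketch. The names `LOOSE`, `OCTET` and `STRICT` and the sample text (loosely modelled on ip addr output) are invented for the example.

```python
import re

# Loose form: any 1-3 digits per octet.
LOOSE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")

# Stricter form: each octet limited to 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")

# Hypothetical input with two valid addresses and one bogus one.
sample = "inet 192.168.1.10/24 brd 192.168.1.255 bogus 787.888.910.666"

loose_hits = LOOSE.findall(sample)
strict_hits = STRICT.findall(sample)

print(loose_hits)
print(strict_hits)
```

Because both patterns use only non-capturing groups, findall returns the whole matched address each time.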

3.1 regexp for multiple spaces.

If you're looking for a space, that would be " " (one space).

If you're looking for one or more, it's "  *" (that's two spaces and an asterisk) or " +" (one space and a plus).

3.2 specialty regexp for multiple spaces

Here is everything you need to know about whitespace in regular expressions:

[[:blank:]] Space or tab only
[[:space:]] Whitespace
\s Any whitespace character
\v Vertical whitespace
\h Horizontal whitespace
x Ignore whitespace in the pattern (the x modifier)

3.3 advanced spaces in regexp:

\s{2,}

This matches all occurrences of two or more consecutive whitespace characters.

If you need to match the entire line, but only if it contains two or more consecutive whitespace characters:

^.*\s{2,}.*$

If the whitespaces don't need to be consecutive:

^(.*\s.*){2,}$
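The three whitespace patterns can be compared side by side in Python. This is a sketch with made-up sample lines; the re.sub call at the end is an extra illustration of a common use of \s{2,}, not something taken from the notes above.

```python
import re

# Hypothetical lines: one space, two consecutive spaces,
# two separate spaces, and no whitespace at all.
lines = ["one two", "one  two", "a b c", "nospace"]

# Whole line, only if it has two or more consecutive whitespace characters.
two_consecutive = [ln for ln in lines if re.search(r"^.*\s{2,}.*$", ln)]

# Whole line, only if it has two or more whitespace characters anywhere.
two_anywhere = [ln for ln in lines if re.search(r"^(.*\s.*){2,}$", ln)]

# Collapse each run of two or more whitespace characters to one space.
collapsed = re.sub(r"\s{2,}", " ", "too    many   spaces")

print(two_consecutive)
print(two_anywhere)
print(collapsed)
```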

4 regexp are greedy (unless ? follows * or +)

They will gobble up the biggest block that matches the expression.

Regexp are by default greedy so a regexp of ^.*MOS by default will match everything from the beginning of the line to the last MOS found. i.e.

Nowadays a chicken leg is a rare dish MOS1 : The depth of a well is hard to fathom MOS2

then ^.*MOS will match:

Nowadays a chicken leg is a rare dish MOS1 : The depth of a well is hard to fathom MOS

But if your regexp is ^.*?MOS it is not greedy, and it would match only up to the first MOS it found, e.g.:

Nowadays a chicken leg is a rare dish MOS

So sed 's/^.*MOS//' would get you 2

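and a non-greedy substitution using ^.*?MOS would get you 1 : The depth of a well is hard to fathom MOS2

Note that sed itself has no non-greedy quantifiers, so the second substitution needs a tool that supports them, e.g. perl -pe 's/^.*?MOS//'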
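Python's re module supports both the greedy and the non-greedy form, so the two substitutions can be compared directly. This is a sketch using the sample line from this section; the variable names are made up.

```python
import re

line = ("Nowadays a chicken leg is a rare dish MOS1 : "
        "The depth of a well is hard to fathom MOS2")

# Greedy: .* runs to the LAST "MOS" on the line.
greedy_rest = re.sub(r"^.*MOS", "", line)

# Non-greedy: .*? stops at the FIRST "MOS".
lazy_rest = re.sub(r"^.*?MOS", "", line)

print(greedy_rest)
print(lazy_rest)
```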

5 Python import re

The regular expressions module, re, needs to be imported before use. It is part of the Python standard library, so every Python installation should have re.

Use re.search() to see if a string contains a match for a regexp, similar to the find() method for strings.

Use re.findall() to extract the portions of a string that match the regexp, which is similar to running find() followed by slicing, e.g. var[5:10].
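A short sketch of both calls. The sample line, the addresses in it and the deliberately simple address pattern are all made up for illustration.

```python
import re

# Hypothetical input line.
line = "From: alice@example.com, bob@example.org on 2024-01-15"

# re.search returns a match object, or None if nothing matches.
date_match = re.search(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", line)
found = date_match is not None
date_text = date_match.group() if date_match else None
date_start = date_match.start() if date_match else -1

# re.findall returns every non-overlapping match as a list of strings.
addresses = re.findall(r"[A-Za-z0-9._]+@[A-Za-z0-9.]+", line)

print(found, date_text, date_start)
print(addresses)
```

The start() position plays the role of the index that find() would return, and group() replaces the manual slice.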

6 Sub-expressions ()

In the search part of a regex, expressions wrapped in brackets () are stored in variables that can be recalled using $1 for the first set of brackets, $2 for the 2nd, etc. The regex inside the brackets is called a "capturing group". These groups can be used in the replacement pattern of tools that support substitution. The exact syntax depends on the tool: sed, for example, uses \1 and \2 rather than $1 and $2.

$n includes the last substring matched by the nth capturing group
${name} includes the last substring matched by the named group designated by (?<name> ) in the replacement string
$$ includes a single literal "$" in the replacement string.
$& includes a copy of the entire match in the replacement string
$` includes all the text of the input string before the match in the replacement string
$' includes all the text of the input string after the match in the replacement string
$+ includes the last group captured in the replacement string
$_ includes the entire input string in the replacement string

Example:

String searched: "@gmail.com addr–"
Regexp: (\w+)\W+(\w+)

$` -> "@"
$& -> "gmail.com"
$1 -> "gmail"
$2 -> "com"
$+ -> "com"
$' -> " addr–"
$_ -> "@gmail.com addr–"
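Python's re module does not use the $ notation, but the same pieces are available. The sketch below uses a different, made-up input string ("<john smith>") with the same regexp. In Python, a replacement string refers to groups as \1 or \g<name>, and named groups are written (?P<name>...).

```python
import re

text = "<john smith>"          # hypothetical input
pattern = r"(\w+)\W+(\w+)"

m = re.search(pattern, text)
whole = m.group(0)              # the entire match, like $&
first = m.group(1)              # first capturing group, like $1
second = m.group(2)             # second capturing group, like $2
before = text[:m.start()]       # text before the match, like $`
after = text[m.end():]          # text after the match, like $'

# Swap the two captured words in the replacement.
swapped = re.sub(pattern, r"\2 \1", text)

# Named groups, recalled with \g<name> in the replacement.
named = re.sub(r"(?P<user>\w+)@(?P<host>\w+)",
               r"\g<host> has \g<user>", "bob@example")

print(whole, first, second, before, after, swapped, named)
```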

egrep "^([^f]|f([^o]|o([^r]|$)|$)|$).*$" cars.txt will match all lines that do NOT begin with for

So given the file cars.txt:

plym    fury    77      73      2500
chevy   nova    79      60      3000
ford    mustang 65      45      17000
volvo   gl      78      102     9850
ford    ltd     83      15      10500
Chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
ford    thundbd 84      10      17000
toyota  tercel  82      180     750
chevy   impala  65      85      1550
ford    bronco  83      25      9525

then egrep "^([^f]|f([^o]|o([^r]|$)|$)|$).*$" cars.txt will result in the output:

plym    fury    77      73      2500
chevy   nova    79      60      3000
volvo   gl      78      102     9850
Chevy   nova    80      50      3500
fiat    600     65      115     450
honda   accord  81      30      6000
toyota  tercel  82      180     750
chevy   impala  65      85      1550

6.0.1 breakdown:

^(the whole expression).*$ means all lines that:

begin ^ with text matching the whole expression
followed by any character .
zero or more times *
up to the end of the line $

the whole expression matches at the start of the line when:

the first character is not an f OR
the line starts with an f that is NOT followed by an o (or the line ends there) OR
the line starts with fo that is NOT followed by an r (or the line ends there) OR
the line is empty

egrep "^([^f]|f([^o]|o([^r]|$)|$)|$).*$" cars.txt

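Of course the obvious, easy approach is to use a flag, specifically -v, which inverts the match and prints only the lines that do NOT match the expression.

So egrep -v "ford" cars.txt will do it for this file. Strictly, -v "ford" drops lines containing ford anywhere, while the long pattern only looks at the start of the line; egrep -v "^for" cars.txt is the exact equivalent.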
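The two approaches can be compared in Python. This sketch uses four lines from cars.txt plus one invented line ("chevy   ford ...") that is not in the file; it is there only to show that the long pattern inspects just the start of the line, while inverting a plain search for ford looks at the whole line.

```python
import re

cars = [
    "plym    fury    77      73      2500",
    "ford    mustang 65      45      17000",
    "volvo   gl      78      102     9850",
    "ford    ltd     83      15      10500",
    "chevy   ford    00      00      0",   # hypothetical line, not in cars.txt
]

long_pattern = re.compile(r"^([^f]|f([^o]|o([^r]|$)|$)|$).*$")

# Lines kept by the long pattern (it only tests how the line starts).
kept_by_pattern = [ln for ln in cars if long_pattern.search(ln)]

# Equivalent inverted test: keep lines that do not start with "for".
kept_not_starting = [ln for ln in cars if not re.search(r"^for", ln)]

# Inverting a plain search for "ford", similar in spirit to egrep -v "ford".
kept_by_inverting = [ln for ln in cars if not re.search(r"ford", ln)]

print(kept_by_pattern)
print(kept_by_inverting)
```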