Zintis Regular Expression Cheat
1 Testing your regexp patterns
Two sites exist that make testing your regular expressions easy:
2 Patterns
2.1 Boolean 'or'
A vertical bar separates alternatives: gray|grey matches either "gray" or "grey"
2.1.1 You can use the 'or' in grep like this:
grep 'gray\|grey' *.txt
If you want to find all the lines that have BOTH strings on a line, try this
grep 'string1.*string2\|string2.*string1' *.py
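The same two ideas can be tried out in Python. This is only a small sketch: the sample lines are made up for illustration, and Python's re module writes the 'or' as a plain | rather than the \| used in the grep commands above.

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = [
    "the cat is gray",
    "a grey sky",
    "string1 comes before string2",
    "string2 comes before string1",
    "only string1 here",
]

# Boolean 'or': keep lines that contain either spelling.
either = [ln for ln in lines if re.search(r"gray|grey", ln)]

# BOTH strings on one line, in either order.
both_pattern = re.compile(r"string1.*string2|string2.*string1")
both = [ln for ln in lines if both_pattern.search(ln)]

print(either)
print(both)
```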
2.2 Grouping
Parentheses are used to scope and set the precedence of operators
gr(a|e)y
is the same as gray|grey
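A quick way to convince yourself of this is to run both forms over the same words. The sketch below uses Python's re module and a made-up word list; it also shows what happens when the parentheses are left out and the bar splits the whole pattern instead.

```python
import re

# Hypothetical word list for illustration.
words = ["gray", "grey", "gra", "ey", "griy"]

# Grouped alternation: only the middle letter varies.
grouped = [w for w in words if re.fullmatch(r"gr(a|e)y", w)]

# Spelled-out alternation: should give the same result.
spelled_out = [w for w in words if re.fullmatch(r"gray|grey", w)]

# Without parentheses the bar applies to everything: "gra" OR "ey".
ungrouped = [w for w in words if re.fullmatch(r"gra|ey", w)]

print(grouped, spelled_out, ungrouped)
```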
2.3 Quantification
A quantifier specifies how often the preceding element is allowed to occur
?
zero or one
*
zero or more
+
one or more
{n}
preceding item is matched exactly n times
{min,}
preceding item is matched min or more times
{,max}
preceding item is matched not more than max times
{max,}
This, I believe, is a typo. It should be {,max} as above.
{min,max}
preceding item is matched min times but not more than max times
.
wildcard, matches any single character
2.4 Quantification examples:
Char | Meaning | Example | Resultant Match |
---|---|---|---|
. | wildcard char | a.b | axb, ayb, azb, aWb, … |
* | zero or more | ab*c | ac, abc, abbc, abbbc… |
.* | wildcard star | a.*b | "a" and "b" at some later point |
? | zero or one | colou?r | color and colour |
.*? | wildcard star, non-greedy | a.*?b | from "a" to the nearest following "b" |
+ | one or more | ab+c | abc, abbc, abbbc… |
+? | one or more, non-greedy | ab+?c | same matches, but takes as few "b" as possible |
{n} | exactly n times | ab{6}c | abbbbbbc |
{n,} | n or more times | ab{6,}c | abbbbbbbbbbc |
{,n} | n or fewer times | ab{,6}c | abbbc |
{n,m} | between n and m times | ab{2,4}c | abbc or abbbc or abbbbc |
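The rows of this table can be checked mechanically. Below is a minimal sketch using Python's re module; the helper name matches() and the candidate strings are my own, built with a known number of b's so there is no miscounting.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# "a", then n times "b", then "c", for n = 0, 1, 2, 4, 6.
candidates = ["a" + "b" * n + "c" for n in (0, 1, 2, 4, 6)]

star = matches(r"ab*c", candidates)       # zero or more b
plus = matches(r"ab+c", candidates)       # one or more b
optional = matches(r"ab?c", candidates)   # zero or one b
ranged = matches(r"ab{2,4}c", candidates) # between 2 and 4 b
exact = matches(r"ab{6}c", candidates)    # exactly 6 b

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2,4}", ranged), ("{6}", exact)]:
    print(name, result)
```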
2.5 Metacharacters
Char | Meaning | Example | Resultant Match |
---|---|---|---|
^ | start of line | ^Hello | Hello (only at the beginning of the line) |
$ | end of line | xray$ | 'xray' only at the END of the line |
. | wildcard | a.c | any single character between 'a' and 'c', e.g. abc; inside [] it is a literal dot, e.g. [a.c] ONLY matches 'a', '.', or 'c' |
[ ] | single char | [abc] | a single character in the listed set, i.e. ONLY 'a', 'b', or 'c' |
[a-f] | single char in range | [q-t] | ONLY 'q', 'r', 's' or 't' |
[a-z] | lower case | | a single lower case letter |
[A-Z] | upper case | | a single upper case letter |
( ) | sub-expression | (cat) | 'cat' can be recalled later (called an extraction) using $1 for the first one, $2 for the 2nd, etc. |
( ) | also a group | (red) | the letters must appear contiguously: 'red' matches, 'ordered' matches, 'order' does NOT match |
? | modifier | | modifies *, +, ? to match as few times as possible |
2.6 Examples
Regexp | Matches | Does NOT Match |
---|---|---|
.at | cat, bat, hat, mat, rat | |
[hc]at | cat, hat | bat, mat, rat |
[^b]at | cat, hat, mat, rat | bat |
^bat | bat (ONLY at the beginning of the line) | bat (anywhere else) |
bat$ | bat (ONLY at the end of the line) | bat (anywhere else) |
cat\|dog | 'cat' or 'dog' (grep syntax; extended regexps write the vertical bar without the backslash) | bird |
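These examples are easy to verify with a few lines of Python. The sketch below is only an illustration: the helper hits() is my own name, the negated set is written [^b], and the 'or' is written with a plain | as Python's re module expects.

```python
import re

words = ["cat", "bat", "hat", "mat", "rat"]

def hits(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

any_at = hits(r".at")        # any character followed by "at"
h_or_c = hits(r"[hc]at")     # only "h" or "c" before "at"
not_b = hits(r"[^b]at")      # anything except "b" before "at"

# Anchors: the same word matches or not depending on where it sits.
starts = bool(re.search(r"^bat", "bat in the attic"))
not_at_start = bool(re.search(r"^bat", "a bat"))
ends = bool(re.search(r"bat$", "a bat"))

# Alternation.
pets = [w for w in ["cat", "dog", "bird"] if re.search(r"cat|dog", w)]

print(any_at, h_or_c, not_b, starts, not_at_start, ends, pets)
```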
2.7 Character classes
- [\x00-\x7f] ASCII characters
- [A-Za-z0-9] Alphanumeric characters
- [A-Za-z0-9_] Alphanumeric characters plus '_'
- [^A-Za-z0-9_] non-word characters
- [A-Za-z] Alphabetic characters
- [ \t] space and tab
- (?<=\W)(?=\w)|(?<=\w)(?=\W) word boundaries
- (?<=\W)(?=\W)|(?<=\w)(?=\w) non-word boundaries
there are more
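To see these classes at work, here is a small sketch in Python (the sample text is made up). It also compares the built-in \b with the lookaround form of a word boundary: the lookaround form needs a character on both sides, so it does not report the very start or end of the string.

```python
import re

text = "foo_bar baz-42"

# Word characters and non-word characters, spelled out as sets.
words = re.findall(r"[A-Za-z0-9_]+", text)
non_words = re.findall(r"[^A-Za-z0-9_]+", text)

# With re.ASCII, \w means exactly [A-Za-z0-9_].
shorthand = re.findall(r"\w+", text, flags=re.ASCII)

# Positions of word boundaries, two ways.
builtin = [m.start() for m in re.finditer(r"\b", text)]
lookaround = [m.start() for m in
              re.finditer(r"(?<=\W)(?=\w)|(?<=\w)(?=\W)", text)]

print(words, non_words)
print(builtin, lookaround)
```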
3 Common Regular Expressions
Match a [ by escaping it with a \ so matching [5] would be \[5\]
ipv4 addresses:
([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})
but you do not need the round brackets around [0-9]{1,3}:
[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
This can be rewritten as:
([0-9]{1,3}\.){3}[0-9]{1,3}
If there are spaces before and after the address:
\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b
But that would match invalid addresses like 787.888.910.666, so we can be more accurate:
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
alias addr="ip addr | egrep '([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})'"
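The difference between the loose and the more accurate pattern shows up as soon as you feed both some out-of-range numbers. A minimal Python sketch, where the names LOOSE, OCTET and STRICT and the sample addresses are my own:

```python
import re

# Loose form: any 1 to 3 digits per octet.
LOOSE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")

# Stricter form: each octet limited to 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")

samples = ["192.168.1.10", "787.888.910.666", "10.0.0.256"]

# fullmatch() insists that the whole string is an address.
loose_ok = [s for s in samples if LOOSE.fullmatch(s)]
strict_ok = [s for s in samples if STRICT.fullmatch(s)]

print(loose_ok)
print(strict_ok)
```

Using fullmatch() here is a deliberate choice: it tests whether the whole string is an address, rather than searching for one somewhere inside a longer line as the grep alias does.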
3.1 regexp for multiple spaces.
If you're looking for a space, that would be " " (one space).
If you're looking for one or more spaces, it's
"  *"
(that's two spaces and an asterisk)
or
" +"
(one space and a plus).
3.2 specialty regexp for multiple spaces
Here is everything you need to know about whitespace in regular expressions:
[[:blank:]] Space or tab only
[[:space:]] Whitespace
\s Any whitespace character
\v Vertical whitespace
\h Horizontal whitespace
x Ignore whitespace in the pattern (this one is a flag, not a character class)
3.3 advanced spaces in regexp:
\s{2,}
This matches all occurrences of two or more consecutive whitespace characters. If you need to match the entire line, but only if it contains two or more consecutive whitespace characters:
^.*\s{2,}.*$
If the whitespaces don't need to be consecutive:
^(.*\s.*){2,}$
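Here is how those three patterns behave on a few made-up lines, as a Python sketch. The double space is built with " " * 2 so that it survives copy and paste.

```python
import re

double = "one" + " " * 2 + "two"   # "one", two spaces, "two"
lines = ["one two", double, "a b c", "nospace"]

# Whole line, only if it has a run of two or more whitespace characters.
consecutive = [ln for ln in lines if re.search(r"^.*\s{2,}.*$", ln)]

# Whole line, if it has at least two whitespace characters anywhere.
anywhere = [ln for ln in lines if re.search(r"^(.*\s.*){2,}$", ln)]

# A common use of \s{2,}: collapse each run into a single space.
collapsed = re.sub(r"\s{2,}", " ", double)

print(consecutive)
print(anywhere)
print(collapsed)
```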
4 regexp are greedy (unless ? follows * or +)
They will gobble up the biggest block that matches the expression.
Regexps are greedy by default, so a regexp of ^.*MOS will match everything from the beginning of the line to the last MOS found. For example, given the line
Nowadays a chicken leg is a rare dish MOS1 : The depth of a well is hard to fathom MOS2
then ^.*MOS will match:
Nowadays a chicken leg is a rare dish MOS1 : The depth of a well is hard to fathom MOS
But if your regexp is ^.*?MOS it is not greedy and it would match only up to the first MOS it found, i.e.:
Nowadays a chicken leg is a rare dish MOS
So sed 's/^.*MOS//' would get you 2
and perl -pe 's/^.*?MOS//' would get you 1 : The depth of a well is hard to fathom MOS2
(sed's regexps have no non-greedy quantifiers, so the second command needs a tool with Perl-style regexps.)
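The greedy and non-greedy behaviour can be reproduced with Python's re.sub(), which supports the ? modifier after * and +. A minimal sketch using the same sample line:

```python
import re

line = ("Nowadays a chicken leg is a rare dish MOS1 : "
        "The depth of a well is hard to fathom MOS2")

# Greedy: .* runs to the LAST "MOS" on the line.
greedy = re.sub(r"^.*MOS", "", line)

# Non-greedy: .*? stops at the FIRST "MOS".
lazy = re.sub(r"^.*?MOS", "", line)

print(repr(greedy))
print(repr(lazy))
```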
5 Python import re
The regular expressions module, re, needs to be imported before use. It is part of the standard Python library, so every Python installation should have re.
Use re.search() to see if a string matches a regexp, similar to the find() method for strings.
Use re.findall() to extract portions of a string that match the regexp, which is similar to running find() followed by slicing, e.g. var[5:10].
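A short sketch of both calls side by side with their string-method counterparts. The sample text and the address pattern are made up for illustration; the pattern is deliberately simple and not a full email validator.

```python
import re

text = "From: alice@example.com, cc: bob@example.org"

# re.search() returns a match object (or None), a bit like str.find().
found = re.search(r"@", text)
position = text.find("@")

# re.findall() returns every matching portion as a list of strings.
addresses = re.findall(r"\S+@[A-Za-z0-9.]+", text)

print(found.start(), position)
print(addresses)
```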
6 Sub-expressions ()
In the search part of a regex, expressions wrapped in brackets () are stored in variables that can be recalled using $1 for the first set of brackets, $2 for the 2nd, etc. The regex inside the brackets is called a "capturing group". Capturing groups can be used in the replacement pattern of tools that support search and replace.
$n
includes the last substring matched by the nth capturing group in the replacement string
${name}
includes the last substring matched by the named group designated by (?<name> ) in the replacement string
$$
includes a single literal "$" in the replacement string
$&
includes a copy of the entire match in the replacement string
$`
includes all the text of the input string before the match in the replacement string
$'
includes all the text of the input string after the match in the replacement string
$+
includes the last group captured in the replacement string
$_
includes the entire input string in the replacement string
Example:
- String searched: "@gmail.com addr–"
- Regexp: (\w+)\W+(\w+)
$` –> "@"
$& –> "gmail.com"
$1 –> "gmail"
$2 –> "com"
$' –> " addr–"
$+ –> "com"
$_ –> "@gmail.com addr–"
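Python's re module does not use the $ forms; in a replacement string it writes \1, \2 (or \g<name>) instead, and the other pieces can be read off the match object. The sketch below works through the same string and regexp that way. The variable names are my own, and the mapping to the $ forms in the comments is my reading of the list above.

```python
import re

text = "@gmail.com addr–"
m = re.search(r"(\w+)\W+(\w+)", text)

before = text[:m.start()]   # text before the match, like $`
whole = m.group(0)          # the entire match, like $&
first = m.group(1)          # first capturing group, like $1
second = m.group(2)         # second capturing group, like $2
after = text[m.end():]      # text after the match, like $'

# Swap the two captured words in the first match only.
swapped = re.sub(r"(\w+)\W+(\w+)", r"\2.\1", text, count=1)

print(repr(before), repr(whole), repr(first), repr(second), repr(after))
print(swapped)
```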
egrep "^([^f]|f([^o]|o([^r]|$)|$)|$).*$" cars.txt
will match all lines that do NOT begin with for. The pattern is anchored with ^, so a "for" later in the line would not exclude the line.
So if the file cars.txt contains:
plym fury 77 73 2500
chevy nova 79 60 3000
ford mustang 65 45 17000
volvo gl 78 102 9850
ford ltd 83 15 10500
Chevy nova 80 50 3500
fiat 600 65 115 450
honda accord 81 30 6000
ford thundbd 84 10 17000
toyota tercel 82 180 750
chevy impala 65 85 1550
ford bronco 83 25 9525
then egrep "^([^f]|f([^o]|o([^r]|$)|$)|$).*$" cars.txt
will result in the output:
plym fury 77 73 2500
chevy nova 79 60 3000
volvo gl 78 102 9850
Chevy nova 80 50 3500
fiat 600 65 115 450
honda accord 81 30 6000
toyota tercel 82 180 750
chevy impala 65 85 1550
6.0.1 breakdown:
^(the whole expression).*$
means all lines that:
- begin (^) with text that matches the whole expression
- followed by any character .
- zero or more times *
- up to the end of the line $
the whole expression, applied at the start of the line:
- the first character is not an f, OR
- it is an f that is NOT followed by an o (or the line ends right after the f), OR
- it is fo that is NOT followed by an r (or the line ends right after the fo), OR
- the line is empty
egrep "^([^f]|f([^o]|o([^r]|$)|$)|$).*$" cars.txt
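Of course the obvious, easy approach is to use a flag, specifically -v, which inverts the match: only lines that do NOT match the following expression are printed.
So egrep -v "ford" cars.txt will do it. Note that -v "ford" drops lines with ford anywhere on the line; the exact equivalent of the long regexp is egrep -v "^for" cars.txt.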
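Both approaches can be compared in Python, where "not re.search(...)" plays the role of the -v flag. This is a sketch with the cars.txt contents pasted in as a string; the extra line "comfort 1 2 3" is invented to show where the long regexp and a plain inverted search for "for" part ways.

```python
import re

cars = """plym fury 77 73 2500
chevy nova 79 60 3000
ford mustang 65 45 17000
volvo gl 78 102 9850
ford ltd 83 15 10500
Chevy nova 80 50 3500
fiat 600 65 115 450
honda accord 81 30 6000
ford thundbd 84 10 17000
toyota tercel 82 180 750
chevy impala 65 85 1550
ford bronco 83 25 9525""".splitlines()

long_pattern = re.compile(r"^([^f]|f([^o]|o([^r]|$)|$)|$).*$")

# Lines kept by the long regexp.
by_regex = [c for c in cars if long_pattern.search(c)]

# Lines kept by inverting a simple search, like egrep -v "ford".
inverted = [c for c in cars if not re.search(r"ford", c)]

# Hypothetical line with "for" in the middle, not at the start.
extra = "comfort 1 2 3"
kept_by_regex = bool(long_pattern.search(extra))
kept_by_inverted_for = not re.search(r"for", extra)

print(by_regex == inverted, len(by_regex))
print(kept_by_regex, kept_by_inverted_for)
```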