Now that the basics of using regular expressions in Python via the re
module have been explained, I can get into the actual writing of regular
expressions.
The simplest regular expression is a string of characters, which
describes a pattern consisting of exactly that string of characters. You
may also describe a set of characters using square brackets [ and ]. For
instance, the regular expression [aeiou] describes any of the characters
"a", "e", "i", "o", or "u". This means that if [aeiou] is part of a
regular expression, exactly one of these letters (not several) must
appear at that location in the pattern. For instance, to search for the
words "ball", "bell", "bill", "boll" and "bull", the regular expression
b[aeiou]ll can be used.
import re
slist = re.findall( r"b[aeiou]ll", "Bill Gates and Uwe Boll \
drank Red Bull at a football match in Campbell." )
print( slist )
Change the regular expression above so that it not only finds the words “ball” and “bell”, but also “Bill”, “Boll”, and “Bull”.
You can use a dash within the square brackets between two characters to
indicate that they represent not only these two characters, but also all
the characters in between. For instance, the regular expression
[a-dqx-z] is equivalent to [abcdqxyz]. To describe any of the letters of
the alphabet, either as capital or lower case, you can use [A-Za-z].
Moreover, if you place a caret (^) right next to the opening square
bracket, that means that you want the opposite of what is within the
square brackets. For instance, [^0-9] indicates any character except
for a digit.
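As a small sketch of ranges and negation, the following code applies the three character sets mentioned above. The sample strings are made up purely for this illustration.

```python
import re

text = "Room 42b, exit Z9"   # made-up sample string

# [a-dqx-z] is equivalent to [abcdqxyz].
ranged = re.findall( r"[a-dqx-z]", text )
# [A-Za-z] matches any single letter, capital or lower case.
letters = re.findall( r"[A-Za-z]", text )
# [^0-9] matches any single character that is not a digit.
nondigits = re.findall( r"[^0-9]", "a1b2" )

print( ranged )
print( letters )
print( nondigits )
```

In the sample string only "b" and "x" fall within [a-dqx-z], and for "a1b2" the negated set keeps just "a" and "b".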
In a regular expression, just like in strings, the backslash character
(\) indicates that the character that follows it has a special meaning,
i.e., it is an escape sequence. The escape sequences that hold for
strings also hold for regular expressions, but regular expressions have
many more. There are also a few meta-characters that are interpreted in
a particular way. The following special sequences are defined (there are
more, but these are the most common ones):
symbol | meaning |
---|---|
\b | word boundary (zero-width) |
\B | not a word boundary (zero-width) |
\d | digit [0-9] |
\D | not a digit [^0-9] |
\n | newline |
\r | return |
\s | whitespace (including tabulation) |
\S | not a whitespace |
\t | tabulation |
\w | alphanumeric character [A-Za-z0-9_] |
\W | not an alphanumeric character [^A-Za-z0-9_] |
\/ | forward slash |
\\ | backslash |
\" | double quote |
\' | single quote |
^ | start of a string (zero-width) |
$ | end of a string (zero-width) |
. | any character (except a newline, by default) |
Note that “zero-width” means that the sequence does not represent a
character, but a position in the string between two characters (or the
start or end of the string). For instance, the regular expression ^A
represents a string that starts with the letter "A".
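To make the zero-width idea a bit more concrete, here is a minimal sketch that uses a few of the special sequences from the table. The sample string and the variable names are my own choices for this illustration.

```python
import re

text = "cat concat 2 cats"   # made-up sample string

# \bcat\b only matches "cat" when it stands as a word by itself.
whole = [ m.start() for m in re.finditer( r"\bcat\b", text ) ]
# Without the word boundaries, "cat" is also found inside other words.
anywhere = [ m.start() for m in re.finditer( r"cat", text ) ]
# \d matches a digit, \w+ a run of alphanumeric characters.
digits = re.findall( r"\d", text )
words = re.findall( r"\w+", text )
# ^ and $ tie the pattern to the start and the end of the string.
starts_with_c = bool( re.search( r"^c", text ) )
ends_with_t = bool( re.search( r"t$", text ) )

print( whole, anywhere )
print( digits, words )
print( starts_with_c, ends_with_t )
```

The word-boundary version finds "cat" only at position 0, while the plain version also finds it at positions 7 and 13. The boundaries themselves do not consume any characters, which is why the matched text is still just "cat".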
Moreover, you can place characters or substrings between parentheses,
in which case the characters are “grouped.” Within a group, you can
indicate a choice between multiple (sequences of) characters by placing
pipe-lines (|) between them. For instance, the regular expression
(apple|banana|orange) is the string "apple" or the string "banana" or
the string "orange".
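A short sketch of a group with alternatives is given below. The sample sentences are invented for the occasion, and the second pattern (gr(a|e)y) is only an extra illustration of a group inside a longer expression.

```python
import re

text = "I like apple pie and banana bread, but not orange juice."

# The group matches any one of the three alternatives.
fruits = [ m.group() for m in re.finditer( r"(apple|banana|orange)", text ) ]
print( fruits )

# A group can also be part of a longer pattern: "gray" or "grey".
colors = [ m.group() for m in re.finditer( r"gr(a|e)y", "gray grey groy" ) ]
print( colors )
```

The first search finds "apple", "banana" and "orange" in the order in which they occur; the second finds "gray" and "grey" but not "groy".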
You should be aware that some of these special sequences (in particular those without a backslash, the parentheses, and the pipe-line) do not work as indicated here when placed within square brackets. For instance, a period within square brackets does not mean “any character,” but an actual period.
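The difference can be seen in a small sketch like the following, which uses a made-up test string containing a period and a pipe-line.

```python
import re

text = "a.b|c"   # made-up sample string

# Outside square brackets, a period matches any character.
any_char = re.findall( r".", text )
# Inside square brackets, a period is just a period.
only_period = re.findall( r"[.]", text )
# The same holds for the pipe-line: here it is an ordinary character.
period_or_pipe = re.findall( r"[.|]", text )

print( any_char )
print( only_period )
print( period_or_pipe )
```

The first search returns all five characters, the second only the period, and the third the period and the pipe-line.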
Where regular expressions get really interesting is when repetitions are used. Several of the meta-characters are used to indicate that (part of) a regular expression is repeated multiple times. In particular, the following repetition operators are often used:
symbol | meaning |
---|---|
* | zero or more times |
+ | one or more times |
? | zero or one time |
{p,q} | at least p and at most q times |
{p,} | at least p times |
{p} | exactly p times |
You place such an operator after the (part of the) expression it
repeats. For instance, ab*c means the letter "a", followed by the letter
"b" zero or more times, followed by the letter "c". Thus, it matches the
strings "ac", "abc", "abbc", "abbbc", "abbbbc", etcetera.
A repetition operator after a group (between parentheses) indicates the
repetition of the whole group. For instance, (ab)*c matches the strings
"c", "abc", "ababc", "abababc", "ababababc", etcetera.
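The difference between repeating a single character and repeating a group can be checked with a sketch like the one below. To test whole strings, I anchor both patterns with ^ and $; the list of candidate strings and the a{2,3} example at the end are my own additions.

```python
import re

candidates = [ "ac", "abc", "abbbc", "c", "ababc", "abab" ]
results = {}

for s in candidates:
    # In ab*c the star repeats only the "b".
    single = bool( re.search( r"^ab*c$", s ) )
    # In (ab)*c the star repeats the whole group "ab".
    grouped = bool( re.search( r"^(ab)*c$", s ) )
    results[s] = (single, grouped)
    print( "{}: ab*c={}, (ab)*c={}".format( s, single, grouped ) )

# {2,3} asks for at least two and at most three repetitions.
counted = re.findall( r"a{2,3}", "a aa aaa aaaa" )
print( counted )
```

Only "abc" matches both patterns, while "abab" matches neither because the final "c" is missing. The last search skips the single "a" and finds a sequence of two "a"s and two sequences of three "a"s.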
Regular expression matching for repetitions is greedy. It always looks for the earliest place in the string where the pattern matches, and then extends that match as far as possible. For example:
import re
mlist = re.finditer( r"ba+", "A sheep says 'baaaaah' to Ali Baba." )
for m in mlist:
    print( "{} is found at {}.".format( m.group(), m.start() ) )
Change the regular expression in the code above so that it finds any
"b" followed by one or more "a"s, where the "b" might be capitalized.
The output should be "baaaaa", "Ba" and "ba".
Once you have solved the previous exercise, change the regular
expression so that it finds the pattern consisting of a "b" or "B"
followed by a sequence of one or more "a"s, repeated one or more times.
The output should be "baaaaa" and "Baba". You will need to use
parentheses for this. When you think that your regular expression is
correct, also test it on several other strings.
Here is another one, which searches for occurrences of one or more "a"s:
import re
mlist = re.finditer( r"a+", "A sheep says 'baaaaah' to Ali Baba." )
for m in mlist:
    print( "{} is found at {}.".format( m.group(), m.start() ) )
When you run this code, you see that it finds four occurrences of the
pattern: three times a single "a", and one time a sequence of five "a"s.
You might wonder why the pattern matching process does not also find the
four "a"s starting at position 16, the three "a"s starting at position
17, the two "a"s starting at position 18, and the single "a" starting at
position 19. The reason is that the finditer() and findall() functions,
when they find a match, continue searching immediately after the end of
the last found match. Normally, this is the behavior that you want.
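The fact that matches do not overlap can also be seen in a minimal sketch with a fixed pattern. The string of five "a"s and the pattern aa are chosen just for this illustration.

```python
import re

text = "aaaaa"   # five a's, made-up sample string

# Each search for "aa" resumes right after the end of the previous match.
pairs = [ (m.start(), m.group()) for m in re.finditer( r"aa", text ) ]
print( pairs )
```

The pattern is found at positions 0 and 2 only. The pairs that start at positions 1 and 3 overlap with a match that was already found, so they are never reported, and the final "a" at position 4 has no partner.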
Now change the r"a+" in the code above to r"a*", which changes it to
searching for zero or more "a"s. Before running the code, think about
what you expect the outcome to be. Then run the code and see if your
prediction was correct. If it wasn’t, do you now realize why the outcome
is what it is?
You may have noticed that regular expressions may become overly complex fast. It is a good idea to comment them so that you can understand them on later examination.
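One possible way to comment a regular expression is sketched below. It assumes the re.VERBOSE flag of the re module, which lets you spread a pattern over several lines and add a comment to each part; the pattern itself (a simple hour:minute time) and the name time_pattern are made up for this illustration.

```python
import re

# Hypothetical example pattern: a time written as two digits, a colon,
# and two more digits, such as "09:30".
time_pattern = re.compile( r"""
    \b          # start at a word boundary
    \d{2}       # two digits for the hour
    :           # a literal colon
    \d{2}       # two digits for the minutes
    \b          # end at a word boundary
    """, re.VERBOSE )

times = time_pattern.findall( "The train leaves at 09:30 and arrives at 11:05." )
print( times )
```

With re.VERBOSE, whitespace in the pattern is ignored and everything from a # to the end of the line is treated as a comment, so a space or # that you actually want to match has to be escaped or placed within square brackets. If that is inconvenient, ordinary Python comments on the line above the expression work too.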
With all you learned until now, you should be able to do the following exercise. It is wise to solve this one before continuing with the remainder of this chapter. The exercise consists of a piece of code that you have to complete.
When you run the code below, it tries to search for all the regular
expressions in relist, in all the strings in slist. It prints for each
string the numbers of all the regular expressions for which matches are
found. Your goal is to fill in the regular expressions in relist
according to the specification in the comments to the right of each
expression. Note that the first seven regular expressions need to cover
the string as a whole, so you should have them start with a caret and
end with a dollar sign, which indicates that the expression should match
the string from the start to the end.
import re
# List of strings used for testing.
slist = [ "aaabbb", "aaaaaa", "abbaba", "aaa", "gErbil ottEr",
    "tango samba rumba", " hello world ", " Hello World " ]
# List of regular expressions to be completed by the student.
relist = [
    r"", # 1. Only a's followed by only b's, including ""
    r"", # 2. Only a's, including ""
    r"", # 3. Only a's and b's, in any order, including ""
    r"", # 4. Exactly three a's
    r"", # 5. Neither a's nor b's, but "" allowed
    r"", # 6. An even number of a's (and nothing else)
    r"", # 7. Exactly two words, regardless of white spaces
    r"", # 8. Contains a word that ends in "ba"
    r""  # 9. Contains a word that starts with a capital
]
# For each string, print the numbers of the expressions that match it.
for s in slist:
    print( s, ':', sep='', end=' ' )
    for i in range( len( relist ) ):
        m = re.search( relist[i], s )
        if m:
            print( i+1, end=' ' )
    print()
The correct output is:
aaabbb: 1 3
aaaaaa: 1 2 3 6
abbaba: 3 8
aaa: 1 2 3 4
gErbil ottEr: 7
tango samba rumba: 8
 hello world : 5 7
 Hello World : 5 7 9
Make sure that you can do all of these correctly before you continue!