As shown above, when parentheses are used in regular expressions, they
create so-called “groups.” Take for instance the regular expression
(\d{1,2})-(\d{1,2})-(\d{4}), which describes a sequence that could
represent a date: one or two digits, followed by a dash, followed by one
or two digits, followed by a dash, followed by four digits (if you do
not understand this regular expression, check back in previous sections
of this chapter until you do understand it). This expression contains
three groups: the first containing one or two digits, the second
containing one or two digits, and the third one containing the four
digits at the end. The code below searches for this pattern in a string.

import re

pDate = re.compile( r"(\d{1,2})-(\d{1,2})-(\d{4})" )
m = pDate.search( "In response to your letter of 25-3-2015, \
I decided to hire a hitman to get you." )
if m:
    print( "Date {}; day {}; month {}; year {}".format(
        m.group(0), m.group(1), m.group(2), m.group(3) ) )

When you run the code, you see that you can retrieve not only the match
as a whole (using the method group() or group(0)), but also each of the
groups that is found in the match, using group(1) for the day, group(2)
for the month, and group(3) for the year. You can also use the method
groups() to get a tuple with all the groups.
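
As a brief illustration of groups(), the sketch below reuses the date pattern from the example above on a shortened sample sentence. The sentence and the variable names day, month, and year are chosen here for illustration only. Because groups() returns a tuple, it can be unpacked directly into separate variables.

```python
import re

pDate = re.compile( r"(\d{1,2})-(\d{1,2})-(\d{4})" )
m = pDate.search( "Your letter of 25-3-2015 has arrived." )
if m:
    # groups() returns all captured groups as a tuple of strings;
    # the whole match (group 0) is not included.
    print( m.groups() )           # ('25', '3', '2015')
    day, month, year = m.groups()
    print( day, month, year )     # 25 3 2015
```

Note that the groups are strings; if you want to calculate with them, you have to convert them with int() first.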

findall() and groups

The findall() method returns a list of matches. In the examples where
it was used until now, it returned a list of strings. Indeed, the list
items are strings if there is at most one group in the regular
expression: the whole match if there are no groups, or the text captured
by the group if there is exactly one. If there are multiple groups, the
list items are tuples that contain all the groups.

import re

pDate = re.compile( r"(\d{1,2})-(\d{1,2})-(\d{4})" )
datelist = pDate.findall( "In response to your letter of \
25-3-2015, on 27-3-2015 I decided to hire a hitman to get you." )
for date in datelist:
    print( date )

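
The sketch below, with a sample sentence and variable names made up for illustration, shows how the number of groups in the pattern determines what findall() returns. The same date pattern is written three times: without groups, with one group, and with three groups.

```python
import re

text = "Letters of 25-3-2015 and 27-3-2015."

# No groups: each item is the whole matched string.
no_groups = re.findall( r"\d{1,2}-\d{1,2}-\d{4}", text )
print( no_groups )     # ['25-3-2015', '27-3-2015']

# One group: each item is the string captured by that group.
one_group = re.findall( r"\d{1,2}-\d{1,2}-(\d{4})", text )
print( one_group )     # ['2015', '2015']

# Several groups: each item is a tuple of the captured strings.
three_groups = re.findall( r"(\d{1,2})-(\d{1,2})-(\d{4})", text )
print( three_groups )  # [('25', '3', '2015'), ('27', '3', '2015')]
```
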
It is possible to give each group a name, by placing the construct
?P<name> (where you replace “name” with the name you want the group to
have; the < and > stay in the expression) immediately after the opening
parenthesis. You can then refer to the groups by these names, instead of
by their index.

import re

pDate = re.compile(
    r"(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4})" )
m = pDate.search( "In response to your letter of 25-3-2015, \
I curse you." )
if m:
    print( "day is {}".format( m.group('day') ) )
    print( "month is {}".format( m.group('month') ) )
    print( "year is {}".format( m.group('year') ) )

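
Two further details may be useful here, shown in the sketch below with a shortened sample sentence chosen for illustration. First, a named group keeps its index, so group('day') and group(1) return the same text. Second, the match object has a method groupdict(), which is not discussed in the text above; it returns a dictionary that maps each group name to the text it captured.

```python
import re

pDate = re.compile(
    r"(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4})" )
m = pDate.search( "Your letter of 25-3-2015 has arrived." )
if m:
    # A named group can still be accessed by its index.
    print( m.group('day') == m.group(1) )   # True
    # groupdict() maps each group name to the text it captured.
    print( m.groupdict() )
    # {'day': '25', 'month': '3', 'year': '2015'}
```
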
Suppose that you have to create a regular expression that represents a
string that contains an arbitrary non-space character at least twice.
For instance, the string “python” would not have a match, but the string
“expression” would (as it contains two "e"s and two "s"s). This cannot
be done with the regular expression features that we discussed until
now. It can be solved, however, with groups and special references
within a regular expression: using the special sequence \x, whereby x is
a number, you refer to the group with index x in the match. Thus, a
regular expression that represents a string with an arbitrary non-space
character twice is (\S).*\1.
Since at this point this regular expression might still be a bit hard to
understand, let’s look at it in depth. The \S is a special sequence
that represents a non-whitespace character. Putting it in parentheses
turns it into a group, and since this is the first (and only) group in
the expression, its index is 1. The .* represents a sequence of zero or
more characters, which can be anything except a newline (the period is a
meta-character that represents any character other than a newline).
Finally, the \1 refers to the first group, and says that here you want
to have exactly the same text as the first group matched. If you are
wondering why you do not need to represent anything that can be placed
before the \S, or anything that can come after the \1, then the answer
is that you are not specifying that this regular expression represents
the string as a whole, so as long as the pattern occurs anywhere in the
string, it matches.
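
To make the walk-through above concrete, the sketch below applies the pattern to two test words chosen here for illustration and prints both the part of the string that matched and the character that the back-reference \1 had to repeat.

```python
import re

for text in ( "expression", "python" ):
    m = re.search( r"(\S).*\1", text )
    if m:
        # group(0) is the part of the string that matched;
        # group(1) is the character that \1 refers back to.
        print( "{}: matched '{}', repeated character '{}'".format(
            text, m.group(0), m.group(1) ) )
    else:
        print( "{}: no match".format( text ) )
```

For "expression" the match is only "expre": the search starts at the first "e", the .* takes as many characters as it can, and the \1 then needs a second "e" to end the match. The word "python" has no repeated character, so there is no match.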
Test this pattern with the code below, by replacing the string
"Monty Python's Flying Circus" with different strings, and running the
code to examine the results.

import re

m = re.search( r"(\S).*\1", "Monty Python's Flying Circus" )
if m:
    print( "The character {} occurs twice".format( m.group(1) ) )
else:
    print( "No match was found." )

Can you change the regular expression in the code above so that it checks if the string contains a character at least three times?
Can you change the regular expression so that it checks whether the
string contains at least two characters twice? This is quite hard and
therefore optional, but if you try to do it, make sure that you test it
with at least the strings "aaaa", "aabb", "abab" and "abba". These
should all match, unless you also want the two repeated characters to be
different, in which case "aaaa" should not match (but note that this
makes the regular expression even harder to design).