GEOG 489
Advanced Python Programming for GIS

3.3 Regular expressions


To start off Lesson 3, we want to talk about a situation that you regularly encounter in programming: Often you need to find a string or all strings that match a particular pattern among a given set of strings.

For instance, you may have a list of names of persons and need all names from that list whose last name starts with the letter ‘J’. Or, you want to do something with all files in a folder whose names contain the sequence of numbers “154” and that have the file extension “.shp”. Or, you want to find all occurrences where the word “red” is followed by the word “green” with at most two words in between in a longer text.

Support for these kinds of matching tasks is available in most programming languages based on an approach for denoting string patterns that is called regular expressions.

A regular expression is a string in which certain characters like '.', '*', '(', ')', etc. and certain combinations of characters are given special meanings to represent other characters and sequences of other characters. Surely you have already seen the wildcard expression “*.txt” to stand for all files with arbitrary names but ending in “.txt”. Regular expressions follow the same basic idea, although, as we will see, the star has a somewhat different meaning in them.

To give you another example before we approach this topic more systematically, the following regular expression “a.*b” in Python stands for all strings that start with the character ‘a’ followed by an arbitrary sequence of characters, followed by a ‘b’. The dot here represents all characters and the star stands for an arbitrary number of repetitions. Therefore, this pattern would, for instance, match the strings 'acb', 'acdb', 'acdbb', etc.

Regular expressions like these can be used in functions provided by the programming language that, for instance, compare the expression to another string and then determine whether that string matches the pattern from the regular expression or not. Using such a function and applying it to, for example, a list of person names or file names allows us to perform some task only with those items from the list that match the given pattern.

In Python, the package from the standard library that provides support for regular expressions together with the functions for working with regular expressions is simply called “re”. The function for comparing a regular expression to another string and telling us whether the string matches the expression is called match(...). Let’s create a small example to learn how to write regular expressions. In this example, we have a list of names in a variable called personList, and we loop through this list comparing each name to a regular expression given in variable pattern and print out the name if it matches the pattern.

import re 

personList = [ 'Julia Smith', 'Francis Drake', 'Michael Mason',  
                'Jennifer Johnson', 'John Williams', 'Susanne Walker',  
                'Kermit the Frog', 'Dr. Melissa Franklin', 'Papa John', 
                'Walter John Miller', 'Frank Michael Robertson', 'Richard Robertson', 
                'Erik D. White', 'Vincent van Gogh', 'Dr. Dr. Matthew Malone', 
                'Rebecca Clark' ] 

pattern = "John" 

for person in personList: 
    if re.match(pattern, person): 
        print(person) 
Output:  

John Williams

Before we try out different regular expressions with the code above, we want to mention that the part of the code following the name list is better written in the following way:

pattern = "John" 

compiledRE = re.compile(pattern) 

for person in personList: 
    if compiledRE.match(person): 
        print(person) 

Whenever we call a function from the “re” module like match(…) and provide the regular expression as a parameter to that function, the function will do some preprocessing of the regular expression and compile it into some data structure that allows for matching strings to that pattern efficiently. If we want to match several strings to the same pattern, as we are doing with the for-loop here, it is more time efficient to explicitly perform this preprocessing and store the compiled pattern in a variable, and then invoke the match(…) method of that compiled pattern. In addition, explicitly compiling the pattern allows for providing additional parameters, e.g. when you want the matching to be done in a case-insensitive manner. In the code above, compiling the pattern happens in line 3 with the call of the re.compile(…) function and the compiled pattern is stored in variable compiledRE. Instead of the match(…) function, we now invoke the method match(…) of the compiled pattern object in variable compiledRE (line 6), which only needs one parameter, the string that should be matched to the pattern. Using this approach, the compilation of the pattern only happens once instead of once for each name from the list as in the first version.
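As a minimal sketch of the "additional parameters" mentioned above, the following example compiles a pattern with the re.IGNORECASE flag so that matching ignores upper and lower case. The short name list is made up for this illustration and is not part of the lesson's personList.

```python
import re

# Hypothetical sample names, invented only for this illustration.
names = ["John Williams", "JOHN SMITH", "john doe", "Papa John"]

# The flag is passed as the second argument of re.compile(...).
compiledRE = re.compile("john", re.IGNORECASE)

# match(...) still starts at the beginning of each string,
# so "Papa John" is not selected.
matches = [name for name in names if compiledRE.match(name)]
print(matches)
```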

One important thing to know about match(…) is that it always tries to match the pattern to the beginning of the given string but it allows for the string to contain additional characters after the entire pattern has been matched. That is the reason why when running the code above, the simple regular expression “John” matches “John Williams” but neither “Jennifer Johnson”, “Papa John”, nor “Walter John Miller”. You may wonder how you would then ever write a pattern that only matches strings that end in a certain sequence of characters? The answer is that Python's regular expressions use the special characters ^ and $ to represent the beginning or the end of a string and this allows us to deal with such situations as we will see a bit further below.
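The behavior described in this paragraph can be checked with a small sketch like the one below. It applies the pattern “John” to three names from the list and records whether match(…) finds a match; the variable name results is only used for this illustration.

```python
import re

# Three names taken from the list used throughout this section.
names = ["John Williams", "Jennifer Johnson", "Papa John"]

# re.match(...) returns a match object or None;
# bool(...) turns that into True or False.
results = {name: bool(re.match("John", name)) for name in names}

for name, matched in results.items():
    print(name, "->", matched)
```

Only the name that starts with “John” is reported as a match, even though all three names contain that character sequence.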

Now let’s have a look at the different special characters and some examples using them in combination with the name list code from above. Here is a brief overview of the characters and their purpose:

Special Characters and Their Purpose
Character  Purpose
.          stands for a single arbitrary character
[ ]        are used to define classes of characters and match any character of that class
( )        are used to define groups consisting of multiple characters in a sequence
+          stands for arbitrarily many repetitions of the previous character or group but at least one occurrence
*          stands for arbitrarily many repetitions of the previous character or group including no occurrence
?          stands for zero or one occurrence of the previous character or group, so basically says that the character or group is optional
{m,n}      stands for at least m and at most n repetitions of the previous character or group where m and n are integer numbers
^          stands for the beginning of the string
$          stands for the end of the string
|          stands between two characters or groups and matches either only the left or only the right character/group, so it is used to define alternatives
\          is used in combination with the next character to define special classes of characters, or to use a special character in its verbatim sense

Since the dot stands for any character, the regular expression “.u” can be used to get all names that have the letter ‘u’ as the second character. Give this a try by using “.u” for the regular expression in line 1 of the code from the previous example.

pattern = ".u"

The output will be:

Julia Smith 
Susanne Walker 

Similarly, we can use “..cha” to get all names that start with two arbitrary characters followed by the character sequence “cha”, resulting in “Michael Mason” and “Richard Robertson” being the only matches. By the way, it is strongly recommended that you experiment a bit in this section by modifying the patterns used in the examples. If in some case you don’t understand the results you are getting, feel free to post this as a question on the course forums.
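For experimenting with different patterns, it can be convenient to wrap the loop from the earlier example in a small helper function. The function matchingNames(…) below is a hypothetical helper written for this purpose, not something provided by the re module; it simply returns all names for which match(…) succeeds.

```python
import re

personList = ['Julia Smith', 'Francis Drake', 'Michael Mason',
              'Jennifer Johnson', 'John Williams', 'Susanne Walker',
              'Kermit the Frog', 'Dr. Melissa Franklin', 'Papa John',
              'Walter John Miller', 'Frank Michael Robertson', 'Richard Robertson',
              'Erik D. White', 'Vincent van Gogh', 'Dr. Dr. Matthew Malone',
              'Rebecca Clark']

def matchingNames(pattern, names):
    """Return all names for which the pattern matches at the beginning."""
    compiledRE = re.compile(pattern)
    return [name for name in names if compiledRE.match(name)]

# The two patterns discussed above.
print(matchingNames(".u", personList))
print(matchingNames("..cha", personList))
```

Calling the helper with other patterns from this section is a quick way to compare your expectations with the actual results.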

Maybe you are wondering how one would use the different special characters in the verbatim sense, e.g. to find all names that contain a dot. This is done by putting a backslash in front of them, so \. for the dot, \? for the question mark, and so on. If you want to match a single backslash in a regular expression, this needs to be represented by a double backslash in the regular expression. However, one has to be careful here when writing this regular expression as a string literal in the Python code: Because of the string escaping mechanism, a sequence of two backslashes will only produce a single backslash in the string character sequence. Therefore, you actually have to use four backslashes, "xyz\\\\xyz", to produce the correct regular expression involving a single backslash. Or you use a raw string in which escaping is disabled, so r"xyz\\xyz". Raw strings are also the safer choice for patterns like \. because recent Python versions issue a warning when a normal string literal contains a backslash sequence that is not a valid string escape. Here is one example that uses \. to search for names with a dot as the third character returning “Dr. Melissa Franklin” and “Dr. Dr. Matthew Malone” as the only results:

pattern = r"..\."
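The following sketch illustrates the escaping rules for a literal backslash described above. It compares the two ways of writing the same regular expression and applies it to a test string containing a single backslash; the variable names are chosen only for this illustration.

```python
import re

# Both literals produce the same regular expression xyz\\xyz,
# which matches the text xyz\xyz with one literal backslash.
escapedPattern = "xyz\\\\xyz"
rawPattern = r"xyz\\xyz"
print(escapedPattern == rawPattern)

# A test string with a single backslash, written as a raw string literal.
text = r"xyz\xyz"
print(len(text))
print(bool(re.match(rawPattern, text)))

# The escaped dot only matches a real dot as the third character.
print(bool(re.match(r"..\.", "Dr. Melissa Franklin")))
print(bool(re.match(r"..\.", "Francis Drake")))
```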

Next, let us combine the dot (.) with the star (*) symbol that stands for the repetition of the previous character. The pattern “.*John” can be used to find all names that contain the character sequence “John”. The .* at the beginning can match any sequence of characters of arbitrary length from the . class (so any available character). For instance, for the name “Jennifer Johnson”, the .* matches the sequence “Jennifer ” consisting of nine characters from the . class, and since this is followed by the character sequence “John”, the entire name matches the regular expression.

pattern = ".*John"
Output: 

Jennifer Johnson 
John Williams 
Papa John 
Walter John Miller 

Please note that the name “John Williams” is a valid match because the * also includes zero occurrences of the preceding character, so “.*John” will also match “John” at the beginning of a string.

The dot used in the previous examples is a special character for representing an entire class of characters, namely any character. It is also possible to define your own class of characters within a regular expression with the help of square brackets. For instance, [abco] stands for the class consisting of only the characters ‘a’, ‘b’, ‘c’ and ‘o’. When it is used in a regular expression, it matches any of these four characters. So the pattern “.[abco]” can, for instance, be used to get all names that have either ‘a’, ‘b’, ‘c’ or ‘o’ as the second character. This means using ...

pattern = ".[abco]" 

... we get the output:

John Williams 
Papa John 
Walter John Miller 

When defining classes, we can make use of ranges of characters denoted by a hyphen. For instance, the range m-o stands for the lower-case characters ‘m’, ‘n’, ‘o’. The class [m-oM-O.] would then consist of the characters ‘m’, ‘n’, ‘o’, ‘M’, ‘N’, ‘O’, and ‘.’. Please note that when a special character appears within the square brackets of a class definition (like the dot in this example), it is used in its verbatim sense. Try this idea of using ranges out with the following example:

pattern = "......[m-oM-O.]" 

The output will be...

Papa John 
Frank Michael Robertson 
Erik D. White 
Dr. Dr. Matthew Malone 

… because these are the only names that have a character from the class [m-oM-O.] as the seventh character.
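To see which characters belong to a class like [m-oM-O.], one can test single characters against it. The sketch below does this for a handful of sample characters picked for this illustration; note that the hyphen itself is not part of the class because it only denotes the ranges.

```python
import re

classPattern = re.compile("[m-oM-O.]")

# Sample characters to test, including some that are outside the class.
samples = "mnoMNO.pPx-"

# Keep the characters that the class matches.
inClass = [ch for ch in samples if classPattern.match(ch)]
print(inClass)
```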

In addition to the dot, there are more predefined classes of characters available in Python for cases that commonly appear in regular expressions. For instance, these can be used to match any digit or any non-digit. Predefined classes are denoted by a backslash followed by a particular character, like \d for a single decimal digit, so the characters 0 to 9. The following table lists the most important predefined classes:

Predefined Character Classes
Predefined class  Description
\d                stands for any decimal digit 0…9
\D                stands for any character that is not a digit
\s                stands for any whitespace character (whitespace characters include the space, tab, and newline character)
\S                stands for any non-whitespace character
\w                stands for any alphanumeric character (alphanumeric characters are all Latin letters a-z and A-Z, Arabic digits 0…9, and the underscore character; for normal Python 3 strings, letters and digits from other scripts are included as well unless the re.ASCII flag is used)
\W                stands for any non-alphanumeric character

To give one example, the following pattern can be used to get all names in which “John” appears not as a single word but as part of a longer name (either first or last name). This means it is followed by at least one character that is not a whitespace which is represented by the \S in the regular expression used. The only name that matches this pattern is “Jennifer Johnson”.

pattern = r".*John\S"
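As a rough way to explore the predefined classes from the table above, the following sketch tests a few sample characters against each class and collects the ones that match. The sample characters and the dictionary name results are assumptions made for this illustration.

```python
import re

# Sample characters: a digit, a letter, the underscore, a space, a tab, a hyphen.
samples = ["7", "x", "_", " ", "\t", "-"]
classes = [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W"]

results = {}
for cls in classes:
    # Collect all sample characters matched by the current class.
    results[cls] = [ch for ch in samples if re.match(cls, ch)]
    print(cls, results[cls])
```

Each upper-case class matches exactly the sample characters that its lower-case counterpart does not match.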

In addition to the *, there are more special characters for denoting certain cases of repetitions of a character or a group. + stands for arbitrarily many occurrences but, in contrast to *, the character or group needs to occur at least once. ? stands for zero or one occurrence of the character or group. That means it is used when a character or sequence of characters is optional in a pattern. Finally, the most general form {m,n} says that the previous character or group needs to occur at least m times and at most n times.

If we use “.+John” instead of “.*John” in an earlier example, we will only get the names that contain “John” but preceded by one or more other characters.

pattern = ".+John" 
Output: 

Jennifer Johnson 
Papa John 
Walter John Miller 

By writing ...

pattern = ".{11,11}[A-Z]" 

... we get all names that have an upper-case character as the 12th character. The result will be “Kermit the Frog”. This is a bit easier and less error-prone than writing “...........[A-Z]”.

Lastly, the pattern “.*li?a” can be used to get all names that contain the character sequences ‘la’ or ‘lia’.

pattern = ".*li?a"
Output: 

Julia Smith 
John Williams 
Rebecca Clark 
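The differences between the repetition operators become easier to see on very short test strings. The sketch below is one way to compare them; the candidate strings are made up, and each pattern ends in $ so that the whole string has to match.

```python
import re

# Made-up test strings with zero, one, two, and four occurrences of 'i'.
candidates = ["la", "lia", "liia", "liiiia"]
patterns = ["li?a$", "li+a$", "li*a$", "li{2,3}a$"]

results = {}
for pattern in patterns:
    compiledRE = re.compile(pattern)
    # Keep the candidates that the current pattern matches completely.
    results[pattern] = [s for s in candidates if compiledRE.match(s)]
    print(pattern, results[pattern])
```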

So far we have only used the different repetition matching operators *, +, {m,n}, and ? for occurrences of a single specific character. When used after a class, these operators stand for a certain number of occurrences of characters from that class. For instance, the following pattern can be used to search for names that contain a word that only consists of lower-case letters (a-z) like “Kermit the Frog” and “Vincent van Gogh”. We use \s to represent the required whitespaces before and after the word and then [a-z]+ for an arbitrarily long sequence of lower-case letters but consisting of at least one letter.

pattern = r".*\s[a-z]+\s"

Sequences of characters can be grouped together with the help of parentheses (…) and then be followed by a repetition operator to represent a certain number of occurrences of that sequence of characters. For instance, the following pattern can be used to get all names where the first name starts with the letter ‘M’ taking into account that names may have a ‘Dr. ’ as prefix. In the pattern, we use the group (Dr\.\s) followed by the ? operator to say that the name can start with that group but doesn’t have to. The dot of the title is escaped as \. so that it only matches a literal dot rather than an arbitrary character. Then we have the upper-case M followed by .*\s to make sure there is a whitespace character later in the string so that we can be reasonably sure this is the first name.

pattern = r"(Dr\.\s)?M.*\s"
Output: 

Michael Mason 
Dr. Melissa Franklin 

You may have noticed that there is a person with two doctor titles in the list whose first name also starts with an ‘M’ and that it is currently not captured by the pattern because the ? operator will match at most one occurrence of the group. By changing the ? to a * , we can match an arbitrary number of doctor titles.

pattern = r"(Dr\.\s)*M.*\s"
Output: 

Michael Mason 
Dr. Melissa Franklin 
Dr. Dr. Matthew Malone 

Similar to how we have the if-else statement to realize case distinctions in addition to loop-based repetitions in normal Python, regular expressions can make use of the | character to define alternatives. For instance, (nn|ss) can be used to get all names that contain either the sequence “nn” or the sequence “ss” (or both):

pattern = ".*(nn|ss)" 
Output: 

Jennifer Johnson 
Susanne Walker 
Dr. Melissa Franklin 

As we already mentioned, ^ and $ represent the beginning and end of a string, respectively. Let’s say we want to get all names from the list that end in “John”. This can be done using the following regular expression:

pattern = ".*John$" 
Output: 

Papa John 

Here is a more complicated example. We want all names that contain “John” as a single word independent of whether “John” appears at the beginning, somewhere in the middle, or at the end of the name. However, we want to exclude cases where “John” appears as part of a longer word (like “Johnson”). A first idea could be to use ".*\sJohn\s" to achieve this, making sure that there are whitespace characters before and after “John”. However, this will match neither “John Williams” nor “Papa John” because the beginning and end of the string are not whitespace characters. What we can do is use the pattern "(^|.*\s)John" to say that John needs to be preceded either by the beginning of the string or an arbitrary sequence of characters followed by a whitespace. Similarly, "John(\s|$)" requires that John is followed either by a whitespace or by the end of the string. Taken together we get the following regular expression:

pattern = r"(^|.*\s)John(\s|$)"
Output: 

John Williams 
Papa John 
Walter John Miller 

An alternative would be to use the regular expression "(.*\s)?John(\s.*)?$", which uses the optional operator ? rather than | . There are often several ways to express the same thing in a regular expression. Also, as you start to see here, the different special matching operators can be combined and nested to form arbitrarily complex regular expressions. You will practice writing regular expressions like this a bit more in the practice exercises and in the homework assignment.
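To check that the two formulations select the same names, one can run both against a few test names, as in the sketch below. The shortened name list and the variable names are chosen for this illustration only, and the comparison only shows agreement for these particular names.

```python
import re

# A few names from the list used in this section, including one without "John".
names = ["John Williams", "Jennifer Johnson", "Papa John",
         "Walter John Miller", "Julia Smith"]

patternA = r"(^|.*\s)John(\s|$)"    # version using alternatives
patternB = r"(.*\s)?John(\s.*)?$"   # version using optional groups

resultA = [name for name in names if re.match(patternA, name)]
resultB = [name for name in names if re.match(patternB, name)]

print(resultA)
print(resultA == resultB)
```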

In addition to the main special characters we explained in this section, there are certain extension operators available denoted as (?x...) where the x can be one of several special characters determining the meaning of the operator. We here just briefly want to mention the operator (?!...) for negative lookahead assertion because we will use it later in the lesson's walkthrough to filter files in a folder. Negative lookahead extension means that what comes before the (?!...) can only be matched if it isn't followed by the expression given for the ... . For instance, if we want to find all names that contain John but not followed by "son" as in "Johnson", we could use the following expression:

pattern = ".*John(?!son)" 
Output: 

John Williams
Papa John
Walter John Miller
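Since negative lookahead is mentioned here in the context of filtering files, the following sketch applies the same idea to a list of file names. The file names and the "_backup" naming scheme are invented for this illustration and are not taken from the lesson's walkthrough.

```python
import re

# Hypothetical file names, invented for this illustration.
fileNames = ["roads.shp", "roads_backup.shp", "roads_2020.shp",
             "rivers.shp", "roads.dbf"]

# Names that start with "roads", are not directly followed by "_backup",
# and end in ".shp".
compiledRE = re.compile(r"roads(?!_backup).*\.shp$")

selected = [name for name in fileNames if compiledRE.match(name)]
print(selected)
```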

If match(…) does not find a match, it will return the special value None. That’s why we can use it with an if-statement as we have been doing in all the previous examples. However, if a match is found it will not simply return True but a match object that can be used to get further information, for instance about which part of the string matched the pattern. The match object provides the methods group() for getting the matched part as a string, start() for getting the character index of the starting position of the match, end() for getting the character index of the end position of the match, and span() to get both start and end indices as a tuple. The example below shows how one would use the returned matching object to get further information and the output produced by its four methods for the pattern “John” matching the string “John Williams”:

pattern = "John" 
compiledRE = re.compile(pattern) 

for person in personList:
    match = compiledRE.match(person)
    if match:
        print(match.group())
        print(match.start())
        print(match.end())
        print(match.span())
Output: 

John      <- output of group() 
0         <- output of start() 
4         <- output of end() 
(0, 4)    <- output of span() 

In addition to match(…), there are three more matching functions defined in the re module. Like match(…), these all exist as standalone functions taking a regular expression and a string as parameters, and as methods to be invoked for a compiled pattern. Here is a brief overview:

  • search(…) - In contrast to match(…), search(…) tries to find matching locations anywhere within the string not just matches starting at the beginning. That means “^John” used with search(…) corresponds to “John” used with match(…), and “.*John” used with match(…) corresponds to “John” used with search(…). However, “corresponds” here only means that a match will be found in exactly the same cases but the output by the different methods of the returned matching object will still vary.
  • findall(…) - In contrast to match(…) and search(…), findall(…) will identify all substrings in the given string that match the regular expression and return these matches as a list.
  • finditer(…) - finditer(…) works like findall(…) but returns the matches found not as a list but as a so-called iterator object.
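The differences between the four functions can be seen in a short sketch like the following, which applies all of them to one example sentence. The sentence and the variable names are assumptions made for this illustration.

```python
import re

text = "Walter John Miller and Papa John"
compiledRE = re.compile("John")

# match(...) only looks at the beginning of the string, so it finds nothing here.
print(compiledRE.match(text))

# search(...) returns a match object for the first occurrence anywhere in the string.
firstMatch = compiledRE.search(text)
print(firstMatch.span())

# findall(...) returns all matching substrings as a list of strings.
print(compiledRE.findall(text))

# finditer(...) yields one match object per occurrence.
starts = [m.start() for m in compiledRE.finditer(text)]
print(starts)
```

Because finditer(…) provides match objects rather than plain strings, it is the option to use when the positions of the matches are needed.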

By now you should have enough understanding of regular expressions to cover maybe ~80 to 90% of the cases that you encounter in typical programming. However, there are quite a few additional aspects and details that we did not cover here that you potentially need when dealing with rather sophisticated cases of regular expression based matching. The full documentation of the “re” package can be found in the official Python documentation and is always a good source for looking up details when needed. In addition, the Regular Expression HOWTO that is part of the Python documentation provides a good overview.

We also want to mention that regular expressions are very common in programming and matching with them is very efficient, but they do have certain limitations in their expressivity. For instance, with the operators covered in this section, there is no compact way to write a regular expression for names in which the first and last name start with the same character, because the pattern would have to remember which character it matched earlier. Or, you cannot define a regular expression for all strings that are palindromes, so words that read the same forward and backward. For these kinds of patterns, certain extensions to the concept of a regular expression are needed. One such generalization is what is called recursive regular expressions. The regex Python package, currently under development, backward compatible with re, and planned to replace re at some point, has this capability, so feel free to check it out if you are interested in this topic.