AI For Trading: Regular Expressions of NLP (86)

Regular Expressions (Regexes)

Introduction

In the following lessons, we will learn how to create basic regular expressions in Python. Regular expressions, also known as regexes, are used to find patterns in text. In general, regexes work by first specifying the rules for the set of possible strings that you want to match and then asking questions such as "Is this pattern found at the beginning of this string?" or "Is there a match for this pattern anywhere in this string?". We will learn, for example, how to write regular expressions to find phone numbers, names, and email addresses.

By the end of this lesson, you should be able to read and write basic regular expressions in Python and know how to apply them to get useful financial information from 10-Ks.

Raw Strings

Before we dive in and start creating our first regular expression, let's take a quick look at Raw Strings, since we will be using them to create our regexes.

In Python, string literals are specified using either single quotes (') or double quotes ("), and the backslash (\) character is used to introduce escape sequences that have a special meaning, such as a newline (\n) or tab (\t). Let's see a simple example:

print('Hello\n\tWorld')
Hello
    World

We can clearly see that the \n has been printed as a new line and the \t as a tab. Python converts these escape sequences when it reads the string literal, before print() is ever called.

In some cases, however, you may want Python to interpret the string literally. This means that you don't want characters preceded by a backslash (\) to be interpreted as special characters. In these cases, you can prefix the string literal with the letter r. Such strings are known as Raw Strings and treat backslashes (\) as literal characters. To see how this works, let's print the same string literal we had before, but now as a raw string:

print(r'Hello\n\tWorld')
Hello\n\tWorld

We can clearly see that by adding an r before the first quote of the string literal, both \n and \t are no longer treated as special characters. It is important to note that the r doesn't change the type of the string literal; it only changes how the string literal is interpreted. So, without the r, backslashes are used to escape characters, and with the r, backslashes are treated as literal characters.

We will be using raw strings to create our regular expressions because regular expressions also use the backslash character (\) to indicate their own special characters. By using raw strings, we avoid the problem of Python interpreting the backslashes in our regexes before the re module gets to see them.
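To make the point above concrete, here is a minimal sketch comparing a raw string with its escaped equivalent. The variable names and the short text are illustrative assumptions, not part of the original notebook:

```python
import re

# A regular string needs a doubled backslash to hold one literal backslash;
# a raw string holds the same characters with a single backslash.
escaped = '\\n'   # two characters: a backslash and the letter n
raw = r'\n'       # the same two characters
newline = '\n'    # one character: a newline

print(len(escaped), len(raw), len(newline))
print(escaped == raw)

# Both spellings hand the regex engine the same pattern (a literal period)
text = 'Total: 25.99'
print(re.findall('\\.', text) == re.findall(r'\.', text))
```

The raw-string version is easier to read, which is why it is the usual choice for regex patterns.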

Finding Words

In this notebook we will learn how to find letters and words in a string using regular expressions. Throughout these lessons, we will use the re module from Python's standard library to work with regular expressions. The re module not only contains functions that allow us to check if a given regular expression matches a particular string, but also contains functions that allow us to modify strings in various ways.

Let’s begin by using a regular expression to find all the locations of a single letter in a given string. To do this, we will use the re.compile() function from the re module. The re.compile(pattern) function converts a regular expression pattern into a regular expression object. This allows us to save our regular expressions into objects that can be used later to perform pattern matching using various methods, such as .match(), .search(), .findall(), and .finditer(). Let’s see how this works.
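As a rough preview of how the four methods mentioned above differ, the sketch below runs each of them on the same compiled pattern. The variable names are illustrative; the methods themselves are part of the standard re module:

```python
import re

text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'a')

# .match() only checks the beginning of the string; 'A' is not 'a'
at_start = regex.match(text)
print(at_start)

# .search() returns the first match found anywhere in the string
first = regex.search(text)
print(first.span())

# .findall() returns the matched text as a list of strings
found = regex.findall(text)
print(found)

# .finditer() returns an iterator of match objects
spans = [m.span() for m in regex.finditer(text)]
print(spans)
```

In this sketch, .match() returns None, while the other three methods all locate the lowercase a characters.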

In the code below, we will find all the locations of the letter a in a string named sample_text. In this case, our regular expression pattern will just be 'a' and we will pass it to the re.compile() function as a raw string. We will save the regular expression object returned by the re.compile() function in a variable called regex. We will then use the .finditer() method to search our sample_text for the regular expression contained in the regex object. The .finditer() method returns an iterator over all the non-overlapping matches of our regular expression pattern in the string. It scans the string from left to right and returns the matches in the order found. Since the .finditer() method returns an iterator, we can loop through it to print all the matches, as shown below:

# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'a'
regex = re.compile(r'a')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(11, 12), match='a'>
<_sre.SRE_Match object; span=(17, 18), match='a'>
<_sre.SRE_Match object; span=(22, 23), match='a'>

We can see that each match corresponds to a Match Object with a given span and corresponding match. The span=(start, end) is a tuple that indicates the start and end indices of the given match in the string sample_text. The end index is exclusive, just as in Python slicing. For example, if we look at the span of the first match, we can see that the first a starts at index 6 and ends just before index 7. Therefore, if we slice the sample_text string from index 6 to 7, we will see that it corresponds to the letter a:

# Print the sample_text string from index 6 through 7
print(sample_text[6:7])
a
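The same slicing idea can be applied to every match at once. The following is a small sketch, assuming the match object's .start() and .end() methods (which return the two halves of .span()); the starts list is only there for illustration:

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'a')

for match in regex.finditer(sample_text):
    # .span() returns (start, end); .start() and .end() return each half
    start, end = match.span()
    # The end index is exclusive, so this slice is exactly the matched text
    print(start, end, sample_text[start:end])

# Collect only the start positions of the matches
starts = [m.start() for m in regex.finditer(sample_text)]
print(starts)
```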

Notice, however, that even though the first letter in our sample_text is an uppercase A, the .finditer() method didn't return it as a match. This is because regular expressions are case sensitive by default. Therefore, in order to match this uppercase A, we will need to use 'A' as our regular expression, as shown below:

# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'A'
regex = re.compile(r'A')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(0, 1), match='A'>

Notice that now, the .finditer() method only returned one match, since there is only one uppercase A in our sample_text. Also, notice that the span=(0,1) tells us that the uppercase A is the first letter in the sample_text string.

We should note that the re module allows us to perform case-insensitive searches by means of Flags. For example, we might want to search our string for the letter a, regardless of whether it is uppercase or lowercase. We will learn about flags in a later lesson.

Besides searching for a single letter, we can also search for groups of letters. This is done in exactly the same manner as with single letters. Let's search for the word walking in our sample_text string:
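As a brief preview only, here is a minimal sketch of a case-insensitive search using the re.IGNORECASE flag (also available under the short name re.I) from the standard re module:

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# Passing re.IGNORECASE as the second argument makes 'a' match 'A' as well
regex = re.compile(r'a', re.IGNORECASE)

found = regex.findall(sample_text)
print(found)
```

With the flag set, the uppercase A at the start of the string is matched along with the four lowercase ones.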

# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'walking'
regex = re.compile(r'walking')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])
<_sre.SRE_Match object; span=(21, 28), match='walking'>

Match from the original text: walking

Notice that we only get one match, since there is only one instance of the word walking in our sample_text. Also, notice that in the above example we used the match.span() method to get the start and end indices of our match.
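Slicing with the span is one way to recover the matched text. A match object also offers the .group() method, which returns the matched text directly. A short sketch, using re.search() to get the single match:

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# re.search() returns the first match object, or None if nothing matches
match = re.search(r'walking', sample_text)

start, end = match.span()
print(sample_text[start:end])   # slice the original string by the span
print(match.group())            # ask the match object for the matched text
```

Both print statements show the same word, so either approach can be used.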

When using regular expressions to search for groups of letters, we should note that the order of the letters matters. For example, if we were to search for ginwakl in our sample_text, we wouldn't find any matches even though the same letters appear in the word walking, as shown in the code below:

# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'ginwakl'
regex = re.compile(r'ginwakl')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

We can clearly see that there are no matches because the .finditer() method is looking for those letters in that particular order in our sample_text string.
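Because an empty iterator prints nothing, it can be useful to check for the absence of matches explicitly. The sketch below shows two hedged ways of doing so; the variable names are illustrative:

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'ginwakl')

# Turning the iterator into a list lets us count the matches
matches = list(regex.finditer(sample_text))
print(len(matches))

# .search() returns None when the pattern is not found anywhere
result = regex.search(sample_text)
if result is None:
    print('No match found')
```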

TODO: Find Words

In the cell below, the sample_text string contains the name Walter Brown written in a mixture of uppercase and lowercase letters. Write a regular expression that matches the name WaLtEr BroWN and save the regular expression object in a variable called regex. Then use the .finditer() method to search the sample_text string for the given regular expression. Then, write a loop to print all the matches found by the .finditer() method. Finally, use the match.span() method to print the match from the sample_text string.

# import re module
import re

# Sample text
sample_text = 'Alice and WaLtEr BroWN are talking with wAlTer Jackson.'

# Create a regular expression object with the regular expression
# The re.I flag makes the match case-insensitive
regex = re.compile(r'Walter Brown', re.I)

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])
<_sre.SRE_Match object; span=(10, 22), match='WaLtEr BroWN'>

Match from the original text: WaLtEr BroWN

Matching a Period (.)

Now, let's use a regular expression to find the period (.) at the end of our sample_text string. Let's search for the period in the same manner as we did for single letters:

# import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression '.'
regex = re.compile(r'.')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(0, 1), match='A'>
<_sre.SRE_Match object; span=(1, 2), match='l'>
<_sre.SRE_Match object; span=(2, 3), match='i'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(7, 8), match='n'>
<_sre.SRE_Match object; span=(8, 9), match='d'>
<_sre.SRE_Match object; span=(9, 10), match=' '>
<_sre.SRE_Match object; span=(10, 11), match='W'>
<_sre.SRE_Match object; span=(11, 12), match='a'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='t'>
<_sre.SRE_Match object; span=(14, 15), match='e'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
<_sre.SRE_Match object; span=(16, 17), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='a'>
<_sre.SRE_Match object; span=(18, 19), match='r'>
<_sre.SRE_Match object; span=(19, 20), match='e'>
<_sre.SRE_Match object; span=(20, 21), match=' '>
<_sre.SRE_Match object; span=(21, 22), match='w'>
<_sre.SRE_Match object; span=(22, 23), match='a'>
<_sre.SRE_Match object; span=(23, 24), match='l'>
<_sre.SRE_Match object; span=(24, 25), match='k'>
<_sre.SRE_Match object; span=(25, 26), match='i'>
<_sre.SRE_Match object; span=(26, 27), match='n'>
<_sre.SRE_Match object; span=(27, 28), match='g'>
<_sre.SRE_Match object; span=(28, 29), match=' '>
<_sre.SRE_Match object; span=(29, 30), match='t'>
<_sre.SRE_Match object; span=(30, 31), match='o'>
<_sre.SRE_Match object; span=(31, 32), match=' '>
<_sre.SRE_Match object; span=(32, 33), match='t'>
<_sre.SRE_Match object; span=(33, 34), match='h'>
<_sre.SRE_Match object; span=(34, 35), match='e'>
<_sre.SRE_Match object; span=(35, 36), match=' '>
<_sre.SRE_Match object; span=(36, 37), match='s'>
<_sre.SRE_Match object; span=(37, 38), match='t'>
<_sre.SRE_Match object; span=(38, 39), match='o'>
<_sre.SRE_Match object; span=(39, 40), match='r'>
<_sre.SRE_Match object; span=(40, 41), match='e'>
<_sre.SRE_Match object; span=(41, 42), match='.'>

We can clearly see that something has gone wrong: the .finditer() method has matched every single character in the sample_text string, including spaces, uppercase and lowercase letters, and the period at the end.

This is because, in regular expressions, the . is a special character known as a Metacharacter. Metacharacters are used to give special instructions and we will learn about them in the next lesson.
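To see what the unescaped . actually does, here is a small sketch on an illustrative string (not from the original notebook). By default, the . matches any single character except a newline:

```python
import re

# Illustrative text containing a period and a newline
text = 'a.b\nc'

found = re.findall(r'.', text)
print(found)
```

Every character is returned except the newline, which the . skips unless told otherwise.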

Finding MetaCharacters

# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression '\.'
regex = re.compile(r'\.')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(41, 42), match='.'>

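We can see that by escaping the period with a backslash (\.), the regular expression now treats it as a literal character rather than as a metacharacter, and we have managed to find only the period (.) at the end of the sample_text string, as was intended.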
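The difference between the escaped and unescaped period is easiest to see side by side. The following sketch uses a made-up string purely for illustration:

```python
import re

# Illustrative text: three similar tokens, only one contains a real period
text = 'abc a.c a-c'

# Unescaped: the . stands for any single character
any_char = re.findall(r'a.c', text)
print(any_char)

# Escaped: \. stands for a literal period only
literal = re.findall(r'a\.c', text)
print(literal)
```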

Find All The MetaCharacters

# Import re module
import re

# Sample text
sample_text = r'. ^ $ * + ? { } [ ] \ | ( )'

# Create a regular expression object with the regular expression 
regex = re.compile(r'\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

Output:
<_sre.SRE_Match object; span=(0, 27), match='. ^ $ * + ? { } [ ] \\ | ( )'>

Match from the original text: . ^ $ * + ? { } [ ] \ | ( )
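Escaping every metacharacter by hand is error-prone. As an alternative sketch, the standard library's re.escape() function builds an escaped pattern from a literal string. The exact escaped text it produces can vary between Python versions, so the example only checks that the resulting pattern matches the whole string:

```python
import re

# Raw string so the single backslash is kept as a literal character
sample_text = r'. ^ $ * + ? { } [ ] \ | ( )'

# re.escape() adds backslashes wherever the regex engine needs them
pattern = re.escape(sample_text)
regex = re.compile(pattern)

# .fullmatch() succeeds only if the pattern matches the entire string
match = regex.fullmatch(sample_text)
print(match.span())
```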

Find the price

# Import re module
import re

# Sample text
sample_text = 'John bought a winter coat for $25.99 dollars.'

# Create a regular expression object with the regular expression
regex = re.compile(r'\$25\.99')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])
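To illustrate why both the $ and the . need escaping here, the sketch below compares the escaped pattern with an unescaped one. An unescaped $ is a metacharacter that anchors the match to the end of the string, so the unescaped pattern finds nothing in this text:

```python
import re

sample_text = 'John bought a winter coat for $25.99 dollars.'

# Unescaped: $ means "end of string", so '2' can never follow it here
unescaped = re.findall(r'$25.99', sample_text)
print(unescaped)

# Escaped: \$ and \. are literal characters
escaped = re.search(r'\$25\.99', sample_text)
print(escaped.span())
print(escaped.group())
```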
