AI For Trading: Regular Expressions of NLP-2 (87)

2019-06-19 00:07:43

Searching For Simple Patterns

The first metacharacter we are going to look at is the backslash (\). We already saw that the backslash can be used to escape all the metacharacters, so that you can search for them directly. However, the backslash can also be followed by various characters to signal various special sequences. Here is a list of the special sequences we are going to look at in this notebook:

\d - Matches any decimal digit; this is equivalent to the set [0-9]
\D - Matches any non-digit character; this is equivalent to the set [^0-9]
\s - Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v]
\S - Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v]
\w - Matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
\W - Matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]

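We can see that there is a difference between lowercase and uppercase sequences. For example, while \d matches any digit, \D matches everything that is not a digit. Similarly, while \s matches any whitespace character, \S matches everything that is not a whitespace character; and while \w matches any alphanumeric character, \W matches everything that is not an alphanumeric character. Note that the set equivalences listed above are exact only for ASCII-only matching (for example, with the re.ASCII flag); by default, Python 3 str patterns use Unicode matching, so \d, \s, and \w also match non-ASCII digits, whitespace, and word characters.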
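
As a minimal sketch of how default (Unicode) matching differs from ASCII-only matching in Python 3, the snippet below runs \d over a made-up sample string. The sample string and variable names are assumptions for illustration; re.ASCII is the standard flag that restricts \d, \s, and \w to their ASCII sets.

```python
import re

# Hypothetical sample: an Arabic-Indic digit three (U+0663), a space, and an ASCII 3
sample = '\u0663 3'

# Default str patterns use Unicode matching, so both digits match \d
unicode_digits = re.findall(r'\d', sample)

# With re.ASCII, \d is restricted to the set [0-9]
ascii_digits = re.findall(r'\d', sample, flags=re.ASCII)

print(unicode_digits)
print(ascii_digits)
```

For plain English text like the examples in this notebook, both modes give the same matches.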

Let's start by learning how to use \d to search for decimal digits.

Matching Numbers Using `\d`

In the code below, we will use '\d' as our regular expression to find all the decimal digits in our sample_text string:

# Import re module
import re

# Sample text
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

# Create a regular expression object with the regular expression '\d'
regex = re.compile(r'\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(15, 16), match='1'>
<_sre.SRE_Match object; span=(16, 17), match='2'>
<_sre.SRE_Match object; span=(17, 18), match='3'>
<_sre.SRE_Match object; span=(18, 19), match='0'>
<_sre.SRE_Match object; span=(46, 47), match='1'>
<_sre.SRE_Match object; span=(47, 48), match='5'>
<_sre.SRE_Match object; span=(48, 49), match='6'>
<_sre.SRE_Match object; span=(49, 50), match='7'>
<_sre.SRE_Match object; span=(50, 51), match='8'>
<_sre.SRE_Match object; span=(51, 52), match='9'>

As we can see, all the matches found above correspond to only decimal digits between 0 and 9.
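
Printing the match objects shows everything at once, but often we only want the matched text or its position. A small sketch, reusing the same sample_text, of how match.group() and match.start() could be used for that (the list names digits and positions are just illustrative):

```python
import re

sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'
regex = re.compile(r'\d')

# match.group() returns the matched text; match.start() returns its start index
digits = [match.group() for match in regex.finditer(sample_text)]
positions = [match.start() for match in regex.finditer(sample_text)]

print(digits)
print(positions)
```

Note that finditer() returns an iterator that can only be consumed once, which is why it is called again for the second list.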

Conversely, if we wanted to find all the characters that are not decimal digits, we would use \D as our regular expression, as shown below:

# Import re module
import re

# Sample text
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

# Create a regular expression object with the regular expression '\D'
regex = re.compile(r'\D')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='A'>
<_sre.SRE_Match object; span=(1, 2), match='l'>
<_sre.SRE_Match object; span=(2, 3), match='i'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(6, 7), match='l'>
<_sre.SRE_Match object; span=(7, 8), match='i'>
<_sre.SRE_Match object; span=(8, 9), match='v'>
<_sre.SRE_Match object; span=(9, 10), match='e'>
<_sre.SRE_Match object; span=(10, 11), match='s'>
<_sre.SRE_Match object; span=(11, 12), match=' '>
<_sre.SRE_Match object; span=(12, 13), match='i'>
<_sre.SRE_Match object; span=(13, 14), match='n'>
<_sre.SRE_Match object; span=(14, 15), match=' '>
<_sre.SRE_Match object; span=(19, 20), match=' '>
<_sre.SRE_Match object; span=(20, 21), match='F'>
<_sre.SRE_Match object; span=(21, 22), match='i'>
<_sre.SRE_Match object; span=(22, 23), match='r'>
<_sre.SRE_Match object; span=(23, 24), match='s'>
<_sre.SRE_Match object; span=(24, 25), match='t'>
<_sre.SRE_Match object; span=(25, 26), match=' '>
<_sre.SRE_Match object; span=(26, 27), match='S'>
<_sre.SRE_Match object; span=(27, 28), match='t'>
<_sre.SRE_Match object; span=(28, 29), match='.'>
<_sre.SRE_Match object; span=(29, 30), match=','>
<_sre.SRE_Match object; span=(30, 31), match=' '>
<_sre.SRE_Match object; span=(31, 32), match='O'>
<_sre.SRE_Match object; span=(32, 33), match='c'>
<_sre.SRE_Match object; span=(33, 34), match='e'>
<_sre.SRE_Match object; span=(34, 35), match='a'>
<_sre.SRE_Match object; span=(35, 36), match='n'>
<_sre.SRE_Match object; span=(36, 37), match=' '>
<_sre.SRE_Match object; span=(37, 38), match='C'>
<_sre.SRE_Match object; span=(38, 39), match='i'>
<_sre.SRE_Match object; span=(39, 40), match='t'>
<_sre.SRE_Match object; span=(40, 41), match='y'>
<_sre.SRE_Match object; span=(41, 42), match=','>
<_sre.SRE_Match object; span=(42, 43), match=' '>
<_sre.SRE_Match object; span=(43, 44), match='M'>
<_sre.SRE_Match object; span=(44, 45), match='D'>
<_sre.SRE_Match object; span=(45, 46), match=' '>
<_sre.SRE_Match object; span=(52, 53), match='.'>

We can see that none of the matches are decimal digits. We also see that by using \D we were able to match all the other characters, including periods (.) and white spaces.

TODO: Find IP Addresses

In the cell below, our sample_text string contains three IP addresses. Write a single regular expression that can match any IP address and save the regular expression object in a variable called regex. Then use the .finditer() method to search the sample_text string for the given regular expression. Finally, write a loop to print all the matches found by the .finditer() method.

HINT: Use the special sequence \d and take advantage of the fact that all IP addresses have the same pattern.

# Import re module
import re

# Sample text
sample_text = 'Here are three IP address: 123.456.789.123, 999.888.777.666, 111.222.333.444'

# Create a regular expression object with the regular expression
regex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(27, 42), match='123.456.789.123'>
<_sre.SRE_Match object; span=(44, 59), match='999.888.777.666'>
<_sre.SRE_Match object; span=(61, 76), match='111.222.333.444'>

If you wrote your regex correctly you should see three matches above corresponding to the three IP addresses in our sample_text string.
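
The pattern above only checks the shape of the text (groups of digits separated by dots); it does not check that each group is a valid octet between 0 and 255. As a hedged sketch, one way to handle real addresses is to find candidates with a regular expression and then validate them with the standard ipaddress module. The sample string, the {1,3} quantifier (one to three repetitions), and the helper is_valid_ipv4 are assumptions introduced for this illustration and are not part of the exercise.

```python
import re
import ipaddress

# Hypothetical sample mixing valid and invalid dotted groups
sample_text = 'Servers: 192.168.0.1, 999.888.777.666, 10.0.0.254'

# \d{1,3} allows one to three digits per group; \b anchors both ends on word boundaries
regex = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

candidates = [match.group() for match in regex.finditer(sample_text)]

def is_valid_ipv4(text):
    """Return True if text parses as an IPv4 address (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

valid = [ip for ip in candidates if is_valid_ipv4(ip)]

print(candidates)
print(valid)
```

The regular expression does the searching and the ipaddress module does the range checking, which keeps the pattern short and readable.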

Matching Whitespace Characters Using `\s`

In the code below, we will use \s as our regular expression to find all the whitespace characters in our sample_text string. For this example, we will use a string literal that spans multiple lines. To create this multi-line string, we will use triple-quotes (''') both at the beginning and at the end of the multi-line string.

# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\s'
regex = re.compile(r'\s')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(1, 2), match='\t'>
<_sre.SRE_Match object; span=(7, 8), match=' '>
<_sre.SRE_Match object; span=(13, 14), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='\x0c'>
<_sre.SRE_Match object; span=(18, 19), match='\n'>
<_sre.SRE_Match object; span=(23, 24), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(33, 34), match='\r'>
<_sre.SRE_Match object; span=(34, 35), match='\n'>
<_sre.SRE_Match object; span=(40, 41), match=' '>
<_sre.SRE_Match object; span=(46, 47), match=' '>
<_sre.SRE_Match object; span=(49, 50), match=' '>
<_sre.SRE_Match object; span=(57, 58), match='\x0b'>
<_sre.SRE_Match object; span=(58, 59), match='\n'>

As we can see, all the matches found correspond to white spaces, tabs (\t), newlines (\n), carriage returns (\r), form feeds (\f), and vertical tabs (\v). Notice that form feeds appear as \x0c and vertical tabs as \x0b.
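
The \x0c and \x0b in the output are just the hexadecimal spellings of the same characters. A quick sketch to confirm this, and to check that each of the six characters in the set is matched by \s (the variable names are illustrative):

```python
import re

# '\f' and '\x0c' are the same form feed character;
# '\v' and '\x0b' are the same vertical tab character
same_form_feed = '\f' == '\x0c'
same_vertical_tab = '\v' == '\x0b'

# Each of these six characters should be matched by \s
whitespace_chars = [' ', '\t', '\n', '\r', '\f', '\v']
all_match = all(re.fullmatch(r'\s', ch) is not None for ch in whitespace_chars)

print(same_form_feed, same_vertical_tab, all_match)
```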

Conversely, if we wanted to find all the characters that are not whitespace characters, we would use \S as our regular expression, as shown below:

# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\S'
regex = re.compile(r'\S')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(2, 3), match='A'>
<_sre.SRE_Match object; span=(3, 4), match='l'>
<_sre.SRE_Match object; span=(4, 5), match='i'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='e'>
<_sre.SRE_Match object; span=(8, 9), match='l'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='v'>
<_sre.SRE_Match object; span=(11, 12), match='e'>
<_sre.SRE_Match object; span=(12, 13), match='s'>
<_sre.SRE_Match object; span=(14, 15), match='i'>
<_sre.SRE_Match object; span=(15, 16), match='n'>
<_sre.SRE_Match object; span=(16, 17), match=':'>
<_sre.SRE_Match object; span=(19, 20), match='1'>
<_sre.SRE_Match object; span=(20, 21), match='2'>
<_sre.SRE_Match object; span=(21, 22), match='3'>
<_sre.SRE_Match object; span=(22, 23), match='0'>
<_sre.SRE_Match object; span=(24, 25), match='F'>
<_sre.SRE_Match object; span=(25, 26), match='i'>
<_sre.SRE_Match object; span=(26, 27), match='r'>
<_sre.SRE_Match object; span=(27, 28), match='s'>
<_sre.SRE_Match object; span=(28, 29), match='t'>
<_sre.SRE_Match object; span=(30, 31), match='S'>
<_sre.SRE_Match object; span=(31, 32), match='t'>
<_sre.SRE_Match object; span=(32, 33), match='.'>
<_sre.SRE_Match object; span=(35, 36), match='O'>
<_sre.SRE_Match object; span=(36, 37), match='c'>
<_sre.SRE_Match object; span=(37, 38), match='e'>
<_sre.SRE_Match object; span=(38, 39), match='a'>
<_sre.SRE_Match object; span=(39, 40), match='n'>
<_sre.SRE_Match object; span=(41, 42), match='C'>
<_sre.SRE_Match object; span=(42, 43), match='i'>
<_sre.SRE_Match object; span=(43, 44), match='t'>
<_sre.SRE_Match object; span=(44, 45), match='y'>
<_sre.SRE_Match object; span=(45, 46), match=','>
<_sre.SRE_Match object; span=(47, 48), match='M'>
<_sre.SRE_Match object; span=(48, 49), match='D'>
<_sre.SRE_Match object; span=(50, 51), match='1'>
<_sre.SRE_Match object; span=(51, 52), match='5'>
<_sre.SRE_Match object; span=(52, 53), match='6'>
<_sre.SRE_Match object; span=(53, 54), match='7'>
<_sre.SRE_Match object; span=(54, 55), match='8'>
<_sre.SRE_Match object; span=(55, 56), match='9'>
<_sre.SRE_Match object; span=(56, 57), match='.'>

We can see that none of the matches above are whitespace characters. We also see that by using \S we were able to match all the other characters, including periods (.), letters, and numbers.

TODO: Print The Numbers Between Whitespace Characters

In the cell below, our sample_text consists of a multi-line string with numbers in between whitespace characters:

123 45  7895
1   222 33

Notice that not all the numbers have the same number of digits. For example, the first number (123) has three digits, but the second number (45) only has two digits.

Write a single regular expression that finds the tabs (\t) and the newlines (\n) in this multi-line string and save the regular expression object in a variable called regex. Then use the .finditer() method to search the sample_text string for the given regular expression. Then, write a loop that uses the span information from each match to only print the numbers found in the original multi-line string. Your code should work in the general case where the numbers can have any number of digits. For example, if the numbers in the string were to change your code should still be able to find them and print them. Finally, in this exercise you cannot use \d in your regular expression.

HINT: Notice that there are no white spaces in the multi-line string. Use the \s sequence to find the tabs and newlines. Then notice that you can use the span's end and start index from consecutive matches to figure out the number of digits of each number. Use these indices to print the numbers found in the original multi-line string. You can use the match.span() method we saw before to find the start and end indices of each match. Alternatively, you can also use the .start() and .end() methods to extract the start and end indices of each match. The match.start() is equivalent to match.span()[0] and match.end() is equivalent to match.span()[1].

# Import re module
import re

# Sample text
sample_text = '''
123\t45\t7895
1\t222\t33
'''

# Print sample_text
print('Sample Text:\n', sample_text)

# Create a regular expression object with the regular expression
regex = re.compile(r'\s')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# End index of the previous whitespace match (None until the first match is seen)
end_idx = None

# Write a loop to print all the numbers found in the original string
for match in matches:
    if end_idx is not None:
        # The number lies between the end of the previous match and the start of this one
        start_idx = match.start()
        print('\nNumbers from the original text:', sample_text[end_idx:start_idx])
    end_idx = match.end()

Sample Text:

123 45  7895
1   222 33

Numbers from the original text: 123

Numbers from the original text: 45

Numbers from the original text: 7895

Numbers from the original text: 1

Numbers from the original text: 222

Numbers from the original text: 33
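
Outside the constraints of this exercise, the same numbers can be pulled out more directly. The sketch below is an alternative, not the intended solution: it assumes the + quantifier (one or more repetitions), which this notebook has not introduced, and it also shows the plain str.split() method for comparison.

```python
import re

sample_text = '''
123\t45\t7895
1\t222\t33
'''

# \S+ matches each run of one or more non-whitespace characters
numbers = re.findall(r'\S+', sample_text)

# str.split() with no arguments also splits on runs of whitespace
numbers_split = sample_text.split()

print(numbers)
print(numbers_split)
```

Neither version uses \d, so both would still work if the numbers in the string changed.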

Matching Alphanumeric Characters Using `\w`

In the code below, we will use \w as our regular expression to find all the alphanumeric characters in our sample_text string. This includes the underscore ( _ ), all the numbers from 0 through 9, and all the uppercase and lowercase letters:

# Import re module
import re

# Sample text
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Create a regular expression object with the regular expression '\w'
regex = re.compile(r'\w')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 2), match='Y'>
<_sre.SRE_Match object; span=(2, 3), match='o'>
<_sre.SRE_Match object; span=(3, 4), match='u'>
<_sre.SRE_Match object; span=(5, 6), match='c'>
<_sre.SRE_Match object; span=(6, 7), match='a'>
<_sre.SRE_Match object; span=(7, 8), match='n'>
<_sre.SRE_Match object; span=(9, 10), match='c'>
<_sre.SRE_Match object; span=(10, 11), match='o'>
<_sre.SRE_Match object; span=(11, 12), match='n'>
<_sre.SRE_Match object; span=(12, 13), match='t'>
<_sre.SRE_Match object; span=(13, 14), match='a'>
<_sre.SRE_Match object; span=(14, 15), match='c'>
<_sre.SRE_Match object; span=(15, 16), match='t'>
<_sre.SRE_Match object; span=(17, 18), match='F'>
<_sre.SRE_Match object; span=(18, 19), match='A'>
<_sre.SRE_Match object; span=(19, 20), match='K'>
<_sre.SRE_Match object; span=(20, 21), match='E'>
<_sre.SRE_Match object; span=(22, 23), match='C'>
<_sre.SRE_Match object; span=(23, 24), match='o'>
<_sre.SRE_Match object; span=(24, 25), match='m'>
<_sre.SRE_Match object; span=(25, 26), match='p'>
<_sre.SRE_Match object; span=(26, 27), match='a'>
<_sre.SRE_Match object; span=(27, 28), match='n'>
<_sre.SRE_Match object; span=(28, 29), match='y'>
<_sre.SRE_Match object; span=(30, 31), match='a'>
<_sre.SRE_Match object; span=(31, 32), match='t'>
<_sre.SRE_Match object; span=(34, 35), match='f'>
<_sre.SRE_Match object; span=(35, 36), match='a'>
<_sre.SRE_Match object; span=(36, 37), match='k'>
<_sre.SRE_Match object; span=(37, 38), match='e'>
<_sre.SRE_Match object; span=(38, 39), match='_'>
<_sre.SRE_Match object; span=(39, 40), match='c'>
<_sre.SRE_Match object; span=(40, 41), match='o'>
<_sre.SRE_Match object; span=(41, 42), match='m'>
<_sre.SRE_Match object; span=(42, 43), match='p'>
<_sre.SRE_Match object; span=(43, 44), match='a'>
<_sre.SRE_Match object; span=(44, 45), match='n'>
<_sre.SRE_Match object; span=(45, 46), match='y'>
<_sre.SRE_Match object; span=(46, 47), match='1'>
<_sre.SRE_Match object; span=(47, 48), match='2'>
<_sre.SRE_Match object; span=(49, 50), match='e'>
<_sre.SRE_Match object; span=(50, 51), match='m'>
<_sre.SRE_Match object; span=(51, 52), match='a'>
<_sre.SRE_Match object; span=(52, 53), match='i'>
<_sre.SRE_Match object; span=(53, 54), match='l'>
<_sre.SRE_Match object; span=(55, 56), match='c'>
<_sre.SRE_Match object; span=(56, 57), match='o'>
<_sre.SRE_Match object; span=(57, 58), match='m'>

As we can see, all the matches found correspond to alphanumeric characters only, including the underscore in the email address.

Conversely, if we wanted to find all the characters that are not alphanumeric characters, we would use \W as our regular expression, as shown below:

# Import re module
import re

# Sample text
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Create a regular expression object with the regular expression '\W'
regex = re.compile(r'\W')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(4, 5), match=' '>
<_sre.SRE_Match object; span=(8, 9), match=' '>
<_sre.SRE_Match object; span=(16, 17), match=' '>
<_sre.SRE_Match object; span=(21, 22), match=' '>
<_sre.SRE_Match object; span=(29, 30), match=' '>
<_sre.SRE_Match object; span=(32, 33), match=':'>
<_sre.SRE_Match object; span=(33, 34), match='\n'>
<_sre.SRE_Match object; span=(48, 49), match='@'>
<_sre.SRE_Match object; span=(54, 55), match='.'>
<_sre.SRE_Match object; span=(58, 59), match='.'>
<_sre.SRE_Match object; span=(59, 60), match='\n'>

We can see that none of the matches are alphanumeric characters. We also see that by using \W we were able to match all the whitespace characters, as well as the colon (:), the periods (.), and the @ symbol in the email address.
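
Because \W matches everything that separates words, it can serve as a rough tokenizer. The sketch below is one possible use, assuming the + quantifier (one or more repetitions) and the standard re.split() function; the single-line sample string is a simplified version of the text above.

```python
import re

sample = 'You can contact FAKE Company at: fake_company12@email.com.'

# Split on runs of non-word characters; drop the empty string left after the final period
tokens = [token for token in re.split(r'\W+', sample) if token]

print(tokens)
```

Notice that the email address is broken into three tokens, because @ and . are non-word characters.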

TODO: Find emails

In the cell below, our sample_text consists of a multi-line string that contains three email addresses:

j.s@email.com
a.w@email.com
m.j@email.com

Notice that all three email addresses have the same pattern, namely, the first name initial, followed by a dot (.), followed by the last name initial, and ending in @email.com.

Take advantage of the fact that all three email addresses have the same pattern to write a single regular expression that can find all three email addresses in our sample_text string. As usual, save the regular expression object in a variable called regex. Then use the .finditer() method to search the sample_text string for the given regular expression. Finally, write a loop to print all the matches found by the .finditer() method.

# Import re module
import re

# Sample text
sample_text = '''
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com
'''

# Print sample_text
print('Sample Text:\n', sample_text)

# Create a regular expression object with the regular expression
regex = re.compile(r'\w\.\w@email\.com')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

Sample Text:

John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com

<_sre.SRE_Match object; span=(15, 28), match='j.s@email.com'>
<_sre.SRE_Match object; span=(44, 57), match='a.w@email.com'>
<_sre.SRE_Match object; span=(70, 83), match='m.j@email.com'>

If you wrote your regex correctly you should see three matches above corresponding to the three email addresses found in our sample_text string.
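
One detail worth checking in patterns like this is the dot: an unescaped dot is a metacharacter that matches any character except a newline, so only \. matches a literal period. A small sketch of the difference, using a made-up string that is not a valid address:

```python
import re

# Hypothetical string where the final dot of the domain has been replaced by 'X'
not_an_email = 'j.s@emailXcom'

unescaped = re.compile(r'\w\.\w@email.com')   # last dot matches any character
escaped = re.compile(r'\w\.\w@email\.com')    # last dot matches only a literal period

print(unescaped.search(not_an_email))
print(escaped.search(not_an_email))
print(escaped.search('j.s@email.com'))
```

The unescaped version accepts the malformed string, while the escaped version rejects it and still matches the real address.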

Word Boundaries

We will now learn about another special sequence that you can create using the backslash:

\b

This special sequence doesn't match any characters itself; instead, it matches a position, namely a word boundary. A word in this context is defined as a sequence of word characters (the alphanumeric characters and the underscore matched by \w). A boundary is the position between a word character and anything that is not one: a white space, any other non-word character, or the beginning or end of the string. We can have boundaries either before or after a word. Let's see how this works with an example.

In the code below, our sample_text string contains the following sentence:

The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.

As we can see, the word class appears in three different positions:

As a stand-alone word: The word class has white spaces both before and after it.
At the beginning of a word: The word class in classroom has a white space before it.
At the end of a word: The word class in subclass has a white space after it.

If we use class as our regular expression, we will match the word class in all three positions as shown in the code below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class'
regex = re.compile(r'class')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(47, 52), match='class'>
<_sre.SRE_Match object; span=(85, 90), match='class'>

We can see that we have three matches, corresponding to all the instances of the word class in our sample_text string.

Now, let's use word boundaries to only find the word class when it appears in particular positions. Let’s start by using \b to only find the word class when it appears at the beginning of a word. We can do this by adding \b before the word class in our regular expression as shown below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\bclass'
regex = re.compile(r'\bclass')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(47, 52), match='class'>

We can see that now we only have two matches because it's only matching the stand-alone word, class, and the class in classroom since both of them have a word boundary (in this case a white space) directly before them. We can also see that it is not matching the class in subclass because there is no word boundary directly before it.

Now, let's use \b to only find the word class when it appears at the end of a word. We can do this by adding \b after the word class in our regular expression as shown below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class\b'
regex = re.compile(r'class\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>
<_sre.SRE_Match object; span=(85, 90), match='class'>

We can see that we again have two matches: the stand-alone word, class, and the class in subclass, since both of them have a word boundary (in this case a white space) directly after them. The class in classroom is not matched because there is no word boundary directly after it.

Finally, let's use \b to only find the word class when it appears as a stand-alone word. We can do this by adding \b both before and after the word class in our regular expression as shown below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\bclass\b'
regex = re.compile(r'\bclass\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 17), match='class'>

We can see that now we only have one match because the stand-alone word, class, is the only one that has a word boundary (in this case a white space) directly before and after it.
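
Two details about \b are easy to miss, and the sketch below illustrates both under a made-up sample sentence. First, \b is zero-width, so punctuation next to a word counts as a boundary but is not part of the match. Second, the r prefix matters: in an ordinary (non-raw) Python string, '\b' is the backspace character rather than a word boundary.

```python
import re

# Hypothetical sample where class is followed by punctuation instead of a white space
sample_text = 'Is this the class? Yes, class.'

# \b is zero-width: the matched text is only 'class', without the punctuation
matches = [m.group() for m in re.finditer(r'\bclass\b', sample_text)]

# Without the r prefix, '\b' is the backspace character (\x08)
is_backspace = '\b' == '\x08'
no_match = re.search('\bclass', sample_text)

print(matches)
print(is_backspace, no_match)
```

This is why all the patterns in this notebook are written as raw strings.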

TODO: Find All 3-Letter Words

# Import re module
import re

# Sample text
sample_text = 'John went to the store in his car, but forgot to buy bread.'

# Create a regular expression object with the regular expression
regex = re.compile(r'to')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(10, 12), match='to'>
<_sre.SRE_Match object; span=(18, 20), match='to'>
<_sre.SRE_Match object; span=(46, 48), match='to'>
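
The cell above searches for the literal string to, which also matches inside store. As a sketch of one possible way to find the 3-letter words named in the heading, we can put three \w sequences between two word boundaries. This is an illustration rather than the official solution; keep in mind that \w also matches digits and the underscore, so it would accept tokens such as a1_.

```python
import re

sample_text = 'John went to the store in his car, but forgot to buy bread.'

# Exactly three word characters with a word boundary on each side
regex = re.compile(r'\b\w\w\w\b')

three_letter_words = [match.group() for match in regex.finditer(sample_text)]

print(three_letter_words)
```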

Not A Word Boundary

As with the other special sequences that we saw before, we also have the uppercase version of \b, namely:

\B

As with the other special sequences, \B indicates the opposite of \b. So if \b is used to indicate a word boundary, \B is used to indicate not a word boundary. Let's see how this works:

Let's use \B to only find the word class when it doesn't have a word boundary directly before it. We can do this by adding \B before the word class in our regular expression as shown below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\Bclass'
regex = re.compile(r'\Bclass')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(85, 90), match='class'>

We can see that we only get one match because the class in subclass is the only one that doesn't have a word boundary directly before it.

Now, let's use \B to only find the word class when it doesn't have a word boundary directly after it. We can do this by adding \B after the word class in our regular expression as shown below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class\B'
regex = re.compile(r'class\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(47, 52), match='class'>

We can see that we only get one match because the class in classroom is the only one that doesn't have a word boundary directly after it.

Finally, let's use \B to only find the word class when it doesn't have a word boundary directly before or after it. We can do this by adding \B both before and after the word class in our regular expression as shown below:

# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\Bclass\B'
regex = re.compile(r'\Bclass\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

In this case, we can see that we get no matches. This is because every instance of the word class in our sample_text string has a boundary either before or after it. In order to have a match in this case, the word class has to appear in the middle of a word, such as in the word declassified. Let's see an example:

# Import re module
import re

# Sample text
sample_text = 'declassified'

# Create a regular expression object with the regular expression '\Bclass\B'
regex = re.compile(r'\Bclass\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(2, 7), match='class'>

TODO: Finding Last Digits

# Import re module
import re

# Sample text
sample_text = '203 3 403 687 283 234 983 345 23 3 74 978'

# Create a regular expression object with the regular expression
regex = re.compile(r'\B3\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Set counter
num_matches = 0

# Print all the matches
for match in matches:
    print(match)
    num_matches += 1

# Print the total number of matches    
print('\nTotal Number of Matches:', num_matches)

<_sre.SRE_Match object; span=(2, 3), match='3'>
<_sre.SRE_Match object; span=(8, 9), match='3'>
<_sre.SRE_Match object; span=(16, 17), match='3'>
<_sre.SRE_Match object; span=(24, 25), match='3'>
<_sre.SRE_Match object; span=(31, 32), match='3'>

Total Number of Matches: 5

If you wrote your code correctly you should get a total of 5 matches.
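
To see what the leading \B contributes in the pattern above, the following sketch compares it with the pattern 3\b on the same sample_text. The variable names are illustrative.

```python
import re

sample_text = '203 3 403 687 283 234 983 345 23 3 74 978'

# 3\b matches every 3 that ends a number, including the stand-alone 3s
ends_with_three = re.findall(r'3\b', sample_text)

# \B3\b additionally requires that there is no word boundary before the 3,
# so the 3 must be preceded by another digit
last_digit_three = re.findall(r'\B3\b', sample_text)

print(len(ends_with_three))
print(len(last_digit_three))
```

The difference between the two counts comes from the two stand-alone 3s, which have a word boundary on both sides.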

Character Sets

Matching Phone Numbers

# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555- 123- 4567
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)
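
A character set such as [- ] matches exactly one character, either a dash or a white space. As written, the sample string in the cell above has a dash followed by a white space between the groups, so that pattern finds no match there. The sketch below uses a made-up sample with a single separator between groups to show the intended behavior; the names and numbers are hypothetical.

```python
import re

# Hypothetical sample with one separator character between digit groups
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 555 123 4567
Mr. Jackson: 555.123.4567
'''

# [- ] matches a single dash or a single white space
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

phone_numbers = [match.group() for match in regex.finditer(sample_text)]

print(phone_numbers)
```

The number that uses periods is skipped because a period is not in the character set.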

# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213(555)999-8464
'''

# Create a regular expression object with a regular expression
regex = re.compile(r'\+\d+[- (]\d+[- )]\d+[- ]\d+')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

Output:

<_sre.SRE_Match object; span=(12, 27), match='+1-555-123-4567'>
<_sre.SRE_Match object; span=(40, 56), match='+61 455 555 4549'>
<_sre.SRE_Match object; span=(70, 87), match='+375-655-777-7346'>
<_sre.SRE_Match object; span=(100, 117), match='+213(555)999-8464'>

Author: Corwien
