AI For Trading: Applying Regexes To 10-Ks (89)

Applying Regexes To 10-Ks

Introduction

In this notebook you will apply regexes to find useful financial information in 10-Ks. In particular, you will use what you learned in previous lessons to extract text from Items 1A, 7, and 7A.

Getting The HTML File

In this lesson, we will be working with Apple's 2018 10-K. In the code below, we will use the requests library to get the HTML data for this 10-K directly from the SEC website. We will learn more about the requests library in a later lesson. We will save the HTML data into a string variable named raw_10k, as shown below:

# Import requests
import requests

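# Get the HTML data from the 2018 10-K from Apple.
# Note: SEC EDGAR may reject requests that lack a descriptive User-Agent header.
# The value below is a placeholder (an assumption); replace it with your own contact details.
headers = {'User-Agent': 'Your Name your.email@example.com'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt', headers=headers)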
raw_10k = r.text

If we print the raw_10k string we will see that it has many sections. In the code below, we print part of the raw_10k string:

print(raw_10k[0:2000])
<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105
<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105
<ACCEPTANCE-DATETIME>20181105080140
ACCESSION NUMBER:       0000320193-18-000145
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      88
CONFORMED PERIOD OF REPORT: 20180929
FILED AS OF DATE:       20181105
DATE AS OF CHANGE:      20181105

FILER:

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         APPLE INC
        CENTRAL INDEX KEY:          0000320193
        STANDARD INDUSTRIAL CLASSIFICATION: ELECTRONIC COMPUTERS [3571]
        IRS NUMBER:             942404110
        STATE OF INCORPORATION:         CA
        FISCAL YEAR END:            0930

    FILING VALUES:
        FORM TYPE:      10-K
        SEC ACT:        1934 Act
        SEC FILE NUMBER:    001-36743
        FILM NUMBER:        181158788

    BUSINESS ADDRESS:   
        STREET 1:       ONE APPLE PARK WAY
        CITY:           CUPERTINO
        STATE:          CA
        ZIP:            95014
        BUSINESS PHONE:     (408) 996-1010

    MAIL ADDRESS:   
        STREET 1:       ONE APPLE PARK WAY
        CITY:           CUPERTINO
        STATE:          CA
        ZIP:            95014

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  APPLE COMPUTER INC
        DATE OF NAME CHANGE:    19970808
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>a10-k20189292018.htm
<DESCRIPTION>10-K
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <!-- Document created using Wdesk 1 -->
        <!-- Copyright 2018 Workiva -->
        <title>Document</title>
    </head>
    <body style="font-family:Times New Roman;font-size:10pt;">
<div><a name="s3540C27286EF5B0DA103CC59028B96BE"></a></div><div style="line-height:120%;text-align:center;font-size:10pt;"><div style="padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;"><table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;"><tr><td colspan="1"></td></tr><tr><td style="width:100%;"></td></tr><tr><td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000

TODO: Regexes for Tags

For our purposes, we are only interested in the sections that contain the 10-K information. All the sections, including the 10-K, are contained within the <DOCUMENT> and </DOCUMENT> tags. Each section within the document tags is clearly marked by a <TYPE> tag followed by the name of the section. In the code below, write three regular expressions:

  1. A regex to find the <DOCUMENT> tag

  2. A regex to find the </DOCUMENT> tag

  3. A regex to find all the sections marked by the <TYPE> tag

# import re module
import re

# Write regexes
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
type_pattern = re.compile(r'<TYPE>[^\n]+')
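As a quick sanity check, the three patterns can be tried on a tiny string first. The sketch below uses a made-up miniature filing (the `sample` string is an assumption that only mimics the layout of raw_10k, not real SEC data):

```python
import re

# Hypothetical miniature filing that mimics the layout of raw_10k
sample = (
    "<SEC-HEADER>header</SEC-HEADER>\n"
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report body</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>exhibit body</TEXT>\n</DOCUMENT>\n"
)

doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
type_pattern = re.compile(r'<TYPE>[^\n]+')

# '<DOCUMENT>' does not match inside '</DOCUMENT>' because of the slash
n_starts = len(doc_start_pattern.findall(sample))
n_ends = len(doc_end_pattern.findall(sample))
found_types = type_pattern.findall(sample)
print(n_starts, n_ends, found_types)
```

The `[^\n]+` part of type_pattern consumes everything up to the end of the line, which is why each match contains the tag together with the section name.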

TODO: Create Lists with Span Indices

Now that you have the regexes, use the .finditer() method to match them against raw_10k. In the code below, create three lists:

  1. A list that holds the .end() index of each match of doc_start_pattern

  2. A list that holds the .start() index of each match of doc_end_pattern

  3. A list that holds the section name from each match of type_pattern

# Create 3 lists with the span indices for each regex
doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]
doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]

# print(doc_end_is)
# print(doc_types)
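The reason for taking .end() of the opening tag and .start() of the closing tag is that slicing between those two indices returns only the content between the tags. A minimal sketch on a toy string (not real filing data):

```python
import re

text = "<DOCUMENT>abc</DOCUMENT>"  # toy string, not real filing data

start_match = re.search(r'<DOCUMENT>', text)
end_match = re.search(r'</DOCUMENT>', text)

# .end() of the opening tag is the index just past '<DOCUMENT>'
# .start() of the closing tag is the index where '</DOCUMENT>' begins
inner = text[start_match.end():end_match.start()]
print(start_match.span(), end_match.span(), inner)
```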

TODO: Create a Dictionary for the 10-K

In the code below, create a dictionary that has the key 10-K and, as its value, the contents of the 10-K section found above. To do this, loop through all the sections found above and, if the section type is 10-K, save it to the dictionary. Use the indices in doc_start_is and doc_end_is to slice the raw_10k string.

document = {}

# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start_i, doc_end_i in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[doc_start_i:doc_end_i]

# display the document
# document
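The loop above relies on zip(), which silently stops at the shortest list, so it can be worth checking that the three lists line up. The sketch below wraps the same idea in a hypothetical helper, split_sections (a name introduced here for illustration), and runs it on a made-up sample string:

```python
import re

def split_sections(raw_text):
    """Return {section type: section text} for every <DOCUMENT> block (hypothetical helper)."""
    starts = [m.end() for m in re.finditer(r'<DOCUMENT>', raw_text)]
    ends = [m.start() for m in re.finditer(r'</DOCUMENT>', raw_text)]
    # .strip() drops stray whitespace such as a trailing '\r' after the type name
    types = [m.group(1).strip() for m in re.finditer(r'<TYPE>([^\n]+)', raw_text)]
    # zip() silently truncates to the shortest list, so check the lengths first
    if not (len(starts) == len(ends) == len(types)):
        raise ValueError('mismatched <DOCUMENT>, </DOCUMENT>, and <TYPE> counts')
    return {t: raw_text[s:e] for t, s, e in zip(types, starts, ends)}

# Made-up sample that mimics the filing layout
sample = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>subsidiaries</TEXT>\n</DOCUMENT>\n"
)
sections = split_sections(sample)
print(sorted(sections))
```

Note that a dictionary keeps one value per key, so if a filing contains several sections of the same type, later ones overwrite earlier ones in this sketch.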

TODO: Find Item 1A, 7, and 7A

Our task now is to use regular expressions to find the items of interest. The item headings in this document can appear in four different patterns. For example, Item 1A can be found in any of the following forms:

  1. >Item 1A

  2. >Item&#160;1A

  3. >Item&nbsp;1A

  4. ITEM 1A

In the code below write a single regular expression that can match all four patterns for Items 1A, 7, and 7A. Then use the .finditer() method to match the regex to document['10-K']. Finally create a for loop to print the matches.

# Write the regex
regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|7A|7)\.{0,1})|(ITEM\s(1A|7A|7))')

# Use finditer to match the regex
matches = regex.finditer(document['10-K'])

# Write a for loop to print the matches
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(38318, 38327), match='>Item 1A.'>
<_sre.SRE_Match object; span=(46148, 46156), match='>Item 7.'>
<_sre.SRE_Match object; span=(47281, 47290), match='>Item 7A.'>
<_sre.SRE_Match object; span=(119131, 119140), match='>Item 1A.'>
<_sre.SRE_Match object; span=(333318, 333326), match='>Item 7.'>
<_sre.SRE_Match object; span=(729984, 729993), match='>Item 7A.'>

If your regex is written correctly, the only matches above should be those for Items 1A, 7, and 7A. Notice also that each item is matched twice: once in the index (the table of contents) and once at the start of the corresponding section. We now have to remove the matches that correspond to the index, which we will do using Pandas in the next section.
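To check that a single regex really covers all four heading variants, it can be run against one short snippet per variant. The snippets below are made up for illustration; the last two are included as cases that should not match:

```python
import re

regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|7A|7)\.{0,1})|(ITEM\s(1A|7A|7))')

# Hypothetical snippets: one per heading variant, plus two that should not match
samples = [
    '<b>Item 1A.</b>',
    '<b>Item&#160;7.</b>',
    '<b>Item&nbsp;7A.</b>',
    'ITEM 7A',
    '<b>Item 8.</b>',
    'see Item 7',
]
matched = [m.group() if m else None for m in map(regex.search, samples)]
print(matched)
```

The order of the alternatives in (1A|7A|7) matters: alternatives are tried left to right, so if 7 came before 7A, a heading such as ">Item 7A." would be matched only as ">Item 7".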

Remove Matches that Correspond to the Index

We will remove the matches that correspond to the index using a Pandas Dataframe. We will do this in a couple of steps.

TODO: Create a Pandas DataFrame

In the code below, create a pandas dataframe with the following column names: 'item','start','end'. In the item column save the match.group() in lowercase letters, in the start column save the match.start(), and in the end column save the match.end().

# import pandas
import pandas as pd

# Matches
matches = regex.finditer(document['10-K'])

# Create the dataframe
test_df = pd.DataFrame([(x.group(),x.start(),x.end()) for x in matches])
test_df.columns = ['item','start','end']
test_df['item'] = test_df.item.str.lower()

# Display the dataframe
test_df
        item   start     end
0  >item 1a.   38318   38327
1   >item 7.   46148   46156
2  >item 7a.   47281   47290
3  >item 1a.  119131  119140
4   >item 7.  333318  333326
5  >item 7a.  729984  729993

TODO: Eliminate Unnecessary Characters

As we can see, the item column of our dataframe contains some unnecessary characters, such as > and periods (.). In some cases we will also get HTML entities for the non-breaking space, such as &#160; and &nbsp;. In the code below, use the Pandas dataframe method .replace() with the keyword regex=True to remove all whitespace, the above-mentioned entities, the > character, and the periods from our dataframe. We want to do this because we will use the item column as our dataframe index later on.

# Get rid of unnecessary characters from the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
test_df.replace(r'\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)

# display the dataframe
test_df
     item   start     end
0  item1a   38318   38327
1   item7   46148   46156
2  item7a   47281   47290
3  item1a  119131  119140
4   item7  333318  333326
5  item7a  729984  729993
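As an alternative sketch (not the notebook's own solution), the same cleanup can be done in one pass by combining the unwanted pieces into a single alternation and applying it only to the item strings. The Series below holds made-up values covering the heading variants discussed above:

```python
import pandas as pd

# Hypothetical raw matches, already lower-cased
items = pd.Series(['>item 1a.', '>item&#160;7.', '>item&nbsp;7a.', 'item 7a'])

# One alternation removes entities, whitespace, periods, and '>' in a single pass
cleaned = items.str.replace(r'&#160;|&nbsp;|\s|\.|>', '', regex=True)
print(cleaned.tolist())
```

Restricting the replacement to the item column makes it explicit that the numeric start and end columns are not meant to be touched.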

TODO: Remove Duplicates

Now that we have removed all unnecessary characters from our dataframe, we can go ahead and remove the Item matches that correspond to the index. In the code below, use the Pandas dataframe .drop_duplicates() method to keep only the last match for each Item and drop the rest. As a precaution, make sure that the start column is sorted in ascending order before dropping the duplicates.

# Drop duplicates
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'],keep='last')

# Display the dataframe
pos_dat
     item   start     end
3  item1a  119131  119140
4   item7  333318  333326
5  item7a  729984  729993
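The sorting step matters because keep='last' refers to row order, not to the largest start value. A small sketch with a made-up, deliberately unsorted dataframe shows the difference:

```python
import pandas as pd

# Made-up positions, deliberately out of order
toy = pd.DataFrame({
    'item': ['item7', 'item1a', 'item1a', 'item7'],
    'start': [900, 100, 500, 300],
})

# Sorted first: the last row per item is the one with the largest start
kept = toy.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')

# Not sorted: the last row per item depends on the original row order
unsorted_kept = toy.drop_duplicates(subset=['item'], keep='last')

print(kept)
print(unsorted_kept)
```

Keeping the last occurrence assumes that the final mention of each heading is the section itself; if a filing repeats a heading later in the text, this heuristic would need adjusting.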

TODO: Set Item to Index

In the code below use the Pandas dataframe .set_index() method to set the item column as the index of our dataframe.

# Set item as the dataframe index
pos_dat.set_index('item', inplace=True)

# display the dataframe
pos_dat
         start     end
item
item1a  119131  119140
item7   333318  333326
item7a  729984  729993

TODO: Get The Financial Information From Each Item

The above dataframe contains the starting and end index of each match for Items 1A, 7, and 7A. In the code below, save all the text from the starting index of item1a till the starting index of item7 into a variable called item_1a_raw. Similarly, save all the text from the starting index of item7 till the starting index of item7a into a variable called item_7_raw. Finally, save all the text from the starting index of item7a till the end of document['10-K'] into a variable called item_7a_raw. You can accomplish all of this by making the correct slices of document['10-K'].

# Get Item 1a
item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item7']]

# Get Item 7
item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]

# Get Item 7a
item_7a_raw = document['10-K'][pos_dat['start'].loc['item7a']:]
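The three slices follow one rule: each item runs from its own start index to the start of the next item, and the last item runs to the end of the text. A minimal sketch of that rule, using a hypothetical helper slice_items and a toy string in place of document['10-K']:

```python
import pandas as pd

def slice_items(text, positions):
    """Slice text between consecutive 'start' offsets (hypothetical helper)."""
    ordered = positions.sort_values('start')
    starts = ordered['start'].tolist()
    # Each item ends where the next begins; the last item runs to the end of text
    ends = starts[1:] + [len(text)]
    return {name: text[s:e] for name, s, e in zip(ordered.index, starts, ends)}

# Toy text and positions standing in for document['10-K'] and pos_dat
toy_text = 'intro|1A risk text|7 discussion|7A market risk'
toy_pos = pd.DataFrame(
    {'start': [toy_text.index('1A'), toy_text.index('7 '), toy_text.index('7A')]},
    index=['item1a', 'item7', 'item7a'],
)
parts = slice_items(toy_text, toy_pos)
print(parts)
```

One consequence of this rule is that the last slice, here item7a, contains everything that follows its heading, including any later items of the 10-K.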

TODO: Display Item 1a

Now that we have each item saved into a separate variable we can view them separately. For illustration purposes we will display Item 1a, but the other items will look similar.

# Display Item 1a
item_1a_raw[0:300]
'>Item 1A.</font></div></td><td style="vertical-align:top;"><div style="line-height:120%;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;">Risk Factors</font></div></td></tr></table><div style="line-height:120%;padding-top:8px;text-align'

We can see that the items look pretty messy: they contain HTML tags, character entities such as &#160;, and so on. Before we can do any proper Natural Language Processing on these items, we need to clean them up by removing all of this markup. In principle we could do this using regex substitutions, as we learned previously, but this can be rather difficult. Luckily, packages already exist that can do the cleaning for us, such as BeautifulSoup, which will be the topic of our next lessons.
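To give a feel for the regex-substitution route, here is a rough sketch using only the standard library. The helper strip_html and the sample string are assumptions made for illustration; this is not a robust HTML parser and will miss cases such as comments, scripts, or a > inside an attribute value:

```python
import html
import re

def strip_html(fragment):
    """Rough regex-based cleanup (a sketch, not a robust HTML parser)."""
    text = re.sub(r'<[^>]+>', ' ', fragment)   # replace each tag with a space
    text = html.unescape(text)                 # decode entities such as &#160; and &nbsp;
    # \s also matches the non-breaking space produced by the entities above
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

# Made-up fragment in the style of the raw item text
sample = '<font>Item&#160;1A.</font></div><div style="font-size:9pt;"><font>Risk&nbsp;Factors</font></div>'
print(strip_html(sample))
```

Even this small example needs three separate steps, which hints at why a dedicated parser is usually the more comfortable choice.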
