AI For Trading: Applying Regexes To 10-Ks
Applying Regexes To 10-Ks
Introduction
In this notebook you will apply regexes to find useful financial information in 10-Ks. In particular, you will use what you learned in previous lessons to extract text from Items 1A, 7, and 7A.
Getting The HTML File
In this lesson, we will be working with the 2018 10-K from Apple. In the code below, we use the requests library to get the HTML data for this 10-K directly from the SEC website (we will learn more about the requests library in a later lesson). We save the HTML data into a string variable named raw_10k, as shown below:
# Import requests
import requests
# Get the HTML data from the 2018 10-K from Apple
r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt')
raw_10k = r.text
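A practical note: at the time of writing, SEC EDGAR asks automated clients to identify themselves with a User-Agent header, and bare requests without one may be rejected. The sketch below shows how the request could be adapted; the contact string is a placeholder you would replace with your own details, and the actual request lines are left commented out so the snippet does not hit the network.

```python
# At the time of writing, SEC EDGAR asks automated clients to identify
# themselves via a User-Agent header; requests without one may be blocked.
# The contact string below is a placeholder -- substitute your own details.
headers = {'User-Agent': 'Your Name yourname@example.com'}
url = ('https://www.sec.gov/Archives/edgar/data/320193/'
       '000032019318000145/0000320193-18-000145.txt')

# import requests
# r = requests.get(url, headers=headers)
# raw_10k = r.text
```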
If we print the raw_10k string we will see that it has many sections. In the code below, we print part of the raw_10k string:
print(raw_10k[0:2000])
<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105
<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105
<ACCEPTANCE-DATETIME>20181105080140
ACCESSION NUMBER: 0000320193-18-000145
CONFORMED SUBMISSION TYPE: 10-K
PUBLIC DOCUMENT COUNT: 88
CONFORMED PERIOD OF REPORT: 20180929
FILED AS OF DATE: 20181105
DATE AS OF CHANGE: 20181105
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: APPLE INC
CENTRAL INDEX KEY: 0000320193
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRONIC COMPUTERS [3571]
IRS NUMBER: 942404110
STATE OF INCORPORATION: CA
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 10-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-36743
FILM NUMBER: 181158788
BUSINESS ADDRESS:
STREET 1: ONE APPLE PARK WAY
CITY: CUPERTINO
STATE: CA
ZIP: 95014
BUSINESS PHONE: (408) 996-1010
MAIL ADDRESS:
STREET 1: ONE APPLE PARK WAY
CITY: CUPERTINO
STATE: CA
ZIP: 95014
FORMER COMPANY:
FORMER CONFORMED NAME: APPLE COMPUTER INC
DATE OF NAME CHANGE: 19970808
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>a10-k20189292018.htm
<DESCRIPTION>10-K
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<!-- Document created using Wdesk 1 -->
<!-- Copyright 2018 Workiva -->
<title>Document</title>
</head>
<body style="font-family:Times New Roman;font-size:10pt;">
<div><a name="s3540C27286EF5B0DA103CC59028B96BE"></a></div><div style="line-height:120%;text-align:center;font-size:10pt;"><div style="padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;"><table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;"><tr><td colspan="1"></td></tr><tr><td style="width:100%;"></td></tr><tr><td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000
TODO: Regexes for Tags
For our purposes, we are only interested in the sections that contain the 10-K information. All the sections, including the 10-K, are contained within the <DOCUMENT> and </DOCUMENT> tags. Each section within the document tags is clearly marked by a <TYPE> tag followed by the name of the section. In the code below, write three regular expressions:

- A regex to find the <DOCUMENT> tag
- A regex to find the </DOCUMENT> tag
- A regex to find all the sections marked by the <TYPE> tag
# import re module
import re
# Write regexes
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
type_pattern = re.compile(r'<TYPE>[^\n]+')
TODO: Create Lists with Span Indices
Now that you have the regexes, use the .finditer() method to match them against raw_10k. In the code below, create 3 lists:

- A list that holds the .end() index of each match of doc_start_pattern
- A list that holds the .start() index of each match of doc_end_pattern
- A list that holds the name of the section from each match of type_pattern
# Create 3 lists with the span indices for each regex
doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]
doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]
# print(doc_end_is)
# print(doc_types)
TODO: Create a Dictionary for the 10-K
In the code below, create a dictionary with the key 10-K and, as its value, the contents of the 10-K section found above. To do this, create a loop that goes through all the sections found above and, if the section type is 10-K, saves it to the dictionary. Use the indices in doc_start_is and doc_end_is to slice the raw_10k string.
document = {}
# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start_i, doc_end_i in zip(doc_types, doc_start_is, doc_end_is):
if doc_type == '10-K':
document[doc_type] = raw_10k[doc_start_i:doc_end_i]
# display the document
# document
TODO: Find Item 1A, 7, and 7A
Our task now is to use regular expressions to find the items of interest. The items in this document can be found in four different patterns. For example, Item 1A can be found in any of the following patterns:

- >Item 1A
- >Item&#160;1A
- >Item&nbsp;1A
- ITEM 1A
In the code below write a single regular expression that can match all four patterns for Items 1A, 7, and 7A. Then use the .finditer() method to match the regex to document['10-K']. Finally create a for loop to print the matches.
# Write the regex
regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|7A|7)\.{0,1})|(ITEM\s(1A|7A|7))')
# Use finditer to match the regex
matches = regex.finditer(document['10-K'])
# Write a for loop to print the matches
for match in matches:
print(match)
<_sre.SRE_Match object; span=(38318, 38327), match='>Item 1A.'>
<_sre.SRE_Match object; span=(46148, 46156), match='>Item 7.'>
<_sre.SRE_Match object; span=(47281, 47290), match='>Item 7A.'>
<_sre.SRE_Match object; span=(119131, 119140), match='>Item 1A.'>
<_sre.SRE_Match object; span=(333318, 333326), match='>Item 7.'>
<_sre.SRE_Match object; span=(729984, 729993), match='>Item 7A.'>
If your regex is written correctly, the only matches above should be those for Items 1A, 7, and 7A. You should notice also, that each item is matched twice. This is because each item appears first in the index and then in the corresponding section. We will now have to remove the matches that correspond to the index. We will do this using Pandas in the next section.
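The double-match behavior is easy to reproduce on a toy string. The sketch below (a hypothetical sample in the spirit of the filing, not the actual 10-K text) places each item heading once in an index and once in the body, so .finditer() returns two matches per item:

```python
import re
from collections import Counter

# Hypothetical sample mimicking a filing: each item heading appears
# once in the table of contents and once at the start of its section.
sample = (
    '<a href="#s1">Item 1A. Risk Factors</a>'
    '<a href="#s2">Item 7. Management Discussion</a>'
    ' ... body ... '
    '>Item 1A. Risk Factors ... text ... '
    '>Item 7. Management Discussion ... text ...'
)

pattern = re.compile(r'Item\s(1A|7)\.')
counts = Counter(m.group() for m in pattern.finditer(sample))
print(counts)  # each heading is matched twice: once in the index, once in the body
```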
Remove Matches that Correspond to the Index
We will remove the matches that correspond to the index using a Pandas Dataframe. We will do this in a couple of steps.
TODO: Create a Pandas DataFrame
In the code below create a Pandas dataframe with the following column names: 'item', 'start', 'end'. In the item column save match.group() in lowercase letters, in the start column save match.start(), and in the end column save match.end().
# import pandas
import pandas as pd
# Matches
matches = regex.finditer(document['10-K'])
# Create the dataframe
test_df = pd.DataFrame([(x.group(),x.start(),x.end()) for x in matches])
test_df.columns = ['item','start','end']
test_df['item'] = test_df.item.str.lower()
# Display the dataframe
test_df
| | item | start | end |
|---|---|---|---|
| 0 | >item 1a. | 38318 | 38327 |
| 1 | >item 7. | 46148 | 46156 |
| 2 | >item 7a. | 47281 | 47290 |
| 3 | >item 1a. | 119131 | 119140 |
| 4 | >item 7. | 333318 | 333326 |
| 5 | >item 7a. | 729984 | 729993 |
TODO: Eliminate Unnecessary Characters
As we can see, our dataframe, in particular the item column, contains some unnecessary characters such as > and periods (.). In some cases, it also contains HTML entities for non-breaking spaces, such as &#160; and &nbsp;. In the code below, use the Pandas dataframe method .replace() with the keyword regex=True to remove all whitespace, the above-mentioned entities, the > character, and the periods from our dataframe. We want to do this because we want to use the item column as our dataframe index later on.
# Get rid of unnecessary characters from the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
test_df.replace('\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)
# display the dataframe
test_df
| | item | start | end |
|---|---|---|---|
| 0 | item1a | 38318 | 38327 |
| 1 | item7 | 46148 | 46156 |
| 2 | item7a | 47281 | 47290 |
| 3 | item1a | 119131 | 119140 |
| 4 | item7 | 333318 | 333326 |
| 5 | item7a | 729984 | 729993 |
TODO: Remove Duplicates
Now that we have removed all unnecessary characters from our dataframe, we can go ahead and remove the Item matches that correspond to the index. In the code below use the Pandas dataframe .drop_duplicates() method to keep only the last Item match in the dataframe and drop the rest. Just as a precaution, make sure that the start column is sorted in ascending order before dropping the duplicates.
# Drop duplicates
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'],keep='last')
# Display the dataframe
pos_dat
| | item | start | end |
|---|---|---|---|
| 3 | item1a | 119131 | 119140 |
| 4 | item7 | 333318 | 333326 |
| 5 | item7a | 729984 | 729993 |
TODO: Set Item to Index
In the code below use the Pandas dataframe .set_index() method to set the item column as the index of our dataframe.
# Set item as the dataframe index
pos_dat.set_index('item', inplace =True)
# display the dataframe
pos_dat
| item | start | end |
|---|---|---|
| item1a | 119131 | 119140 |
| item7 | 333318 | 333326 |
| item7a | 729984 | 729993 |
TODO: Get The Financial Information From Each Item
The above dataframe contains the starting and end index of each match for Items 1A, 7, and 7A. In the code below, save all the text from the starting index of item1a till the starting index of item7 into a variable called item_1a_raw. Similarly, save all the text from the starting index of item7 till the starting index of item7a into a variable called item_7_raw. Finally, save all the text from the starting index of item7a till the end of document['10-K'] into a variable called item_7a_raw. You can accomplish all of this by making the correct slices of document['10-K'].
# Get Item 1a
item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item7']]
# Get Item 7
item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]
# Get Item 7a
item_7a_raw = document['10-K'][pos_dat['start'].loc['item7a']:]
TODO: Display Item 1a
Now that we have each item saved into a separate variable we can view them separately. For illustration purposes we will display Item 1a, but the other items will look similar.
# Display Item 1a
item_1a_raw[0:300]
'>Item 1A.</font></div></td><td style="vertical-align:top;"><div style="line-height:120%;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;">Risk Factors</font></div></td></tr></table><div style="line-height:120%;padding-top:8px;text-align'
We can see that the items look pretty messy: they contain HTML tags, Unicode characters, etc. Before we can do proper Natural Language Processing on these items we need to clean them up, which means removing all HTML tags, Unicode characters, and so on. In principle we could do this using regex substitutions, as we learned previously, but this can be rather difficult. Luckily, packages already exist that can do all the cleaning for us, such as BeautifulSoup, which will be the topic of our next lessons.
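To illustrate why the regex-substitution approach is tempting but fragile, here is a minimal sketch (on a hypothetical snippet in the style of the filing's markup, not the actual item text). A naive re.sub can strip well-formed tags, but it mishandles comments, scripts, and stray angle brackets, which is why a real HTML parser is preferred:

```python
import re

# Hypothetical snippet in the style of the filing's markup.
snippet = '>Item 1A.</font></div></td><td><div><font>Risk Factors</font></div></td>'

# Naive cleanup: drop anything between < and >, then collapse whitespace.
# This breaks on comments, <script> bodies, and malformed markup -- a
# parser like BeautifulSoup handles those cases correctly.
text = re.sub(r'<[^>]+>', ' ', snippet)
text = re.sub(r'\s+', ' ', text).strip()
print(text)  # >Item 1A. Risk Factors
```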