AI For Trading: BeautifulSoup (90)

Intro

BeautifulSoup Potential Problems

  • BeautifulSoup works best when you have perfectly formatted HTML

BeautifulSoup documentation (Chinese version)

Parsers

In the following lessons you will learn how to use the BeautifulSoup library to pull data out of HTML and XML files. BeautifulSoup uses a parser to transform files into a tree of Python objects that can be easily searched. So, before we start learning how to use BeautifulSoup, let's take a quick look at parsers.

In BeautifulSoup, the parser is a piece of software whose primary job is to build a data structure in the form of a hierarchical tree that gives a structural representation of the HTML or XML file. In other words, the parser divides these complex files into simpler parts while keeping track of how these parts are related to each other. BeautifulSoup supports a number of parsers, but throughout these lessons we will only be using the lxml parser. The lxml parser can be used to parse both HTML and XML files and has the advantage of being very fast. In order to use the lxml parser, you must have lxml installed. You can install the lxml parser by using the following command in your terminal:

$ pip install lxml

If you're working with perfectly formatted HTML or XML files (i.e. files that don't contain any missing information or mistakes) then, in the majority of cases, your choice of parser shouldn't really matter. However, if the files you are working with have missing information or mistakes, then your choice of parser will matter, because each parser has different rules for dealing with missing information or mistakes. Consequently, in these cases, different parsers will create different parse trees for the same document. You can take a look at the BeautifulSoup documentation for details on the differences between parsers.
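As a quick, hedged illustration of this (using a made-up HTML fragment rather than the sample.html file from this lesson), the sketch below parses the same piece of malformed HTML with two different parsers; because each parser repairs broken markup according to its own rules, the two trees it prints may not be identical.

# Import BeautifulSoup
from bs4 import BeautifulSoup

# A small, deliberately malformed HTML fragment: the <b> tag is closed after its parent <a> tag
broken_html = '<a>Some link text<b>bold</a></b>'

# Parse the same fragment with the lxml parser and with Python's built-in html.parser;
# the resulting parse trees may differ because each parser has its own repair rules
print(BeautifulSoup(broken_html, 'lxml').prettify())
print(BeautifulSoup(broken_html, 'html.parser').prettify())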

Searching The Parse Tree

BeautifulSoup provides a number of methods for searching the parse tree, but we will only cover the .find_all() method in this lesson. You can learn about other search methods in the BeautifulSoup Documentation.
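A closely related method worth knowing about is .find(), which accepts the same filters as .find_all() but returns only the first matching tag (or None if nothing matches). Below is a minimal sketch that reuses the sample.html file from this lesson:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# .find() returns only the first h2 tag (or None if there is no h2 tag)
print(page_content.find('h2'))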

The .find_all(filter) method will search an entire document for the given filter. The filter can be a string containing the HTML or XML tag name, a tag attribute, or even a regular expression. In this notebook we will see examples of these cases.

So let's begin by using the .find_all() method to find all <h2> tags in our sample.html file. To do this, we will pass the string 'h2' to the .find_all() method as shown in the code below:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Find all the h2 tags
h2_list = page_content.find_all('h2')

# Print the h2_list
print(h2_list)
[<h2 class="h2style" id="hub">Student Hub</h2>, <h2 class="h2style" id="know">Knowledge</h2>]

As we can see, the .find_all() method returns a list with all the <h2> tags. Since lists are iterables, we can loop through the h2_list to print each tag, as shown below:

# Print each tag in the h2_list
for tag in h2_list:
    print(tag)
<h2 class="h2style" id="hub">Student Hub</h2>
<h2 class="h2style" id="know">Knowledge</h2>
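Each element of the list is a Tag object, so besides printing the whole tag we can also pull out just its text. A small sketch using the .get_text() method (which we will also use later in this lesson):

# Print only the text contained in each h2 tag
for tag in h2_list:
    print(tag.get_text())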

TODO: Find All The <p> Tags

In the cell below, use the .find_all() method to find all the <p> tags in the sample.html file. Start by opening the sample.html file and passing the open filehandle to the BeautifulSoup constructor using the lxml parser. Save the BeautifulSoup object returned by the constructor in a variable called page_content. Then use the .find_all() method to find all the <p> tags from the page_content object. Save the list returned by the .find_all() method in a variable called p_list. Finally, loop through the list and print each tag in the list. Since the <p> tags contain subtags, use the .prettify() method to improve readability.

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Find all the p tags
p_list = page_content.find_all('p')

# Print each tag in the p_list
for p in p_list:
    print(p.prettify())
<p>
 Student Hub is our real time collaboration platform where you can work with peers and mentors. You will find Community rooms with other students and alumni.
</p>

<p>
 Search or ask questions in
 <a href="https://knowledge.udacity.com/">
  Knowledge
 </a>
</p>

<p>
 Good luck and we hope you enjoy the course
</p>

Searching For Multiple Tags

We can also search for more than one tag at a time by passing a list to the .find_all() method. Let's see an example.

Let's suppose we wanted to search for all the <h2> and <p> tags in our sample.html file. Instead of using two statements (one for each tag):

h2_list = page_content.find_all('h2')
p_list = page_content.find_all('p')

we can just pass the list ['h2', 'p'] to the .find_all() method, as shown in the code below:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print all the h2 and p tags
for tag in page_content.find_all(['h2', 'p']):
    print(tag.prettify())
<h2 class="h2style" id="hub">
 Student Hub
</h2>

<p>
 Student Hub is our real time collaboration platform where you can work with peers and mentors. You will find Community rooms with other students and alumni.
</p>

<h2 class="h2style" id="know">
 Knowledge
</h2>

<p>
 Search or ask questions in
 <a href="https://knowledge.udacity.com/">
  Knowledge
 </a>
</p>

<p>
 Good luck and we hope you enjoy the course
</p>

We can see that we get all the <h2> and <p> tags in our file.

TODO: Find All The <a> and <link> Tags

In the cell below, use the .find_all() method to find all the <a> and <link> tags in the sample.html file. Start by opening the sample.html file and passing the open filehandle to the BeautifulSoup constructor using the lxml parser. Save the BeautifulSoup object returned by the constructor in a variable called page_content. Then find all the <a> and <link> tags from the page_content object by passing a list to the .find_all() method. Loop through the list and print each tag in the list. Use the .prettify() method to improve readability.

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print all the a and link tags
for tag in page_content.find_all(['a', 'link']):
    print(tag.prettify())
<link href="./teststyle.css" rel="stylesheet"/>

<a href="https://knowledge.udacity.com/">
 Knowledge
</a>

Searching For Tags With Particular Attributes

The .find_all() method also allows us to pass some arguments, such as the attribute of a tag, so that we can search the entire document for the exact tag we want. For example, in our sample.html file, we have two <h2> tags:

  1. <h2 class="h2style" id="hub">Student Hub</h2>

  2. <h2 class="h2style" id="know">Knowledge</h2>

We can see that the first <h2> tag has the attribute id="hub", while the second <h2> tag has the attribute id="know". Let's suppose we only wanted to search our sample.html document for the <h2> tag that had id="know". To do this, we will add the id attribute to the .find_all() method as shown below:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Find the h2 tag with id = know
h2_know = page_content.find_all('h2', id = 'know')

# Print each item in the h2_know
for tag in h2_know:
    print(tag)
<h2 class="h2style" id="know">Knowledge</h2>

We can see that the list returned by the .find_all() method only has one element, and it corresponds to the <h2> tag that has id="know".
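As a side note, the same search can also be written by passing a dictionary of attributes to the attrs argument of .find_all(); this is just an alternative spelling and should return the same single tag:

# An equivalent search using the attrs argument
h2_know = page_content.find_all('h2', attrs = {'id': 'know'})

for tag in h2_know:
    print(tag)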

TODO: Find All The <h1> Tags With The Attribute id='intro'

In the cell below, use the .find_all() method to find all the <h1> tags in the sample.html file that have the attribute id="intro". Start by opening the sample.html file and passing the open filehandle to the BeautifulSoup constructor using the lxml parser. Save the BeautifulSoup object returned by the constructor in a variable called page_content. Then find all the <h1> tags that have the attribute id="intro" from the page_content object. Loop through the list and print each tag in the list.

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print all the h1 tags with id = intro
h1_intro = page_content.find_all('h1', id = 'intro')

for tag in h1_intro:
    print(tag)
<h1 id="intro">Get Help From Peers and Mentors</h1>

Searching For Attributes Directly

The .find_all() method also allows us to search for tag attributes directly. For example, we can search for all the tags that have the attribute id="intro" by only passing this attribute to the .find_all() method, as shown below:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print all the tags with id = intro
for tag in page_content.find_all(id = 'intro'):
    print(tag)
<h1 id="intro">Get Help From Peers and Mentors</h1>

We can see that we only get one tag, since the <h1> tag is the only tag in our document that has the attribute id="intro".
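You can also pass True as the value of an attribute filter to match every tag that defines that attribute, whatever its value. A minimal sketch (in sample.html this should match the <h1> tag and both <h2> tags, since all three define an id attribute):

# Print all the tags that have an id attribute, regardless of its value
for tag in page_content.find_all(id = True):
    print(tag)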

TODO: Find All Tags With Attribute id='hub'

In the cell below, use the .find_all() method to find all the tags in the sample.html file that have the attribute id="hub". Start by opening the sample.html file and passing the open filehandle to the BeautifulSoup constructor using the lxml parser. Save the BeautifulSoup object returned by the constructor in a variable called page_content. Then find all the tags that have the attribute id="hub" from the page_content object. Loop through the list and print each tag in the list.

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print all the tags with id = hub
for tag in page_content.find_all(id = 'hub'):
    print(tag)
<h2 class="h2style" id="hub">Student Hub</h2>

Searching by class

Let's suppose we wanted to find all the tags that had the attribute class="h2style". Unfortunately, in this case, we can't simply pass this attribute to the .find_all() method. The reason is that the attribute name, class, is a reserved keyword in Python. Therefore, using class as a keyword argument in the .find_all() method will give you a syntax error. To get around this problem, BeautifulSoup has implemented the keyword argument class_ (notice the underscore at the end) that can be used to search for the class attribute. Let's see how this works.

In the code below, we will use the .find_all() method to search for all the tags in our sample.html file that have the attribute class="h2style":

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print the tags that have the attribute class = 'h2style'
for tag in page_content.find_all(class_ = 'h2style'):
    print(tag)

Print:

<h2 class="h2style" id="hub">Student Hub</h2>
<h2 class="h2style" id="know">Knowledge</h2>
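An alternative worth mentioning is the .select() method, which accepts CSS selectors, so the class can be written with the usual dot notation instead of the class_ keyword. A minimal sketch that should return the same two tags:

# Select all the tags with class 'h2style' using a CSS selector
for tag in page_content.select('.h2style'):
    print(tag)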

Searching With Regular Expressions

We can also pass a compiled regular expression to the .find_all() method. In that case, BeautifulSoup filters the tags by matching their names against the regular expression. In the code below, we use this to print the names of all the tags whose names contain the letter i:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Import the re module
import re 

# Open the HTML file and create a BeautifulSoup Object
with open('./sample.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Print only the tag names of all the tags whose name contain the letter i
for tag in page_content.find_all(re.compile(r'i')):
    print(tag.name)

PRINT:

title
link
div
div
div

Children Tags

Besides searching the parse tree, we can navigate it directly by accessing tags as attributes of the BeautifulSoup object. In the code below, we chain .head.title to reach the <title> tag inside the <head> tag of the sample2.html file, and print its text with the .get_text() method:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and print the text in the title tag within the head tag
with open('./sample2.html') as f:
    print(BeautifulSoup(f, 'lxml').head.title.get_text())

A tag's children are also available as a list through its .contents attribute. In the code below, we use .contents to look at the children of the <head> tag and count them:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample2.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Access the head tag
page_head = page_content.head

# Print the children of the head tag
print(page_head.contents)

# Print the number of children of the head tag
print('\nThe <head> tag contains {} children'.format(len(page_head.contents)))

PRINT:

[<title>AI For Trading</title>, <meta charset="utf-8"/>, <link href="./teststyle.css" rel="stylesheet"/>, <style>.h2style {background-color: tomato;color: white;padding: 10px;}</style>]

The <head> tag contains 4 children
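Besides the .contents list, a tag also has a .children attribute, an iterator over the same direct children, and a .descendants attribute, which iterates over children, grandchildren, and so on. A small sketch that reuses the page_head variable from above:

# Print the name of each direct child of the head tag
for child in page_head.children:
    print(child.name)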

The Recursive Argument

By default, the .find_all() method searches all of a tag's descendants: its children, its children's children, and so on. If you only want .find_all() to consider a tag's direct children, you can pass the argument recursive = False. In the code below, we search the direct children of the <html> tag for the <title> tag:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the HTML file and create a BeautifulSoup Object
with open('./sample2.html') as f:
    page_content = BeautifulSoup(f, 'lxml')

# Search the html tag's direct children for the title tag
for tag in page_content.html.find_all('title', recursive = False):
    print(tag)
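Since the <title> tag lives inside the <head> tag, it is not a direct child of the <html> tag, so the loop above prints nothing. Searching the <head> tag's direct children instead does find it, as the sketch below shows:

# The title tag is a direct child of the head tag, so this search does find it
for tag in page_content.head.find_all('title', recursive = False):
    print(tag)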

He who acts will often succeed; he who walks will often arrive.