29  Fetching HTML Data (i.e. “Web Scraping”)

If the data we want to fetch is in HTML format, as most web pages are, we can use the requests package to fetch it and the beautifulsoup4 package to process it.

Before processing HTML-formatted data, it is important to first review the HTML format, for example using the tutorials from W3Schools.

29.1 HTML Lists

Let’s consider this "my_lists.html" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>HTML List Parsing Exercise</title>
    </head>
    <body>
        <h1>HTML List Parsing Exercise</h1>

        <p>This is an HTML page.</p>

        <h2>Favorite Ice cream Flavors</h2>
        <ol id="my-fav-flavors">
            <li>Vanilla Bean</li>
            <li>Chocolate</li>
            <li>Strawberry</li>
        </ol>

        <h2>Skills</h2>
        <ul id="my-skills">
            <li class="skill">HTML</li>
            <li class="skill">CSS</li>
            <li class="skill">JavaScript</li>
            <li class="skill">Python</li>
        </ul>
    </body>
</html>

First we note the URL where the data or web page resides. Then we pass it as an argument to the get function from the requests package, which issues an HTTP GET request (as usual):

import requests

# the URL of some HTML data or web page stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/intro-software-dev-python-book/main/docs/data/my_lists.html"

response = requests.get(request_url)
print(type(response))
<class 'requests.models.Response'>
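Before parsing, it can help to confirm that the request actually succeeded, since an error page is usually HTML as well. The following is a minimal sketch rather than part of the original example: the fetch_html helper name and the ten-second timeout are illustrative assumptions.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of the page at `url` as text (hypothetical helper)."""
    # the timeout (an assumed value) keeps the call from waiting indefinitely
    response = requests.get(url, timeout=timeout)
    # raises requests.exceptions.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

Calling fetch_html(request_url) with the URL above should return the same HTML string that response.text holds.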

Then we pass the response text (an HTML-formatted string) to the BeautifulSoup class constructor, along with the name of the parser to use:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
type(soup)
bs4.BeautifulSoup
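To experiment without a network request, the same constructor can be given any HTML string. The sketch below assumes a shortened inline copy of the example page, and it names the parser explicitly as the second argument ("html.parser" ships with Python's standard library), which keeps the behavior consistent across machines.

```python
from bs4 import BeautifulSoup

# a small inline HTML string (an abbreviated stand-in for the hosted file):
html = """
<html>
    <head><title>HTML List Parsing Exercise</title></head>
    <body><p>This is an HTML page.</p></body>
</html>
"""

# "html.parser" is the parser included in Python's standard library
soup = BeautifulSoup(html, "html.parser")

# tag names can be used as attributes to reach the first matching element
print(soup.title.text)
print(soup.p.text)
```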

29.2 Finding Elements

The resulting soup object represents the parsed document as a tree of elements, called “tags”. We can use the soup’s finder methods to search for specific tags based on their names or other attributes. The find method returns the first matching element, whereas the find_all method returns all matching elements.
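The difference between the two finder methods matters most when nothing matches. As a small sketch on an assumed inline HTML string (the variable names are illustrative), find returns None in that case, while find_all returns an empty result set:

```python
from bs4 import BeautifulSoup

html = "<ul><li>HTML</li><li>CSS</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")         # the first matching Tag
all_items = soup.find_all("li")      # a ResultSet of every matching Tag
missing = soup.find("table")         # None, since there is no <table>
no_matches = soup.find_all("table")  # an empty ResultSet

print(first_item.text)
print(len(all_items))
print(missing is None)
print(len(no_matches))
```

Checking for None before using the result of find avoids an AttributeError when the page does not contain the expected element.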

29.2.1 Finding Elements by Identifier

Since the example HTML contains an ordered list (ol element) with a unique identifier of "my-fav-flavors", we can use the following code to access it:

ul = soup.find("ol", id="my-fav-flavors")
print(type(ul))
ul
<class 'bs4.element.Tag'>
<ol id="my-fav-flavors">
<li>Vanilla Bean</li>
<li>Chocolate</li>
<li>Strawberry</li>
</ol>

Getting all child <li> elements from that list:

flavors = ul.find_all("li")
print(type(flavors))
print(len(flavors))
flavors
<class 'bs4.element.ResultSet'>
3
[<li>Vanilla Bean</li>, <li>Chocolate</li>, <li>Strawberry</li>]

Looping through the items:

for li in flavors:
    print("-----------")
    print(type(li))
    print(li.text)
-----------
<class 'bs4.element.Tag'>
Vanilla Bean
-----------
<class 'bs4.element.Tag'>
Chocolate
-----------
<class 'bs4.element.Tag'>
Strawberry
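When the goal is to keep the values rather than print them, the loop can be written as a list comprehension. This is a sketch on an inline copy of the list, and the flavors_list and flavor_names names are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<ol id="my-fav-flavors">
    <li>Vanilla Bean</li>
    <li>Chocolate</li>
    <li>Strawberry</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")
flavors_list = soup.find("ol", id="my-fav-flavors")

# get_text(strip=True) trims surrounding whitespace from each item's text
flavor_names = [li.get_text(strip=True) for li in flavors_list.find_all("li")]
print(flavor_names)
```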

29.2.2 Finding Elements by Class

In the first example, we accessed an element by its unique identifier. In this example, we will access a number of elements by their class. In HTML, an id should be unique within a page, but many elements can be members of the same class.

Since the example HTML contains an unordered list (ul element) of skills, where each list item shares the same class of "skill", we can use the following code to access the list items directly:

# get all <li> elements that have a given class of "skill"
skills = soup.find_all("li", "skill")
print(type(skills))
print(len(skills))
skills
<class 'bs4.element.ResultSet'>
4
[<li class="skill">HTML</li>,
 <li class="skill">CSS</li>,
 <li class="skill">JavaScript</li>,
 <li class="skill">Python</li>]

Looping through the results:

for li in skills:
    print("-----------")
    print(type(li))
    print(li.text)
-----------
<class 'bs4.element.Tag'>
HTML
-----------
<class 'bs4.element.Tag'>
CSS
-----------
<class 'bs4.element.Tag'>
JavaScript
-----------
<class 'bs4.element.Tag'>
Python
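The class search can also be spelled out more explicitly. The sketch below, run on an assumed inline HTML string, shows two equivalent alternatives: the class_ keyword argument, and a CSS selector passed to the select method.

```python
from bs4 import BeautifulSoup

html = """
<ul id="my-skills">
    <li class="skill">HTML</li>
    <li class="skill">CSS</li>
    <li>Not a skill</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "class" is a reserved word in Python, so the keyword is spelled class_
by_keyword = soup.find_all("li", class_="skill")

# the same search written as a CSS selector
by_selector = soup.select("li.skill")

print([li.text for li in by_keyword])
print([li.text for li in by_selector])
```

Both searches skip the list item that has no class.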

29.3 HTML Tables

Let’s consider this "my_tables.html" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>HTML Table Parsing Exercise</title>
    </head>
    <body>
        <h1>HTML Table Parsing Exercise</h1>

        <p>This is an HTML page.</p>

        <h2>Products</h2>

        <table id="products">
            <tr>
                <th>Id</th>
                <th>Name</th>
                <th>Price</th>
            </tr>
            <tr>
                <td>1</td>
                <td>Chocolate Sandwich Cookies</td>
                <td>3.50</td>
            </tr>
            <tr>
                <td>2</td>
                <td>All-Seasons Salt</td>
                <td>4.99</td>
            </tr>
            <tr>
                <td>3</td>
                <td>Robust Golden Unsweetened Oolong Tea</td>
                <td>2.49</td>
            </tr>
        </table>
    </body>
</html>

We repeat the process of fetching this data, as shown earlier:

import requests
from bs4 import BeautifulSoup

# the URL of some HTML data or web page stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/intro-software-dev-python-book/main/docs/data/my_tables.html"

response = requests.get(request_url)

soup = BeautifulSoup(response.text, "html.parser")
type(soup)
bs4.BeautifulSoup

Since the example HTML contains a table element with a unique identifier of "products", we can use the following code to access it:

table = soup.find("table", id="products")
print(type(table))
table
<class 'bs4.element.Tag'>
<table id="products">
<tr>
<th>Id</th>
<th>Name</th>
<th>Price</th>
</tr>
<tr>
<td>1</td>
<td>Chocolate Sandwich Cookies</td>
<td>3.50</td>
</tr>
<tr>
<td>2</td>
<td>All-Seasons Salt</td>
<td>4.99</td>
</tr>
<tr>
<td>3</td>
<td>Robust Golden Unsweetened Oolong Tea</td>
<td>2.49</td>
</tr>
</table>

Getting all child rows (tr elements) from that table:

rows = table.find_all("tr")
print(type(rows))
print(len(rows))
rows
<class 'bs4.element.ResultSet'>
4
[<tr>
 <th>Id</th>
 <th>Name</th>
 <th>Price</th>
 </tr>,
 <tr>
 <td>1</td>
 <td>Chocolate Sandwich Cookies</td>
 <td>3.50</td>
 </tr>,
 <tr>
 <td>2</td>
 <td>All-Seasons Salt</td>
 <td>4.99</td>
 </tr>,
 <tr>
 <td>3</td>
 <td>Robust Golden Unsweetened Oolong Tea</td>
 <td>2.49</td>
 </tr>]

This gets us a list of the rows, where the first is the header row. We can then loop through the rows, ignoring the header row:

for tr in rows:
    cells = tr.find_all("td")
    # the header row contains <th> elements instead of <td>, so skip it:
    if cells:
        print("-----------")
        # makes assumptions about the order of the cells:
        product_id = cells[0].text
        product_name = cells[1].text
        product_price = cells[2].text
        print(product_id, product_name, product_price)
-----------
1 Chocolate Sandwich Cookies 3.50
-----------
2 All-Seasons Salt 4.99
-----------
3 Robust Golden Unsweetened Oolong Tea 2.49
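Rather than relying on the order of the cells, the header row can supply the keys for each product. The following is a sketch on an abbreviated inline copy of the table. The products list and the explicit type conversions are illustrative choices, not part of the original example.

```python
from bs4 import BeautifulSoup

html = """
<table id="products">
    <tr><th>Id</th><th>Name</th><th>Price</th></tr>
    <tr><td>1</td><td>Chocolate Sandwich Cookies</td><td>3.50</td></tr>
    <tr><td>2</td><td>All-Seasons Salt</td><td>4.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="products").find_all("tr")

# use the header row's text as dictionary keys
headers = [th.text for th in rows[0].find_all("th")]

products = []
for tr in rows[1:]:
    values = [td.text for td in tr.find_all("td")]
    product = dict(zip(headers, values))
    # cell text is always a string, so convert numeric fields explicitly
    product["Id"] = int(product["Id"])
    product["Price"] = float(product["Price"])
    products.append(product)

print(products[0])
```

A list of dictionaries like this is convenient to sort, filter, or pass to other tools for further analysis.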