import requests
# the URL of some HTML data or web page stored online:
= "https://raw.githubusercontent.com/prof-rossetti/intro-software-dev-python-book/main/docs/data/my_lists.html"
request_url
= requests.get(request_url)
response print(type(response))
29 Fetching HTML Data (i.e. “Web Scraping”)
If the data we want to fetch is in HTML format, like most web pages are, we can use the requests
package to fetch it, and the beautifulsoup4
package to process it.
Before moving on to process HTML formatted data, it will be important to first review HTML format, using these resources from W3 Schools:
29.1 HTML Lists
Let’s consider this "my_lists.html" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML List Parsing Exercise</title>
</head>
<body>
<h1>HTML List Parsing Exercise</h1>
<p>This is an HTML page.</p>
<h2>Favorite Ice cream Flavors</h2>
<ol id="my-fav-flavors">
<li>Vanilla Bean</li>
<li>Chocolate</li>
<li>Strawberry</li>
</ol>
<h2>Skills</h2>
<ul id="my-skills">
<li class="skill">HTML</li>
<li class="skill">CSS</li>
<li class="skill">JavaScript</li>
<li class="skill">Python</li>
</ul>
</body>
</html>
First we note the URL of where the data or webpage resides. Then we pass that as a parameter to the get
function from the requests
package, to issue an HTTP GET request (as usual):
Then we pass the response text (an HTML formatted string) to the BeautifulSoup
class constructor.
from bs4 import BeautifulSoup
= BeautifulSoup(response.text)
soup type(soup)
29.2 Finding Elements
The resulting soup object is able to intelligently process the data. We can use the soup’s finder methods to search for specific data elements, called “tags”, based on their names or other attributes. If we want to return the first matching element, we use the find
method, whereas if we want to get all matching elements, we use the find_all
method.
29.2.1 Finding Elements by Identifier
Since the example HTML contains an ordered list (ol
element) with a unique identifier of "my-fav-flavors", we can use the following code to access it:
= soup.find("ol", id="my-fav-flavors")
ul print(type(ul))
ul
Getting all child <li>
elements from that list:
= ul.find_all("li")
flavors print(type(flavors))
print(len(flavors))
flavors
Looping through the items:
for li in flavors:
print("-----------")
print(type(li))
print(li.text)
29.2.2 Finding Elements by Class
In that first example, we accessed an item based on its unique identifier, but in this example we will access a number of items by their class. In HTML, only one element can have a given id
, but many elements can be members of the same class
.
Since the example HTML contains an unordered list (ul
element) of skills, where each list item shares the same class of "skill", we can use the following code to access the list items directly:
# get all <li> elements that have a given class of "skill"
= soup.find_all("li", "skill")
skills print(type(skills))
print(len(skills))
skills
Looping through the results:
for li in skills:
print("-----------")
print(type(li))
print(li.text)
29.3 HTML Tables
Let’s consider this "my_tables.html" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML Table Parsing Exercise</title>
</head>
<body>
<h1>HTML Table Parsing Exercise</h1>
<p>This is an HTML page.</p>
<h2>Products</h2>
<table id="products">
<tr>
<th>Id</th>
<th>Name</th>
<th>Price</th>
</tr>
<tr>
<td>1</td>
<td>Chocolate Sandwich Cookies</td>
<td>3.50</td>
</tr>
<tr>
<td>2</td>
<td>All-Seasons Salt</td>
<td>4.99</td>
</tr>
<tr>
<td>3</td>
<td>Robust Golden Unsweetened Oolong Tea</td>
<td>2.49</td>
</tr>
</table>
</body>
</html>
We repeat the process of fetching this data, as previously exemplified:
import requests
from bs4 import BeautifulSoup
# the URL of some HTML data or web page stored online:
= "https://raw.githubusercontent.com/prof-rossetti/intro-software-dev-python-book/main/docs/data/my_tables.html"
request_url
= requests.get(request_url)
response
= BeautifulSoup(response.text)
soup type(soup)
Since the example HTML contains a table
element with a unique identifier of "products", we can use the following code to access it:
= soup.find("table", id="products")
table print(type(ul))
table
Getting all child rows (tr
elements) from that table:
= table.find_all("tr")
rows print(type(rows))
print(len(rows))
rows
This gets us a list of the rows, where the first is the header row. We can then loop through the rows, ignoring the header row:
for tr in rows:
= tr.find_all("td") # skip header row, which contains <th> elements instead
cells if any(cells):
print("-----------")
# makes assumptions about the order of the cells:
= cells[0].text
product_id = cells[1].text
product_name = cells[2].text
product_price print(product_id, product_name, product_price)