If the data we want to fetch is in HTML format, like most web pages are, we can use the requests package to fetch it, and the beautifulsoup4 package to process it.
Before moving on to process HTML formatted data, it will be important to first review HTML format, using these resources from W3 Schools:
Let’s consider this "my_lists.html" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:
<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><title>HTML List Parsing Exercise</title></head><body><h1>HTML List Parsing Exercise</h1><p>This is an HTML page.</p><h2>Favorite Ice cream Flavors</h2><ol id="my-fav-flavors"><li>Vanilla Bean</li><li>Chocolate</li><li>Strawberry</li></ol><h2>Skills</h2><ul id="my-skills"><li class="skill">HTML</li><li class="skill">CSS</li><li class="skill">JavaScript</li><li class="skill">Python</li></ul></body></html>
First we note the URL of where the data or webpage resides. Then we pass that as a parameter to the get function from the requests package, to issue an HTTP GET request (as usual):
import requests# the URL of some HTML data or web page stored online:request_url ="https://raw.githubusercontent.com/prof-rossetti/intro-software-dev-python-book/main/docs/data/my_lists.html"response = requests.get(request_url)print(type(response))
<class 'requests.models.Response'>
Then we pass the response text (an HTML formatted string) to the BeautifulSoup class constructor.
from bs4 import BeautifulSoupsoup = BeautifulSoup(response.text)type(soup)
bs4.BeautifulSoup
29.2 Finding Elements
The resulting soup object is able to intelligently process the data. We can use the soup’s finder methods to search for specific data elements, called “tags”, based on their names or other attributes. If we want to return the first matching element, we use the find method, whereas if we want to get all matching elements, we use the find_all method.
29.2.1 Finding Elements by Identifier
Since the example HTML contains an ordered list (ol element) with a unique identifier of "my-fav-flavors", we can use the following code to access it:
ul = soup.find("ol", id="my-fav-flavors")print(type(ul))ul
In that first example, we accessed an item based on its unique identifier, but in this example we will access a number of items by their class. In HTML, only one element can have a given id, but many elements can be members of the same class.
Since the example HTML contains an unordered list (ul element) of skills, where each list item shares the same class of "skill", we can use the following code to access the list items directly:
# get all <li> elements that have a given class of "skill"skills = soup.find_all("li", "skill")print(type(skills))print(len(skills))skills
Let’s consider this "my_tables.html" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:
<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><title>HTML Table Parsing Exercise</title></head><body><h1>HTML Table Parsing Exercise</h1><p>This is an HTML page.</p><h2>Products</h2><table id="products"><tr><th>Id</th><th>Name</th><th>Price</th></tr><tr><td>1</td><td>Chocolate Sandwich Cookies</td><td>3.50</td></tr><tr><td>2</td><td>All-Seasons Salt</td><td>4.99</td></tr><tr><td>3</td><td>Robust Golden Unsweetened Oolong Tea</td><td>2.49</td></tr></table></body></html>
We repeat the process of fetching this data, as previously exemplified:
import requestsfrom bs4 import BeautifulSoup# the URL of some HTML data or web page stored online:request_url ="https://raw.githubusercontent.com/prof-rossetti/intro-software-dev-python-book/main/docs/data/my_tables.html"response = requests.get(request_url)soup = BeautifulSoup(response.text)type(soup)
bs4.BeautifulSoup
Since the example HTML contains a table element with a unique identifier of "products", we can use the following code to access it:
This gets us a list of the rows, where the first is the header row. We can then loop through the rows, ignoring the header row:
for tr in rows: cells = tr.find_all("td") # skip header row, which contains <th> elements insteadifany(cells):print("-----------")# makes assumptions about the order of the cells: product_id = cells[0].text product_name = cells[1].text product_price = cells[2].textprint(product_id, product_name, product_price)
-----------
1 Chocolate Sandwich Cookies 3.50
-----------
2 All-Seasons Salt 4.99
-----------
3 Robust Golden Unsweetened Oolong Tea 2.49