30 Fetching XML Data

If the data we want to fetch is in XML format, such as an RSS feed, we can use the requests package to fetch it and the beautifulsoup4 package to process it.

Let’s consider this example "gradebook.xml" file we have hosted on the Internet:

<GradeReport>
    <DownloadDate>2018-06-05</DownloadDate>
    <ProfessorId>123</ProfessorId>
    <Students>
        <Student>
            <StudentId>1</StudentId>
            <FinalGrade>76.7</FinalGrade>
        </Student>
        <Student>
            <StudentId>2</StudentId>
            <FinalGrade>85.1</FinalGrade>
        </Student>
        <Student>
            <StudentId>3</StudentId>
            <FinalGrade>50.3</FinalGrade>
        </Student>
        <Student>
            <StudentId>4</StudentId>
            <FinalGrade>89.8</FinalGrade>
        </Student>
        <Student>
            <StudentId>5</StudentId>
            <FinalGrade>97.4</FinalGrade>
        </Student>
        <Student>
            <StudentId>6</StudentId>
            <FinalGrade>75.5</FinalGrade>
        </Student>
        <Student>
            <StudentId>7</StudentId>
            <FinalGrade>87.2</FinalGrade>
        </Student>
        <Student>
            <StudentId>8</StudentId>
            <FinalGrade>88.0</FinalGrade>
        </Student>
        <Student>
            <StudentId>9</StudentId>
            <FinalGrade>93.9</FinalGrade>
        </Student>
        <Student>
            <StudentId>10</StudentId>
            <FinalGrade>92.5</FinalGrade>
        </Student>
    </Students>
</GradeReport>

First we note the URL where the data resides. Then we pass that URL to the get function from the requests package to issue an HTTP GET request (as usual):

import requests

# the URL of some XML data we stored online:
request_url = ("https://raw.githubusercontent.com/" +
              "prof-rossetti/intro-software-dev-python-book/main/docs/data/" +
              "gradebook.xml")

response = requests.get(request_url)
print(type(response))

<class 'requests.models.Response'>
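Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, assuming a hypothetical helper named fetch_text (our own function, not part of the requests package) and an arbitrary ten-second timeout. It raises an exception for 4xx and 5xx responses rather than handing an error page to the parser:

```python
import requests

def fetch_text(url, timeout=10):
    """Hypothetical helper: fetch a URL and return the response body as a string."""
    # the timeout value is an assumption; adjust it as needed
    response = requests.get(url, timeout=timeout)
    # raises requests.exceptions.HTTPError for 4xx / 5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, calling fetch_text(request_url) would return the gradebook XML as a string, ready to be parsed.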

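Then we pass the response text (an HTML or XML formatted string) to the BeautifulSoup class constructor, along with the name of a parser. Here we name Python's built-in "html.parser", which converts tag names to lowercase:

from bs4 import BeautifulSoup

# name the parser explicitly, so the results don't depend on which parsers are installed
soup = BeautifulSoup(response.text, "html.parser")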
type(soup)

bs4.BeautifulSoup

30.1 Finding Elements

The resulting soup object represents the document as a tree of nested elements that we can search. We can use the soup’s finder methods to search for specific data elements, called “tags”, based on their names or other attributes. If we want to return the first matching element, we use the find method, whereas if we want to get all matching elements, we use the find_all method. Because the parser treated the document as HTML, the tag names were converted to lowercase, so we search for "student" rather than "Student".

For example, finding all the student tags in this structure:

students = soup.find_all("student")
print(type(students))
print(len(students))

<class 'bs4.element.ResultSet'>
10

Examining the first item for reference:

print(type(students[0]))
students[0]

<class 'bs4.element.Tag'>

<student>
<studentid>1</studentid>
<finalgrade>76.7</finalgrade>
</student>
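To see how find differs from find_all, here is a small sketch that parses an inline sample instead of fetching the file. The sample_xml string and the variable names are assumptions made for illustration; the sample is shaped like the gradebook but contains only two students:

```python
from bs4 import BeautifulSoup

# a small inline sample, shaped like the gradebook (not fetched from the Internet)
sample_xml = """
<Students>
    <Student><StudentId>1</StudentId><FinalGrade>76.7</FinalGrade></Student>
    <Student><StudentId>2</StudentId><FinalGrade>85.1</FinalGrade></Student>
</Students>
"""

sample_soup = BeautifulSoup(sample_xml, "html.parser")

first_student = sample_soup.find("student")     # a single Tag (or None if no match)
all_students = sample_soup.find_all("student")  # a list-like ResultSet of Tags
missing = sample_soup.find("teacher")           # there is no such tag in the sample

print(first_student.studentid.text)  # prints 1
print(len(all_students))             # prints 2
print(missing)                       # prints None
```

Because find returns None when nothing matches, it is worth checking the result before accessing attributes such as text on it.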

Looping through all the items:

for student in students:
    print("-----------")
    print(type(student))

    student_id = student.studentid.text
    final_grade = student.finalgrade.text
    print(student_id, final_grade)

-----------
<class 'bs4.element.Tag'>
1 76.7
-----------
<class 'bs4.element.Tag'>
2 85.1
-----------
<class 'bs4.element.Tag'>
3 50.3
-----------
<class 'bs4.element.Tag'>
4 89.8
-----------
<class 'bs4.element.Tag'>
5 97.4
-----------
<class 'bs4.element.Tag'>
6 75.5
-----------
<class 'bs4.element.Tag'>
7 87.2
-----------
<class 'bs4.element.Tag'>
8 88.0
-----------
<class 'bs4.element.Tag'>
9 93.9
-----------
<class 'bs4.element.Tag'>
10 92.5
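The text of each tag is a string, so it is often convenient to convert the values and collect them into a list of dictionaries for later processing. The following is one possible sketch, using an inline three-student sample; the names sample_xml, records, and top_student are illustrative assumptions:

```python
from bs4 import BeautifulSoup

# inline sample, shaped like the gradebook (not fetched from the Internet)
sample_xml = """
<Students>
    <Student><StudentId>1</StudentId><FinalGrade>76.7</FinalGrade></Student>
    <Student><StudentId>2</StudentId><FinalGrade>85.1</FinalGrade></Student>
    <Student><StudentId>3</StudentId><FinalGrade>50.3</FinalGrade></Student>
</Students>
"""

sample_soup = BeautifulSoup(sample_xml, "html.parser")

records = []
for student in sample_soup.find_all("student"):
    records.append({
        # convert the text values to numeric types
        "student_id": int(student.find("studentid").text),
        "final_grade": float(student.find("finalgrade").text),
    })

# the record with the highest final grade
top_student = max(records, key=lambda record: record["final_grade"])

print(records)
print(top_student)  # prints {'student_id': 2, 'final_grade': 85.1}
```

A list of dictionaries like this can be sorted, filtered, or passed to other tools without referring back to the parsed document.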

Calculating the mean and median grades:

from statistics import mean, median

grades = [float(student.finalgrade.text) for student in students]

print(mean(grades))
print(median(grades))

83.64
87.6