Fetching XML Data

If the data you want to fetch is in XML format (an RSS feed, for example), we can use the requests package to fetch it and the beautifulsoup4 package to parse it.

Let’s consider this example "gradebook.xml" file, which we have hosted on the Internet:

<GradeReport>
    <DownloadDate>2018-06-05</DownloadDate>
    <ProfessorId>123</ProfessorId>
    <Students>
        <Student>
            <StudentId>1</StudentId>
            <FinalGrade>76.7</FinalGrade>
        </Student>
        <Student>
            <StudentId>2</StudentId>
            <FinalGrade>85.1</FinalGrade>
        </Student>
        <Student>
            <StudentId>3</StudentId>
            <FinalGrade>50.3</FinalGrade>
        </Student>
        <Student>
            <StudentId>4</StudentId>
            <FinalGrade>89.8</FinalGrade>
        </Student>
        <Student>
            <StudentId>5</StudentId>
            <FinalGrade>97.4</FinalGrade>
        </Student>
        <Student>
            <StudentId>6</StudentId>
            <FinalGrade>75.5</FinalGrade>
        </Student>
        <Student>
            <StudentId>7</StudentId>
            <FinalGrade>87.2</FinalGrade>
        </Student>
        <Student>
            <StudentId>8</StudentId>
            <FinalGrade>88.0</FinalGrade>
        </Student>
        <Student>
            <StudentId>9</StudentId>
            <FinalGrade>93.9</FinalGrade>
        </Student>
        <Student>
            <StudentId>10</StudentId>
            <FinalGrade>92.5</FinalGrade>
        </Student>
    </Students>
</GradeReport>

First we note the URL where the data resides. Then we pass it as a parameter to the get function from the requests package to issue an HTTP GET request (as usual):

import requests

# the URL of some XML data we stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/gradebook.xml"

response = requests.get(request_url)
print(type(response))
<class 'requests.models.Response'>
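The request above assumes the fetch succeeds. As a minimal sketch (the function name fetch_xml_text and the 10-second default timeout are illustrative choices, not part of the original example), we might wrap the request so that a slow server or an HTTP error status surfaces right away instead of handing an error page to the parser:

```python
import requests

def fetch_xml_text(url, timeout=10):
    """Fetch a URL and return the response body as text.

    The function name and the default timeout are illustrative assumptions.
    """
    # the timeout stops the request from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text

# example usage (requires network access):
# xml_text = fetch_xml_text(request_url)
```

The returned string can then be used wherever response.text would otherwise be used.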

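Then we pass the response text (an XML-formatted string) to the BeautifulSoup class constructor, along with the name of the parser to use. Here we use Python's built-in "html.parser":

from bs4 import BeautifulSoup

# naming the parser explicitly avoids a warning about bs4 guessing one:
soup = BeautifulSoup(response.text, "html.parser")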
type(soup)
bs4.BeautifulSoup

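The soup object represents the parsed document as a tree of tags that we can search and navigate. Note that HTML parsers convert tag names to lowercase, which is why we search for "student" rather than "Student".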
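To see the lowercasing in action without a network request, here is a small sketch that parses a shortened, inline copy of the gradebook data. The xml_text string and the "teacher" tag name are illustrative assumptions; "html.parser" is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# a shortened, inline copy of the gradebook data (illustrative only)
xml_text = """
<GradeReport>
    <Students>
        <Student>
            <StudentId>1</StudentId>
            <FinalGrade>76.7</FinalGrade>
        </Student>
    </Students>
</GradeReport>
"""

soup = BeautifulSoup(xml_text, "html.parser")

# find_all(True) matches every tag in the document
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# find returns the first matching tag, or None when nothing matches
first_student = soup.find("student")
print(first_student.studentid.text)
print(soup.find("teacher"))
```

Every name in tag_names comes back in lowercase, and the search for a tag that does not exist returns None rather than raising an error, so it is worth checking for None before reading attributes from the result of find.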

We can invoke a find or find_all method on the soup object to find elements or tags based on their names or other attributes. For example, finding all the student tags in this structure:

students = soup.find_all("student")
print(type(students))
print(len(students))
<class 'bs4.element.ResultSet'>
10
# examining the first item for reference:
print(type(students[0]))
students[0]
<class 'bs4.element.Tag'>
<student>
<studentid>1</studentid>
<finalgrade>76.7</finalgrade>
</student>
# looping through all the items:
for student in students:
    print("-----------")
    print(type(student))
    student_id = student.studentid.text
    final_grade = student.finalgrade.text
    print(student_id, final_grade)
-----------
<class 'bs4.element.Tag'>
1 76.7
-----------
<class 'bs4.element.Tag'>
2 85.1
-----------
<class 'bs4.element.Tag'>
3 50.3
-----------
<class 'bs4.element.Tag'>
4 89.8
-----------
<class 'bs4.element.Tag'>
5 97.4
-----------
<class 'bs4.element.Tag'>
6 75.5
-----------
<class 'bs4.element.Tag'>
7 87.2
-----------
<class 'bs4.element.Tag'>
8 88.0
-----------
<class 'bs4.element.Tag'>
9 93.9
-----------
<class 'bs4.element.Tag'>
10 92.5
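The text attribute always returns a string, so the values need converting before we can do any arithmetic with them. As a minimal sketch, assuming a shortened inline copy of the data (first three students only) and illustrative variable names such as records and average_grade, we might collect the students into a list of dictionaries and compute the average grade:

```python
from bs4 import BeautifulSoup

# a shortened copy of the gradebook data (first three students only)
xml_text = """
<Students>
    <Student><StudentId>1</StudentId><FinalGrade>76.7</FinalGrade></Student>
    <Student><StudentId>2</StudentId><FinalGrade>85.1</FinalGrade></Student>
    <Student><StudentId>3</StudentId><FinalGrade>50.3</FinalGrade></Student>
</Students>
"""

soup = BeautifulSoup(xml_text, "html.parser")

records = []
for student in soup.find_all("student"):
    records.append({
        # convert the tag text (strings) to numeric types
        "student_id": int(student.studentid.text),
        "final_grade": float(student.finalgrade.text),
    })

average_grade = sum(r["final_grade"] for r in records) / len(records)
print(records)
print(round(average_grade, 2))
```

A list of dictionaries like this is a convenient shape for sorting, filtering, or later loading into a table.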