25 Fetching Data from the Internet
So far, we have studied many data processing techniques using example data.
But the real fun comes from working with real-world data.
The Internet is a vast source of data across a variety of domains, including finance.
25.1 Data Formats
In practice, when we fetch data from the Internet, it may come in one of a handful of data formats, including JSON, CSV, XML, and HTML.
JSON represents nested structures, like the lists and dictionaries we have studied in Python.
CSV represents tabular data, like a spreadsheet, with rows and columns.
XML and HTML use structured "tags" to denote the document structure, like a webpage.
Example data structures in each format are provided below.
25.1.1 JSON Data Format
Example JSON:
[
    {"city": "New York", "name": "Yankees", "league": "Major"},
    {"city": "New York", "name": "Mets", "league": "Major"},
    {"city": "Boston", "name": "Red Sox", "league": "Major"},
    {"city": "Washington", "name": "Nationals", "league": "Major"},
    {"city": "New Haven", "name": "Ravens", "league": "Minor"}
]
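To make the connection to Python lists and dictionaries concrete, here is a minimal sketch that parses the example JSON with the built-in json module. Storing the data in a string called teams_json is an assumption made for illustration; in practice the text would come from a file or a web request.

```python
import json

# The example JSON from this section, stored as a string for illustration.
teams_json = """
[
    {"city": "New York", "name": "Yankees", "league": "Major"},
    {"city": "New York", "name": "Mets", "league": "Major"},
    {"city": "Boston", "name": "Red Sox", "league": "Major"},
    {"city": "Washington", "name": "Nationals", "league": "Major"},
    {"city": "New Haven", "name": "Ravens", "league": "Minor"}
]
"""

# json.loads converts JSON text into Python lists and dictionaries.
teams = json.loads(teams_json)

print(type(teams))       # <class 'list'>
print(teams[0]["name"])  # Yankees

# Ordinary list and dictionary techniques now apply.
minor_teams = [team["name"] for team in teams if team["league"] == "Minor"]
print(minor_teams)       # ['Ravens']
```

Once parsed, the data is just a list of dictionaries, so the looping and filtering techniques from earlier chapters apply unchanged.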
25.1.2 CSV Data Format
Example CSV:
city,name,league
New York,Mets,Major
New York,Yankees,Major
Boston,Red Sox,Major
Washington,Nationals,Major
New Haven,Ravens,Minor
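As a small illustration of the rows-and-columns structure, the sketch below reads the example CSV with the built-in csv module. The variable name teams_csv is an assumption for illustration, and this is only one of several ways to read CSV data.

```python
import csv
from io import StringIO

# The example CSV from this section, stored as a string for illustration.
teams_csv = """city,name,league
New York,Mets,Major
New York,Yankees,Major
Boston,Red Sox,Major
Washington,Nationals,Major
New Haven,Ravens,Minor
"""

# DictReader uses the header row for keys, so each data row becomes a dictionary.
reader = csv.DictReader(StringIO(teams_csv))
rows = list(reader)

print(len(rows))          # 5
print(rows[0]["name"])    # Mets
print(rows[-1]["league"]) # Minor
```

Note that the csv module reads every value as a string, so numeric columns would need to be converted explicitly.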
25.1.3 XML Data Format
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<teams>
    <team>
        <city>New York</city>
        <league>Major</league>
        <name>Yankees</name>
    </team>
    <team>
        <city>New York</city>
        <league>Major</league>
        <name>Mets</name>
    </team>
    <team>
        <city>Boston</city>
        <league>Major</league>
        <name>Red Sox</name>
    </team>
    <team>
        <city>Washington</city>
        <league>Major</league>
        <name>Nationals</name>
    </team>
    <team>
        <city>New Haven</city>
        <league>Minor</league>
        <name>Ravens</name>
    </team>
</teams>
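To show how the tags describe the document structure, here is a minimal sketch that walks the example XML with the built-in xml.etree.ElementTree module. This is a stand-in chosen so the example runs without extra packages, not necessarily the parsing tool used later in this chapter. The variable name teams_xml is an assumption, and the XML declaration line is left out of the string for simplicity.

```python
import xml.etree.ElementTree as ET

# The example XML from this section (declaration line omitted), stored as a string.
teams_xml = """<teams>
    <team><city>New York</city><league>Major</league><name>Yankees</name></team>
    <team><city>New York</city><league>Major</league><name>Mets</name></team>
    <team><city>Boston</city><league>Major</league><name>Red Sox</name></team>
    <team><city>Washington</city><league>Major</league><name>Nationals</name></team>
    <team><city>New Haven</city><league>Minor</league><name>Ravens</name></team>
</teams>"""

# The root element is <teams>; each child is one <team> element.
root = ET.fromstring(teams_xml)
print(root.tag)  # teams

# find() returns the first matching child tag; .text is the content between the tags.
names = [team.find("name").text for team in root.findall("team")]
print(names)     # ['Yankees', 'Mets', 'Red Sox', 'Nationals', 'Ravens']
```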
25.2 Data Fetching Strategies
The strategy we use to fetch data depends on the format of the data we are requesting:
If we have JSON data, we will use the requests package to fetch and parse it.
If we have CSV data, we will use the pandas package to fetch and parse it.
If we have HTML or XML data, we will generally use the requests package to fetch it, and the BeautifulSoup package to parse it.
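As a rough preview, the three strategies might be sketched as the functions below. The function names, the STRATEGIES dictionary, the ten-second timeout, and the example.com URL are all assumptions made for illustration. The third-party packages are imported inside each function so that the sketch loads even when only some of them are installed.

```python
def fetch_json(url):
    """Fetch JSON with requests and parse it into lists and dictionaries."""
    import requests  # third-party package

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on a 4xx or 5xx response
    return response.json()


def fetch_csv(url):
    """Fetch and parse CSV in one step; pandas.read_csv accepts a URL."""
    import pandas as pd  # third-party package

    return pd.read_csv(url)


def fetch_markup(url, parser="html.parser"):
    """Fetch HTML or XML with requests, then parse it with BeautifulSoup."""
    import requests
    from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # "html.parser" ships with Python; the "xml" parser needs lxml installed.
    return BeautifulSoup(response.text, parser)


# Map each data format to the function that handles it.
STRATEGIES = {
    "json": fetch_json,
    "csv": fetch_csv,
    "html": fetch_markup,
    "xml": fetch_markup,
}

# Hypothetical usage (placeholder URL, not a real data source):
# teams = STRATEGIES["json"]("https://example.com/teams.json")
```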
Let’s dive into each one of these methods in detail: