25 Fetching Data from the Internet

So far, we have studied many data processing techniques using example data.

But the real fun comes from working with real-world data.

The Internet is a vast source of data across a variety of domains, including finance.

25.1 Data Formats

In practice, when we fetch data from the Internet, it may come in one of a handful of data formats, including JSON, CSV, XML, and HTML.

  • JSON data consists of nested structures, like the lists and dictionaries we have studied in Python.

  • CSV data is tabular, like a spreadsheet with rows and columns.

  • XML and HTML data use structured "tags" to denote the document structure, like a webpage.

Example data structures in each format are provided below.

25.1.1 JSON Data Format

Example JSON:

[
  {"city": "New York", "name": "Yankees", "league": "Major"},
  {"city": "New York", "name": "Mets", "league": "Major"},
  {"city": "Boston", "name": "Red Sox", "league": "Major"},
  {"city": "Washington", "name": "Nationals", "league": "Major"},
  {"city": "New Haven", "name": "Ravens", "league": "Minor"}
]
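
To see how this maps onto Python lists and dictionaries, here is a minimal sketch that parses the example above with the standard library json module. The variable names (raw, teams, major_names) are illustrative assumptions, and the JSON text is pasted in as a string rather than fetched from the Internet.

```python
import json

# The example JSON from above, held in a string for illustration.
raw = """
[
  {"city": "New York", "name": "Yankees", "league": "Major"},
  {"city": "New York", "name": "Mets", "league": "Major"},
  {"city": "Boston", "name": "Red Sox", "league": "Major"},
  {"city": "Washington", "name": "Nationals", "league": "Major"},
  {"city": "New Haven", "name": "Ravens", "league": "Minor"}
]
"""

# json.loads converts JSON text into Python objects:
# the outer array becomes a list, and each object becomes a dictionary.
teams = json.loads(raw)

print(type(teams))       # the outer structure is a list
print(teams[0]["name"])  # each item is a dictionary, accessed by key

# Ordinary list and dictionary operations apply from here on.
major_names = [team["name"] for team in teams if team["league"] == "Major"]
print(major_names)
```

Once parsed, the data behaves like any other list of dictionaries, so the looping and filtering techniques from earlier chapters carry over directly.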

25.1.2 CSV Data Format

Example CSV:

city,name,league
New York,Mets,Major
New York,Yankees,Major
Boston,Red Sox,Major
Washington,Nationals,Major
New Haven,Ravens,Minor
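
As a rough illustration of the rows-and-columns structure, the sketch below reads the example CSV with the standard library csv module. This is a dependency-free stand-in chosen for the sketch, not the pandas approach this chapter uses for fetching; the variable names are illustrative assumptions.

```python
import csv
import io

# The example CSV from above, held in a string for illustration.
raw = """
city,name,league
New York,Mets,Major
New York,Yankees,Major
Boston,Red Sox,Major
Washington,Nationals,Major
New Haven,Ravens,Minor
""".strip()

# DictReader treats the first line as the header row,
# then yields one dictionary per remaining row, keyed by column name.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

print(reader.fieldnames)  # the column names from the header row
print(rows[0])            # the first data row as a dictionary

# Filter rows by the value in one column.
minor_names = [row["name"] for row in rows if row["league"] == "Minor"]
print(minor_names)
```

Note that every value read this way is a string; numeric columns would need to be converted explicitly.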

25.1.3 XML Data Format

Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<teams>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Yankees</name>
  </team>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Mets</name>
  </team>
  <team>
    <city>Boston</city>
    <league>Major</league>
    <name>Red Sox</name>
  </team>
  <team>
    <city>Washington</city>
    <league>Major</league>
    <name>Nationals</name>
  </team>
  <team>
    <city>New Haven</city>
    <league>Minor</league>
    <name>Ravens</name>
  </team>
</teams>
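
To illustrate how the tags nest, here is a minimal sketch that walks an abbreviated copy of the example XML (two teams only) using the standard library xml.etree.ElementTree module. This is a stand-in chosen to keep the sketch dependency-free, not the BeautifulSoup approach this chapter uses; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# An abbreviated version of the example XML, held in a string.
# The XML declaration must be the very first thing in the text.
raw = """<?xml version="1.0" encoding="UTF-8"?>
<teams>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Yankees</name>
  </team>
  <team>
    <city>New Haven</city>
    <league>Minor</league>
    <name>Ravens</name>
  </team>
</teams>
"""

# fromstring parses the text and returns the root element (<teams>).
root = ET.fromstring(raw)
print(root.tag)

# Each <team> element has child elements; collect each team's
# children into a dictionary of tag name -> text content.
teams = [
    {child.tag: child.text for child in team}
    for team in root.findall("team")
]
print(teams)
```

The result has the same list-of-dictionaries shape as the parsed JSON example, which shows that the three formats can carry the same underlying records.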

25.2 Data Fetching Strategies

The strategy we use to fetch data depends on the format of the data we are requesting:

  • If we have JSON data, we will use the requests package to fetch and parse it.

  • If we have CSV data, we will use the pandas package to fetch and parse it.

  • If we have HTML or XML data, we will generally use the requests package to fetch it, and the BeautifulSoup package to parse it.
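
As a rough preview, the three strategies above might be sketched as follows. The function names, the FETCHERS lookup table, and the fetch helper are hypothetical names introduced for illustration, and the sketch assumes the requests, pandas, and beautifulsoup4 packages are installed. Parsing XML with BeautifulSoup would typically use the "xml" parser, which additionally depends on the lxml package, so only HTML is covered here.

```python
def fetch_json(url):
    """Fetch JSON data and parse it into Python lists and dictionaries."""
    import requests  # imported here so only the needed package is required

    response = requests.get(url)
    response.raise_for_status()  # raise an error for 4xx / 5xx responses
    return response.json()


def fetch_csv(url):
    """Fetch CSV data and parse it into a pandas DataFrame."""
    import pandas as pd

    return pd.read_csv(url)  # read_csv accepts a URL as well as a file path


def fetch_html(url):
    """Fetch an HTML page and parse it into a BeautifulSoup object."""
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Hypothetical lookup table: data format -> fetching strategy.
FETCHERS = {"json": fetch_json, "csv": fetch_csv, "html": fetch_html}


def fetch(url, fmt):
    """Pick the fetching strategy that matches the data format."""
    try:
        fetcher = FETCHERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"Unsupported data format: {fmt!r}") from None
    return fetcher(url)
```

With a real URL in hand, a call such as fetch(some_url, "json") would return parsed Python data, while an unrecognized format raises a ValueError before any network request is made.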

Let’s dive into each one of these methods in detail: