Fetching Data from the Internet

So far, we have studied many data processing techniques using example data.

But the real fun comes from working with real-life data.

The Internet is a boundless source of data, including financial data.

Data Formats

In practice, when we fetch data from the Internet, it usually comes in one of a handful of formats: JSON, CSV, XML, or HTML.

  • JSON is a nested data structure, like the lists and dictionaries we have studied in Python.

  • CSV is tabular data, like a spreadsheet, with rows and columns.

  • XML and HTML use nested "tags" to mark up a document's structure, as in a webpage.

Example JSON:

[
  {"city": "New York", "name": "Yankees", "league":"Major"},
  {"city": "New York", "name": "Mets", "league":"Major"},
  {"city": "Boston", "name": "Red Sox", "league":"Major"},
  {"city": "Washington", "name": "Nationals", "league":"Major"},
  {"city": "New Haven", "name": "Ravens", "league":"Minor"}
]
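
To see how JSON maps onto Python lists and dictionaries, here is a minimal sketch that parses the sample above once it is already in hand as a string. It uses the standard library json module rather than the requests package, since nothing is being fetched here. The variable names json_text, teams, and major_names are illustrative assumptions.

```python
import json

# The JSON example above, held as a plain string.
json_text = """
[
  {"city": "New York", "name": "Yankees", "league": "Major"},
  {"city": "New York", "name": "Mets", "league": "Major"},
  {"city": "Boston", "name": "Red Sox", "league": "Major"},
  {"city": "Washington", "name": "Nationals", "league": "Major"},
  {"city": "New Haven", "name": "Ravens", "league": "Minor"}
]
"""

# json.loads converts the text into ordinary Python objects:
# the outer [...] becomes a list, and each {...} becomes a dictionary.
teams = json.loads(json_text)

# From here on, familiar list and dictionary operations apply.
major_names = [team["name"] for team in teams if team["league"] == "Major"]

print(type(teams).__name__, type(teams[0]).__name__)  # list dict
print(major_names)  # ['Yankees', 'Mets', 'Red Sox', 'Nationals']
```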

Example CSV:

city,name,league
New York,Mets,Major
New York,Yankees,Major
Boston,Red Sox,Major
Washington,Nationals,Major
New Haven,Ravens,Minor
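
To make the rows-and-columns idea concrete, here is a small sketch that reads the CSV sample above from a string. It uses the standard library csv module purely for illustration (not the pandas package used for fetching later). The names csv_text, rows, and minor_names are assumptions made for this example.

```python
import csv
import io

# The CSV example above, held as a plain string.
csv_text = """city,name,league
New York,Mets,Major
New York,Yankees,Major
Boston,Red Sox,Major
Washington,Nationals,Major
New Haven,Ravens,Minor
"""

# DictReader treats the first line as column headers and
# yields one dictionary per remaining row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

minor_names = [row["name"] for row in rows if row["league"] == "Minor"]

print(reader.fieldnames)  # ['city', 'name', 'league']
print(rows[0])            # {'city': 'New York', 'name': 'Mets', 'league': 'Major'}
print(minor_names)        # ['Ravens']
```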

Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<teams>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Yankees</name>
  </team>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Mets</name>
  </team>
  <team>
    <city>Boston</city>
    <league>Major</league>
    <name>Red Sox</name>
  </team>
  <team>
    <city>Washington</city>
    <league>Major</league>
    <name>Nationals</name>
  </team>
  <team>
    <city>New Haven</city>
    <league>Minor</league>
    <name>Ravens</name>
  </team>
</teams>
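
As a rough illustration of how the tag structure can be walked, here is a sketch that parses a shortened copy of the XML sample above (only the first two teams). It uses the standard library xml.etree.ElementTree module for illustration, not the BeautifulSoup package used later. The names xml_text, root, and teams are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the XML example above (first two teams only).
# The XML declaration must be the very first thing in the string.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<teams>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Yankees</name>
  </team>
  <team>
    <city>New York</city>
    <league>Major</league>
    <name>Mets</name>
  </team>
</teams>
"""

# fromstring returns the root element, here <teams>.
root = ET.fromstring(xml_text)

# Each <team> element has child elements; build one dictionary per team,
# using the child's tag name as the key and its text as the value.
teams = [
    {child.tag: child.text for child in team}
    for team in root.findall("team")
]

print(root.tag)   # teams
print(teams[0])   # {'city': 'New York', 'league': 'Major', 'name': 'Yankees'}
```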

The strategy we use to fetch data depends on the format of the data we are requesting:

  • If we have JSON data, we will use the requests package to fetch and parse it.

  • If we have CSV data, we will use the pandas package to fetch and parse it.

  • If we have HTML or XML data, we will generally use the requests package to fetch it, and the BeautifulSoup package to parse it.
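
The three strategies above might be sketched roughly as follows. This is a preview under stated assumptions, not a definitive implementation: the function names fetch_json, fetch_csv, and fetch_html are hypothetical, the example URLs in the usage comments are placeholders, and the code assumes the requests, pandas, and beautifulsoup4 packages are installed.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_json(url):
    """Fetch JSON with requests and return lists / dictionaries."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on a 4xx or 5xx response
    return response.json()       # parse the response body as JSON


def fetch_csv(url):
    """Fetch and parse CSV in one step; pandas accepts a URL directly."""
    return pd.read_csv(url)      # returns a DataFrame


def fetch_html(url):
    """Fetch HTML with requests, then parse it with BeautifulSoup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    return BeautifulSoup(response.text, "html.parser")


# Hypothetical usage (placeholder URLs, not real data sources):
# teams = fetch_json("https://example.com/teams.json")
# teams_df = fetch_csv("https://example.com/teams.csv")
# soup = fetch_html("https://example.com/teams.html")
```

Each function takes only a URL, so the choice of strategy comes down to knowing which format the address will return.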

Let’s dive into each one of these methods in detail: