Joining and Merging

In real world data analyses, it may be valuable to integrate multiple datasets, to gain insights.

Let’s practice merging different datasets together.

In this example, we’ll use a number of different economic and market indicator datasets from the AlphaVantage API.

After fetching the raw data, we’ll prepare each source dataset in such a way that allows it to be merged with the others later. In order for the datasets to be merged, they need some common column of values. Think of a VLOOKUP function in spreadsheet software, or a JOIN clause in an SQL query. We will be performing a similar operation here.

Join strategies: inner vs outer

After we fetch the data, you’ll see of these datasets has a timestamp column, which represents the values using a monthly frequency.

Because we’ll have a timestamp column in each of the datasets, and because the values of this timestamp column will be expressed in the same way (i.e. at the beginning of each month), this allows us to use these common values to merge the datasets.

Fetching Data

We have obtained an AlphaVantage API Key and set it as a notebook secret.

from google.colab import userdata

API_KEY = userdata.get("ALPHAVANTAGE_API_KEY")

Inflation

https://www.alphavantage.co/documentation/#inflation

from pandas import read_csv

request_url = f"https://www.alphavantage.co/query?function=INFLATION&apikey={API_KEY}&datatype=csv"
inflation = read_csv(request_url)
inflation.head()
timestamp value
0 2022-01-01 8.002800
1 2021-01-01 4.697859
2 2020-01-01 1.233584
3 2019-01-01 1.812210
4 2018-01-01 2.442583

Here we have annual data. Doesn’t look like the endpoint provides more frequent intervals. So let’s not use this.

Consumer Price Index (CPI)

https://www.alphavantage.co/documentation/#cpi

CPI is widely regarded as the barometer of inflation levels in the broader economy.

The CPI endpoint does provide access to monthly data. So let’s use CPI as our desired measure of inflation.

from pandas import read_csv

request_url = f"https://www.alphavantage.co/query?function=CPI&interval=monthly&apikey={API_KEY}&datatype=csv"
cpi_df = read_csv(request_url)
cpi_df.rename(columns={"value": "cpi"}, inplace=True)
cpi_df.head()
timestamp cpi
0 2024-05-01 314.069
1 2024-04-01 313.548
2 2024-03-01 312.332
3 2024-02-01 310.326
4 2024-01-01 308.417
print("EARLIEST:", cpi_df.iloc[-1]["timestamp"])
EARLIEST: 1913-01-01
import plotly.express as px

fig = px.line(cpi_df, x="timestamp", y="cpi", title="Consumer Price Index (CPI) by Month", height=350)
fig.show()

Federal Funds Rate

https://www.alphavantage.co/documentation/#interest-rate

request_url = f"https://www.alphavantage.co/query?function=FEDERAL_FUNDS_RATE&interval=monthly&apikey={API_KEY}&datatype=csv"
fed_funds_df = read_csv(request_url)
fed_funds_df.rename(columns={"value": "fed"}, inplace=True)
fed_funds_df.head()
timestamp fed
0 2024-05-01 5.33
1 2024-04-01 5.33
2 2024-03-01 5.33
3 2024-02-01 5.33
4 2024-01-01 5.33
print("EARLIEST:", fed_funds_df["timestamp"].min())
EARLIEST: 1954-07-01
import plotly.express as px

px.line(fed_funds_df, x="timestamp", y="fed", title="Federal Funds Rate by Month", height=350)
px.histogram(fed_funds_df, x="fed", #nbins=12,
        title="Distribution of Federal Funds Rate (Monthly)", height=350)

The Market (S&P 500)

https://www.investopedia.com/articles/investing/122215/spy-spdr-sp-500-trust-etf.asp

The SPDR S&P 500 ETF Trust is one of the most popular funds. It aims to track the Standard & Poor’s (S&P) 500 Index, which comprises 500 large-cap U.S. stocks. These stocks are selected by a committee based on market size, liquidity, and industry. The S&P 500 serves as one of the main benchmarks of the U.S. equity market and indicates the financial health and stability of the economy

https://www.alphavantage.co/documentation/#monthlyadj

We can use the “SPY” ETF as a measure of the market. Looks like the data only covers the past 20 years (see endpoint docs).

from pandas import read_csv

request_url = f"https://www.alphavantage.co/query?function=TIME_SERIES_MONTHLY_ADJUSTED&symbol=SPY&apikey={API_KEY}&datatype=csv"
spy_df = read_csv(request_url)
spy_df.drop(columns=["open", "high", "low", "close", "volume", "dividend amount"], inplace=True)
spy_df.rename(columns={"adjusted close": "spy"}, inplace=True)
spy_df.head()
timestamp spy
0 2024-06-27 546.3700
1 2024-05-31 525.6718
2 2024-04-30 500.3636
3 2024-03-28 521.3857
4 2024-02-29 504.8645
print("ROWS:", len(spy_df))
print("EARLIEST:", spy_df["timestamp"].min())
ROWS: 295
EARLIEST: 1999-12-31
# standardizing the timestamp values so we can merge on them later
# we have to decide to treat "2023-05-31" as "2023-05" or "2023-06"
# since we see the latest value represents the current incompleted month,
# let's "round down" the monthly values

from pandas import to_datetime

spy_df["timestamp"] = to_datetime(spy_df["timestamp"]).dt.strftime("%Y-%m-01")
spy_df.head()
timestamp spy
0 2024-06-01 546.3700
1 2024-05-01 525.6718
2 2024-04-01 500.3636
3 2024-03-01 521.3857
4 2024-02-01 504.8645
from pandas import to_datetime

# packaging up this code into a reusable function becuse we'll have to perform this same operation on multiple datasets (see cells below related to gold and bitcoin)

def round_down_monthly_timestamp(original_df):
    """ Param original_df: pandas DataFrame that has a "timestamp" column of values representing each month"""
    # standardizing the timestamp values so we can merge on them later
    # we have to decide to treat "2023-05-31" as "2023-05-01" or "2023-06-01"
    # since we see the latest value represents the current incompleted month,
    # let's "round down" the monthly values
    # see: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
    original_df["timestamp"] = to_datetime(original_df["timestamp"]).dt.strftime("%Y-%m-01")


round_down_monthly_timestamp(spy_df)
spy_df.head()
timestamp spy
0 2024-06-01 546.3700
1 2024-05-01 525.6718
2 2024-04-01 500.3636
3 2024-03-01 521.3857
4 2024-02-01 504.8645
import plotly.express as px

px.line(spy_df, x="timestamp", y="spy", title="S&P 500 (SPY ETF) Prices by Month", height=350)

Gold

https://www.investopedia.com/articles/investing/122515/gld-ishares-gold-trust-etf.asp

The SPDR Gold Shares ETF (GLD) tracks the price of gold bullion in the over-the-counter (OTC) market.

https://money.usnews.com/investing/funds/slideshows/best-gold-etfs-to-hedge-volatility

The largest gold exchange-traded fund, or ETF, by a wide margin is the SPDR Gold Trust… And as the fund is benchmarked to physical gold, you can get a direct play on gold bullion prices via this ETF.

OK we can perhaps use the “GLD” index fund as a measure of gold prices.

from pandas import read_csv

request_url = f"https://www.alphavantage.co/query?function=TIME_SERIES_MONTHLY_ADJUSTED&symbol=GLD&apikey={API_KEY}&datatype=csv"
gld_df = read_csv(request_url)
gld_df.drop(columns=["open", "high", "low", "close", "volume", "dividend amount"], inplace=True)
gld_df.rename(columns={"adjusted close": "gld"}, inplace=True)
gld_df.head()
timestamp gld
0 2024-06-27 214.99
1 2024-05-31 215.30
2 2024-04-30 211.87
3 2024-03-28 205.72
4 2024-02-29 189.31
print("ROWS:", len(gld_df))
print("EARLIEST:", gld_df["timestamp"].min())
ROWS: 235
EARLIEST: 2004-12-31
round_down_monthly_timestamp(gld_df)
gld_df.head()
timestamp gld
0 2024-06-01 214.99
1 2024-05-01 215.30
2 2024-04-01 211.87
3 2024-03-01 205.72
4 2024-02-01 189.31
import plotly.express as px

px.line(gld_df, x="timestamp", y="gld", title="Gold (GLD ETF) Prices by Month", height=350)

Bitcoin

https://www.alphavantage.co/documentation/#currency-monthly

The earliest Bitcoin data we have is from 2020.

request_url = f"https://www.alphavantage.co/query?function=DIGITAL_CURRENCY_MONTHLY&symbol=BTC&market=USD&apikey={API_KEY}&datatype=csv"
btc_df = read_csv(request_url)
btc_df = btc_df[["timestamp", "close"]]
btc_df.rename(columns={"close": "btc"}, inplace=True)
print(len(btc_df))
btc_df.head()
9
timestamp btc
0 2024-06-28 61499.77
1 2024-05-31 67472.41
2 2024-04-30 60622.10
3 2024-03-31 71291.28
4 2024-02-29 61179.03
print("ROWS:", len(btc_df))
print("EARLIEST:", btc_df["timestamp"].min())
ROWS: 9
EARLIEST: 2023-10-31
round_down_monthly_timestamp(btc_df)
btc_df.head()
timestamp btc
0 2024-06-01 61499.77
1 2024-05-01 67472.41
2 2024-04-01 60622.10
3 2024-03-01 71291.28
4 2024-02-01 61179.03
import plotly.express as px

px.line(btc_df, x="timestamp", y="btc", title="Bitcoin Prices by Month, in USD", height=350)

Merging Data

Let’s merge all datasets together, on the basis of their common date values (i.e. the “timestamp” column common across all datasets).

We use a precise merge on matching timestamp values, instead of making assumptions about the row frequency and order of source datasets. In this way, the merge operation is similar to a VLOOKUP operation in spreadsheet software.

You’ll notice we have been renaming the columns in the source datasets, and ensuring their “timestamp” values are represented in a standardized way (i.e. all at the beginning of month), to facilitate a clean merge on these common values.

print(cpi_df.columns.tolist())
print(fed_funds_df.columns.tolist())
print(spy_df.columns.tolist())
print(gld_df.columns.tolist())
#print(btc_df.columns.tolist())
['timestamp', 'cpi']
['timestamp', 'fed']
['timestamp', 'spy']
['timestamp', 'gld']
cpi_df.head()
timestamp cpi
0 2024-05-01 314.069
1 2024-04-01 313.548
2 2024-03-01 312.332
3 2024-02-01 310.326
4 2024-01-01 308.417
fed_funds_df.head()
timestamp fed
0 2024-05-01 5.33
1 2024-04-01 5.33
2 2024-03-01 5.33
3 2024-02-01 5.33
4 2024-01-01 5.33
spy_df.head()
timestamp spy
0 2024-06-01 546.3700
1 2024-05-01 525.6718
2 2024-04-01 500.3636
3 2024-03-01 521.3857
4 2024-02-01 504.8645
gld_df.head()
timestamp gld
0 2024-06-01 214.99
1 2024-05-01 215.30
2 2024-04-01 211.87
3 2024-03-01 205.72
4 2024-02-01 189.31

In order to merge the datasets, we’ll use the merge method.

We start by referencing one of the dataframes we want to merge, and then invoke the merge method on it. We pass as a parameter to the merge method another dataframe we want to merge, and also specify the join strategy. In this case we specify which column to join on, and how to join.

Here we are using an “inner” join strategy, to keep only the rows that have matching timestamp values across all datasets.

We can merge any number of additional dataframes in the same way, until we arrive at the final merged dataset.

df = cpi_df.merge(fed_funds_df, on="timestamp", how="inner")
df = df.merge(spy_df, on="timestamp", how="inner")
df = df.merge(gld_df, on="timestamp", how="inner")
#df = df.merge(btc_df, on="timestamp", how="inner")
df.index = df["timestamp"]
df.tail()
timestamp cpi fed spy gld
timestamp
2005-04-01 2005-04-01 194.6 2.79 80.2135 43.35
2005-03-01 2005-03-01 193.3 2.63 81.7450 42.82
2005-02-01 2005-02-01 191.8 2.50 83.2672 43.52
2005-01-01 2005-01-01 190.7 2.28 81.5622 42.22
2004-12-01 2004-12-01 190.3 2.16 83.4328 43.80

Notice, the resulting merged dataset starts in 2020, because that is the earliest data available across ALL datasets (as constrained by the Bitcoin dataset). For analyses that only involve two of these indicators, it may be worth it to create a separate merged dataset from only those two (not including Bitcoin), to obtain as much historical context as possible.

len(df)
234

Now that we have a single merged dataset, let’s save it for later, so we can use it to illustrate some basic statistical concepts and techniques.

df.to_csv("monthly-indicators.csv", index=False)