15  Joining and Merging

In practice, there are times when it is valuable to integrate multiple datasets into a single unified dataset, for further analysis.

This allows us to gain deeper insights and a more comprehensive understanding.

For example, a financial analyst may merge transaction-level data with customer demographic information to identify spending patterns, or combine market price data with macroeconomic indicators to better model asset performance.

If you’ve used the VLOOKUP function in spreadsheet software, or the JOIN clause in SQL, you’ve already merged datasets. Let’s explore how to merge datasets in Python.

15.1 Merging Data Frames

We can use the merge function from pandas to join two datasets together. The merge function accepts two DataFrame objects as initial parameters, as well as a how parameter to indicate the join strategy (see “Join Strategies” section below). There are additional parameters to denote which columns or indices to join on.

from pandas import merge

merged_df = merge(df1, df2, how="inner", on="id")
#merged_df = merge(merged_df, df3, how="inner", on="id")
#merged_df = merge(merged_df, df4, how="inner", on="id")
Note

FYI: the merge method of the DataFrame provides an alternative object-oriented interface:

merged_df = df1.merge(df2, how="inner", on="id")
#merged_df = merged_df.merge(df3, how="inner", on="id")
#merged_df = merged_df.merge(df4, how="inner", on="id")

In this approach, we invoke the merge method on one of the DataFrame objects, and pass the other as a parameter.

15.2 Join Strategies

When we read the documentation for these merge methods, we see there are many options for the how parameter, including “inner”, left”, “right”, and “outer”.

To know which value to choose in a particular situation, let’s discuss different these join strategies in more detail.

Join Strategies: inner vs outer joins
  • Inner Join: Keeps only rows where the join value is matching across both datasets.
  • Left Outer Join: Keeps all rows from the left dataset, and matches data from the right where available.
  • Right Outer Join: Keeps all rows from the right dataset and matches data from the left where available.
  • Full Outer Join: Keeps all rows and columns from both datasets. Values will be null if there isn’t a match.

15.3 Example 1: Customers and Transactions

To further illustrate the different join strategies, let’s consider the following two datasets about customers and transactions:

Code
from pandas import DataFrame

customers = DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ["Aisha", "Elena", "Carlos", "Kwame"],
    'Age': [25, 34, 29, 42],
    'City': ["New York", "Chicago", "New York", "Houston"]
})
print("CUSTOMERS:")
customers.head()
CUSTOMERS:
CustomerID Name Age City
0 1 Aisha 25 New York
1 2 Elena 34 Chicago
2 3 Carlos 29 New York
3 4 Kwame 42 Houston
Code
transactions = DataFrame({
    'CustomerID': [1, 2, 1, 5],
    'TransactionID': [1001, 1002, 1003, 1004],
    'Amount': [200.50, 150.75, 300.00, 400.25],
    'Date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18']
})
print("TRANSACTIONS:")
transactions.head()
TRANSACTIONS:
CustomerID TransactionID Amount Date
0 1 1001 200.50 2024-01-15
1 2 1002 150.75 2024-01-16
2 1 1003 300.00 2024-01-17
3 5 1004 400.25 2024-01-18
Note

It looks like there is a “one to many” relationship between the customers and their transactions, where a single customer may have multiple transactions.

Both of the datasets share a “CustomerID” column, which seems to be an index reference that uniquely identifies a given customer. There seem to be some values in the “CustomerId” column that are common across both datasets, so it will be possible to join the datasets on this basis.

print("CUSTOMER IDS PRESENT IN...")
print("CUSTOMERS DATASET:", customers["CustomerID"].unique().tolist())
print("TRANSACTIONS DATASET:", transactions["CustomerID"].unique().tolist())
CUSTOMER IDS PRESENT IN...
CUSTOMERS DATASET: [1, 2, 3, 4]
TRANSACTIONS DATASET: [1, 2, 5]

We see customers #1 and #2 are present in both datasets. However there isn’t a perfect overlap, as customer #5 is not present in the customers dataset, and customers #3 and #4 are not present in the transactions dataset.

Let’s join these same two datasets in a variety of different ways, to illustrate each join strategy.

15.3.1 Inner Join

Performing an inner join:

from pandas import merge

merged_df = merge(customers, transactions, how="inner", on="CustomerID")
merged_df
CustomerID Name Age City TransactionID Amount Date
0 1 Aisha 25 New York 1001 200.50 2024-01-15
1 1 Aisha 25 New York 1003 300.00 2024-01-17
2 2 Elena 34 Chicago 1002 150.75 2024-01-16
print("INNER JOIN:", merged_df["CustomerID"].unique().tolist())
INNER JOIN: [1, 2]

The merged dataset contains only the rows common across both source datasets. We see columns from both datasets, without any resulting null values.

15.3.2 Left Outer Join

Performing a left join:

merged_df = merge(customers, transactions, how="left", on="CustomerID")
merged_df
CustomerID Name Age City TransactionID Amount Date
0 1 Aisha 25 New York 1001.0 200.50 2024-01-15
1 1 Aisha 25 New York 1003.0 300.00 2024-01-17
2 2 Elena 34 Chicago 1002.0 150.75 2024-01-16
3 3 Carlos 29 New York NaN NaN NaN
4 4 Kwame 42 Houston NaN NaN NaN
print("LEFT JOIN:", merged_df["CustomerID"].unique().tolist())
LEFT JOIN: [1, 2, 3, 4]

The merged dataset contains the entire left dataset, with additional columns from the right. Merged values from the right will be null in the event there was no match.

15.3.3 Right Outer Join

Performing a right join:

merged_df = merge(customers, transactions, how="right", on="CustomerID")
merged_df
CustomerID Name Age City TransactionID Amount Date
0 1 Aisha 25.0 New York 1001 200.50 2024-01-15
1 2 Elena 34.0 Chicago 1002 150.75 2024-01-16
2 1 Aisha 25.0 New York 1003 300.00 2024-01-17
3 5 NaN NaN NaN 1004 400.25 2024-01-18
print("RIGHT JOIN:", merged_df["CustomerID"].unique().tolist())
RIGHT JOIN: [1, 2, 5]

The merged dataset contains the entire right dataset, with additional columns from the left. Merged values from the left will be null in the event there was no match.

15.3.4 Full Outer Join

Performing a full outer join:

merged_df = merge(customers, transactions, how="outer", on="CustomerID")
merged_df
CustomerID Name Age City TransactionID Amount Date
0 1 Aisha 25.0 New York 1001.0 200.50 2024-01-15
1 1 Aisha 25.0 New York 1003.0 300.00 2024-01-17
2 2 Elena 34.0 Chicago 1002.0 150.75 2024-01-16
3 3 Carlos 29.0 New York NaN NaN NaN
4 4 Kwame 42.0 Houston NaN NaN NaN
5 5 NaN NaN NaN 1004.0 400.25 2024-01-18
print("FULL OUTER JOIN:", merged_df["CustomerID"].unique().tolist())
FULL OUTER JOIN: [1, 2, 3, 4, 5]

We see all the rows and columns from both datasets. The values from the left and right are null if there wasn’t a match.

15.4 Example 2: Treasury Yield Curves

Let’s practice merging different datasets together, in order to perform some real world financial analysis. e will merge the data to explore insights about inverted yield curves.

In this example, we’ll use two different datasets of treasury yields from the AlphaVantage API: one with a short-term (three month) maturity, and the other with a long-term (ten year) maturity.

Data Source

These datasets use the Treasury Yield endpoint from the AlphaVantage API. This endpoint returns the daily, weekly, or monthly US treasury yield for a given maturity timeline (3-month, 10-year, 30-year, etc.).

Source: Board of Governors of the Federal Reserve System (US), Market Yield on U.S. Treasury Securities at 3-month, 2-year, 5-year, 7-year, 10-year, and 30-year Constant Maturities, Quoted on an Investment Basis, retrieved from FRED, Federal Reserve Bank of St. Louis.

15.4.1 Dataset A: Long-term Maturity

Fetching a dataset of long-term treasury yields maturing in ten years:

df_10y = fetch_treasury_yields(maturity="10year")
df_10y.head()
10year
timestamp
2024-11-01 4.36
2024-10-01 4.10
2024-09-01 3.72
2024-08-01 3.87
2024-07-01 4.25
print("TIME RANGE (10 YEAR MATURITY):")
print(df_10y.index.min(), "...", df_10y.index.max())
TIME RANGE (10 YEAR MATURITY):
1953-04-01 ... 2024-11-01

We see this dataset structure is a row per month, with a “timestamp” index covering a period from 1953 through the present.

15.4.2 Dataset B: Short-term Maturity

Fetching a dataset of short-term treasury yields:

df_3mo = fetch_treasury_yields(maturity="3month")
df_3mo.head()
3month
timestamp
2024-11-01 4.62
2024-10-01 4.72
2024-09-01 4.92
2024-08-01 5.30
2024-07-01 5.43
print("TIME RANGE (3 MONTH MATURITY):")
print(df_3mo.index.min(), "...", df_3mo.index.max())
TIME RANGE (3 MONTH MATURITY):
1981-09-01 ... 2024-11-01

We see this dataset structure is a row per month, with a “timestamp” index covering a period from 1981 through the present.

15.5 Merged Dataset

Let’s merge these datasets together, on the basis of their common values (in this case the “timestamp” index common across both datasets):

merged_df = df_10y.merge(df_3mo, how="inner", left_index=True, right_index=True)
merged_df.head()
10year 3month
timestamp
2024-11-01 4.36 4.62
2024-10-01 4.10 4.72
2024-09-01 3.72 4.92
2024-08-01 3.87 5.30
2024-07-01 4.25 5.43

In this case we specify we want to join on the index from both datasets. Here we are using an “inner” join strategy, to keep only the rows that have matching timestamp values across both datasets.

print("TIME RANGE (MERGED DATASET):")
print(merged_df.index.min())
print(merged_df.index.max())
TIME RANGE (MERGED DATASET):
1981-09-01
2024-11-01

Notice, the resulting merged dataset starts in 1981, because we used an “inner” join, and because that is the earliest month shared across ALL source datasets (as constrained by the 3-mo maturity dataset).

15.6 Analyzing Merged Data

The reason why we went through the trouble of merging the data is so we can compare the two datasets more easily, for example to identify which periods had an “inverted yield curve” where the short-term yield was greater than the long-term yield.

We can identify periods of inversion visually by plotting both datasets on the same graph:

import plotly.express as px

px.line(merged_df, y=["10year", "3month"], height=450,
    title="US Treasury Yields",
    labels={"timestamp": "Date", "value": "Yield (%)", "variable": "Maturity"},
    color_discrete_map={"10year": "steelblue", "3month": "orange"},
)

Alternative plotting with shaded regions of inversion:

Code
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
  go.Scatter(x=merged_df.index, y=merged_df["10year"],
            name="10year", mode="lines", line=dict(color="steelblue")
      )
)

fig.add_trace(
  go.Scatter(x=merged_df.index, y=merged_df["3month"],
            name="3month", mode="lines", line=dict(color="orange")
            )
)

condition = merged_df["3month"] > merged_df["10year"]
shaded_regions = []
start = None

for i in range(len(condition)):
    if condition.iloc[i] and start is None:
        start = merged_df.index[i]
    elif not condition.iloc[i] and start is not None:
        shaded_regions.append((start, merged_df.index[i]))
        start = None

# If the last region is still open
if start is not None:
    shaded_regions.append((start, merged_df.index[-1]))

# Add shaded regions to the plot
for start, end in shaded_regions:
    fig.add_shape(
        type="rect",
        x0=start, x1=end,
        y0=min(merged_df["3month"].min(), merged_df["10year"].min()),
        y1=max(merged_df["3month"].max(), merged_df["10year"].max()),
        fillcolor="rgba(255, 165, 0, 0.3)",  # Semi-transparent orange
        line_width=0,
    )

fig.update_layout(
    title="US Treasury Yields",
    xaxis_title="Date",
    yaxis_title="Yield (%)",
    legend_title="Maturity",
    height=450
    #template="plotly_white"
)

fig.show()