16 Joining and Merging

In practice, there are times when it is valuable to integrate multiple datasets into a single unified dataset, for further analysis.

This allows us to gain deeper insights and a more comprehensive understanding.

For example, a financial analyst may merge transaction-level data with customer demographic information to identify spending patterns, or combine market price data with macroeconomic indicators to better model asset performance.

If you’ve used the VLOOKUP function in spreadsheet software, or the JOIN clause in SQL, you’ve already merged datasets. Let’s explore how to merge datasets in Python.

16.1 Merging DataFrames

We can use the merge function from pandas to join two datasets together. The merge function accepts two DataFrame objects as initial parameters, as well as a how parameter to indicate the join strategy (see “Join Strategies” section below). There are additional parameters to denote which columns or indices to join on.

from pandas import merge

merged_df = merge(df1, df2, how="inner", on="id")
#merged_df = merge(merged_df, df3, how="inner", on="id")
#merged_df = merge(merged_df, df4, how="inner", on="id")

Note

FYI: the merge method of the DataFrame provides an alternative object-oriented interface:

merged_df = df1.merge(df2, how="inner", on="id")
#merged_df = merged_df.merge(df3, how="inner", on="id")
#merged_df = merged_df.merge(df4, how="inner", on="id")

In this approach, we invoke the merge method on one of the DataFrame objects, and pass the other as a parameter.

16.2 Join Strategies

When we read the documentation for these merge methods, we see there are many options for the how parameter, including “inner”, left”, “right”, and “outer”.

To know which value to choose in a particular situation, let’s discuss different these join strategies in more detail.

Inner Join: Keeps only rows where the join value is matching across both datasets.
Left Outer Join: Keeps all rows from the left dataset, and matches data from the right where available.
Right Outer Join: Keeps all rows from the right dataset and matches data from the left where available.
Full Outer Join: Keeps all rows and columns from both datasets. Values will be null if there isn’t a match.

16.3 Example 1: Customers and Transactions

To further illustrate the different join strategies, let’s consider the following two datasets about customers and transactions:

Code

from pandas import DataFrame

customers = DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ["Aisha", "Elena", "Carlos", "Kwame"],
    'Age': [25, 34, 29, 42],
    'City': ["New York", "Chicago", "New York", "Houston"]
})
print("CUSTOMERS:")
customers.head()

CUSTOMERS:

	CustomerID	Name	Age	City
0	1	Aisha	25	New York
1	2	Elena	34	Chicago
2	3	Carlos	29	New York
3	4	Kwame	42	Houston

Code

transactions = DataFrame({
    'CustomerID': [1, 2, 1, 5],
    'TransactionID': [1001, 1002, 1003, 1004],
    'Amount': [200.50, 150.75, 300.00, 400.25],
    'Date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18']
})
print("TRANSACTIONS:")
transactions.head()

TRANSACTIONS:

	CustomerID	TransactionID	Amount	Date
0	1	1001	200.50	2024-01-15
1	2	1002	150.75	2024-01-16
2	1	1003	300.00	2024-01-17
3	5	1004	400.25	2024-01-18

Note

It looks like there is a “one to many” relationship between the customers and their transactions, where a single customer may have multiple transactions.

Both of the datasets share a “CustomerID” column, which seems to be an index reference that uniquely identifies a given customer. There seem to be some values in the “CustomerId” column that are common across both datasets, so it will be possible to join the datasets on this basis.

print("CUSTOMER IDS PRESENT IN...")
print("CUSTOMERS DATASET:", customers["CustomerID"].unique().tolist())
print("TRANSACTIONS DATASET:", transactions["CustomerID"].unique().tolist())

CUSTOMER IDS PRESENT IN...
CUSTOMERS DATASET: [1, 2, 3, 4]
TRANSACTIONS DATASET: [1, 2, 5]

We see customers #1 and #2 are present in both datasets. However there isn’t a perfect overlap, as customer #5 is not present in the customers dataset, and customers #3 and #4 are not present in the transactions dataset.

Let’s join these same two datasets in a variety of different ways, to illustrate each join strategy.

16.3.1 Inner Join

Performing an inner join:

from pandas import merge

merged_df = merge(customers, transactions, how="inner", on="CustomerID")
merged_df

	CustomerID	Name	Age	City	TransactionID	Amount	Date
0	1	Aisha	25	New York	1001	200.50	2024-01-15
1	1	Aisha	25	New York	1003	300.00	2024-01-17
2	2	Elena	34	Chicago	1002	150.75	2024-01-16

print("INNER JOIN:", merged_df["CustomerID"].unique().tolist())

INNER JOIN: [1, 2]

The merged dataset contains only the rows common across both source datasets. We see columns from both datasets, without any resulting null values.

16.3.2 Left Outer Join

Performing a left join:

merged_df = merge(customers, transactions, how="left", on="CustomerID")
merged_df

	CustomerID	Name	Age	City	TransactionID	Amount	Date
0	1	Aisha	25	New York	1001.0	200.50	2024-01-15
1	1	Aisha	25	New York	1003.0	300.00	2024-01-17
2	2	Elena	34	Chicago	1002.0	150.75	2024-01-16
3	3	Carlos	29	New York	NaN	NaN	NaN
4	4	Kwame	42	Houston	NaN	NaN	NaN

print("LEFT JOIN:", merged_df["CustomerID"].unique().tolist())

LEFT JOIN: [1, 2, 3, 4]

The merged dataset contains the entire left dataset, with additional columns from the right. Merged values from the right will be null in the event there was no match.

16.3.3 Right Outer Join

Performing a right join:

merged_df = merge(customers, transactions, how="right", on="CustomerID")
merged_df

	CustomerID	Name	Age	City	TransactionID	Amount	Date
0	1	Aisha	25.0	New York	1001	200.50	2024-01-15
1	2	Elena	34.0	Chicago	1002	150.75	2024-01-16
2	1	Aisha	25.0	New York	1003	300.00	2024-01-17
3	5	NaN	NaN	NaN	1004	400.25	2024-01-18

print("RIGHT JOIN:", merged_df["CustomerID"].unique().tolist())

RIGHT JOIN: [1, 2, 5]

The merged dataset contains the entire right dataset, with additional columns from the left. Merged values from the left will be null in the event there was no match.

16.3.4 Full Outer Join

Performing a full outer join:

merged_df = merge(customers, transactions, how="outer", on="CustomerID")
merged_df

	CustomerID	Name	Age	City	TransactionID	Amount	Date
0	1	Aisha	25.0	New York	1001.0	200.50	2024-01-15
1	1	Aisha	25.0	New York	1003.0	300.00	2024-01-17
2	2	Elena	34.0	Chicago	1002.0	150.75	2024-01-16
3	3	Carlos	29.0	New York	NaN	NaN	NaN
4	4	Kwame	42.0	Houston	NaN	NaN	NaN
5	5	NaN	NaN	NaN	1004.0	400.25	2024-01-18

print("FULL OUTER JOIN:", merged_df["CustomerID"].unique().tolist())

FULL OUTER JOIN: [1, 2, 3, 4, 5]

We see all the rows and columns from both datasets. The values from the left and right are null if there wasn’t a match.

16.4 Example 2: Treasury Yield Curves

Let’s practice merging different datasets together, in order to perform some real world financial analysis. e will merge the data to explore insights about inverted yield curves.

In this example, we’ll use two different datasets of treasury yields from the AlphaVantage API: one with a short-term (three month) maturity, and the other with a long-term (ten year) maturity.

Data Source

These datasets use the Treasury Yield endpoint from the AlphaVantage API. This endpoint returns the daily, weekly, or monthly US treasury yield for a given maturity timeline (3-month, 10-year, 30-year, etc.).

Source: Board of Governors of the Federal Reserve System (US), Market Yield on U.S. Treasury Securities at 3-month, 2-year, 5-year, 7-year, 10-year, and 30-year Constant Maturities, Quoted on an Investment Basis, retrieved from FRED, Federal Reserve Bank of St. Louis.

16.4.1 Dataset A: Long-term Maturity

Fetching a dataset of long-term treasury yields maturing in ten years:

df_10y = fetch_treasury_yields(maturity="10year")
df_10y.head()

	10year
timestamp
2024-11-01	4.36
2024-10-01	4.10
2024-09-01	3.72
2024-08-01	3.87
2024-07-01	4.25

print("TIME RANGE (10 YEAR MATURITY):")
print(df_10y.index.min(), "...", df_10y.index.max())

TIME RANGE (10 YEAR MATURITY):
1953-04-01 ... 2024-11-01

We see this dataset structure is a row per month, with a “timestamp” index covering a period from 1953 through the present.

16.4.2 Dataset B: Short-term Maturity

Fetching a dataset of short-term treasury yields:

df_3mo = fetch_treasury_yields(maturity="3month")
df_3mo.head()

	3month
timestamp
2024-11-01	4.62
2024-10-01	4.72
2024-09-01	4.92
2024-08-01	5.30
2024-07-01	5.43

print("TIME RANGE (3 MONTH MATURITY):")
print(df_3mo.index.min(), "...", df_3mo.index.max())

TIME RANGE (3 MONTH MATURITY):
1981-09-01 ... 2024-11-01

We see this dataset structure is a row per month, with a “timestamp” index covering a period from 1981 through the present.

16.5 Merged Dataset

Let’s merge these datasets together, on the basis of their common values (in this case the “timestamp” index common across both datasets):

merged_df = df_10y.merge(df_3mo, how="inner", left_index=True, right_index=True)
merged_df.head()

	10year	3month
timestamp
2024-11-01	4.36	4.62
2024-10-01	4.10	4.72
2024-09-01	3.72	4.92
2024-08-01	3.87	5.30
2024-07-01	4.25	5.43

In this case we specify we want to join on the index from both datasets. Here we are using an “inner” join strategy, to keep only the rows that have matching timestamp values across both datasets.

print("TIME RANGE (MERGED DATASET):")
print(merged_df.index.min())
print(merged_df.index.max())

TIME RANGE (MERGED DATASET):
1981-09-01
2024-11-01

Notice, the resulting merged dataset starts in 1981, because we used an “inner” join, and because that is the earliest month shared across ALL source datasets (as constrained by the 3-mo maturity dataset).

16.6 Analyzing Merged Data

The reason why we went through the trouble of merging the data is so we can compare the two datasets more easily, for example to identify which periods had an “inverted yield curve” where the short-term yield was greater than the long-term yield.

We can identify periods of inversion visually by plotting both datasets on the same graph:

import plotly.express as px

px.line(merged_df, y=["10year", "3month"], height=450,
    title="US Treasury Yields",
    labels={"timestamp": "Date", "value": "Yield (%)", "variable": "Maturity"},
    color_discrete_map={"10year": "steelblue", "3month": "orange"},
)

Alternative plotting with shaded regions of inversion:

Code

import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
  go.Scatter(x=merged_df.index, y=merged_df["10year"],
            name="10year", mode="lines", line=dict(color="steelblue")
      )
)

fig.add_trace(
  go.Scatter(x=merged_df.index, y=merged_df["3month"],
            name="3month", mode="lines", line=dict(color="orange")
            )
)

condition = merged_df["3month"] > merged_df["10year"]
shaded_regions = []
start = None

for i in range(len(condition)):
    if condition.iloc[i] and start is None:
        start = merged_df.index[i]
    elif not condition.iloc[i] and start is not None:
        shaded_regions.append((start, merged_df.index[i]))
        start = None

# If the last region is still open
if start is not None:
    shaded_regions.append((start, merged_df.index[-1]))

# Add shaded regions to the plot
for start, end in shaded_regions:
    fig.add_shape(
        type="rect",
        x0=start, x1=end,
        y0=min(merged_df["3month"].min(), merged_df["10year"].min()),
        y1=max(merged_df["3month"].max(), merged_df["10year"].max()),
        fillcolor="rgba(255, 165, 0, 0.3)",  # Semi-transparent orange
        line_width=0,
    )

fig.update_layout(
    title="US Treasury Yields",
    xaxis_title="Date",
    yaxis_title="Yield (%)",
    legend_title="Maturity",
    height=450
    #template="plotly_white"
)

fig.show()