18  Correlation

Correlation is a measure of how two datasets are related to each other.

Reference: https://www.investopedia.com/terms/c/correlation.asp

Investment managers, traders, and analysts find it very important to calculate correlation because the risk reduction benefits of diversification rely on this statistic.

To examine correlation, let’s revisit our familiar dataset of economic indicators:

from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/monthly-indicators.csv"

df = read_csv(request_url)
df.head()
    timestamp      cpi   fed       spy     gld
0  2024-05-01  314.069  5.33  525.6718  215.30
1  2024-04-01  313.548  5.33  500.3636  211.87
2  2024-03-01  312.332  5.33  521.3857  205.72
3  2024-02-01  310.326  5.33  504.8645  189.31
4  2024-01-01  308.417  5.33  479.8240  188.45
print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())
234
2004-12-01 ... 2024-05-01

18.1 Correlation Considerations

Let’s test for correlation more formally.

Some methods for calculating correlation assume that the data is normally distributed, or that the sample is sufficiently large, so we should keep both in mind when deciding whether we can calculate correlation and which method to use.

18.1.1 Parametric vs Nonparametric Methods

Reference: https://www.investopedia.com/terms/n/nonparametric-method.asp

The nonparametric method refers to a type of statistic that does not make any assumptions about the characteristics of the sample (its parameters) or whether the observed data is quantitative or qualitative.

Nonparametric statistics can include certain descriptive statistics, statistical models, inference, and statistical tests. The model structure of nonparametric methods is not specified a priori but is instead determined from data.

Common nonparametric tests include Chi-Square, Wilcoxon rank-sum test, Kruskal-Wallis test, and Spearman’s rank-order correlation.

In contrast, well-known statistical methods such as ANOVA, Pearson’s correlation, t-test, and others do make assumptions about the data being analyzed. One of the most common parametric assumptions is that population data have a “normal distribution.”
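Before choosing a parametric method, it can help to check whether the data looks roughly normal. The following is a minimal sketch using the Shapiro-Wilk test from scipy, which is one common normality check and not something the text above prescribes. The two samples are synthetic and exist only for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# synthetic samples, for illustration only (not the indicators dataset)
rng = np.random.default_rng(seed=42)
normal_sample = rng.normal(loc=0, scale=1, size=200)
skewed_sample = rng.exponential(scale=1, size=200)

# Shapiro-Wilk tests the null hypothesis that a sample came from a normal distribution
normal_stat, normal_p = shapiro(normal_sample)
skewed_stat, skewed_p = shapiro(skewed_sample)

# a small p-value (for example, below 0.05) is evidence against normality
print("NORMAL SAMPLE:", normal_stat, normal_p)
print("SKEWED SAMPLE:", skewed_stat, skewed_p)
```

If the test rejects normality for one of our columns, that is a reason to prefer a nonparametric method for that column.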

18.2 Calculating Correlation with scipy

We can calculate correlation between two lists of numbers, using the pearsonr and spearmanr functions from the scipy package.

One difference between these two correlation methods is that Spearman is more robust to (i.e. less affected by) outliers. Spearman is also nonparametric, so it does not assume our data is normally distributed.
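To see the effect of an outlier, here is a small sketch using made-up numbers (not the indicators dataset). It takes a perfectly linear relationship, replaces a single point with an extreme value, and compares how each coefficient reacts.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 21, dtype=float)  # 1, 2, ..., 20
y_clean = x.copy()                 # perfectly linear relationship

y_outlier = x.copy()
y_outlier[-1] = -100.0             # replace the last point with an extreme outlier

# both result objects unpack into (coefficient, p-value)
pearson_clean, _ = pearsonr(x, y_clean)
spearman_clean, _ = spearmanr(x, y_clean)
pearson_outlier, _ = pearsonr(x, y_outlier)
spearman_outlier, _ = spearmanr(x, y_outlier)

print("CLEAN:  ", pearson_clean, spearman_clean)
print("OUTLIER:", pearson_outlier, spearman_outlier)
```

With the outlier in place, the Pearson coefficient drops below zero, while the Spearman coefficient stays clearly positive, because Spearman only sees that one point's rank has changed, not how far away it is.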

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

Pearson correlation coefficient and p-value for testing non-correlation.

The Pearson correlation coefficient [1] measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
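As a sanity check on what the coefficient measures, the sketch below computes Pearson's r by hand, as the sum of products of deviations divided by the square root of the product of the summed squared deviations, and compares it with the value from pearsonr. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

# synthetic data, for illustration only
rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# deviations of each observation from its own mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson's r: co-variation of x and y, scaled by the variation in each
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy)
```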

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

Calculate a Spearman correlation coefficient with associated p-value.

The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
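One way to build intuition for "monotonicity" is to note that the Spearman coefficient equals the Pearson coefficient computed on the ranks of the data. The sketch below illustrates this with a made-up relationship that is monotonic but not linear.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

x = np.linspace(0, 5, 30)
y = np.exp(x)  # always increasing, but far from a straight line

rho, _ = spearmanr(x, y)  # rank-based: only the ordering matters
r, _ = pearsonr(x, y)     # linear: penalized by the curvature

# Spearman is Pearson applied to the ranks of each variable
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print("SPEARMAN:", rho)
print("PEARSON:", r)
print("PEARSON ON RANKS:", r_on_ranks)
```

Because y always increases with x, the Spearman coefficient reports an exact monotonic relationship, while the Pearson coefficient falls short of 1.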

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. Although calculation of the p-value does not make strong assumptions about the distributions underlying the samples, it is only accurate for very large samples (>500 observations). For smaller sample sizes, consider a permutation test instead (see docs for examples).
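The permutation test mentioned above can be sketched with the permutation_test function from scipy.stats (available in recent scipy versions). This follows the general pattern shown in the scipy docs, but the data, sample size, and number of resamples here are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import permutation_test, spearmanr

# small synthetic sample, for illustration only
rng = np.random.default_rng(seed=1)
x = np.arange(12, dtype=float)
y = x**3 + rng.normal(scale=5.0, size=x.size)

def statistic(x_permuted):
    """Spearman coefficient between a reshuffled x and the fixed y."""
    rho, _ = spearmanr(x_permuted, y)
    return rho

# "pairings" shuffles which x value is paired with which y value,
# simulating what the coefficient looks like when there is no relationship
result = permutation_test(
    (x,), statistic,
    permutation_type="pairings",
    vectorized=False,
    n_resamples=2000,
    alternative="two-sided",
)

print("OBSERVED RHO:", result.statistic)
print("PERMUTATION P-VALUE:", result.pvalue)
```

The p-value is the share of reshuffled pairings that produce a coefficient at least as extreme as the observed one, so it does not rely on a large-sample approximation.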

from scipy.stats import pearsonr, spearmanr

x = df["cpi"]
y = df["gld"]

print("-----------")
print("PEARSON:")
result = pearsonr(x, y)
print(result)

print("-----------")
print("SPEARMAN:")
result = spearmanr(x, y)
print(result)
-----------
PEARSON:
PearsonRResult(statistic=np.float64(0.8237168740513686), pvalue=np.float64(4.30112037063362e-59))
-----------
SPEARMAN:
SignificanceResult(statistic=np.float64(0.7906610391016119), pvalue=np.float64(2.4422597634385946e-51))

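Here both methods report a strong positive correlation between CPI and gold (about 0.82 for Pearson and 0.79 for Spearman), with p-values very close to zero. Because our sample has only 234 observations, the Spearman p-value should be treated as approximate.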

What about the correlation between each pair of indicators? We could start to use a loop-based solution, and compare each combination of variables. But there is an easier way.
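For reference, the loop-based approach might look like the following sketch. It uses a small synthetic dataframe (the name demo_df and its columns are placeholders) so that it runs on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# synthetic stand-in for a dataframe of numeric indicators
rng = np.random.default_rng(seed=0)
base = rng.normal(size=50)
demo_df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.5, size=50),  # related to "a"
    "c": rng.normal(size=50),                    # unrelated noise
})

# visit each unordered pair of columns exactly once
pairwise = {}
for col_x, col_y in combinations(demo_df.columns, 2):
    r, p_value = pearsonr(demo_df[col_x], demo_df[col_y])
    pairwise[(col_x, col_y)] = r
    print(col_x, col_y, round(r, 4), p_value)
```

This works, but we have to manage the pairs and the results ourselves.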

18.3 Correlation Matrix with pandas

The correlation matrix is a great way of displaying and communicating the correlation between each pair of variables.

If we have a pandas dataframe, we can use its corr method to produce a correlation matrix, which shows us the “pairwise correlation of columns” (in other words, the correlation of each column with respect to each other column).

# df.corr()
# ... method is "pearson" by default
# ... numeric_only to suppress warning

df.corr(method="pearson", numeric_only=True)
          cpi       fed       spy       gld
cpi  1.000000  0.078102  0.949065  0.823717
fed  0.078102  1.000000  0.172821 -0.263213
spy  0.949065  0.172821  1.000000  0.719160
gld  0.823717 -0.263213  0.719160  1.000000
df.corr(method="spearman", numeric_only=True)
          cpi       fed       spy       gld
cpi  1.000000 -0.102732  0.953588  0.790661
fed -0.102732  1.000000  0.005936 -0.308626
spy  0.953588  0.005936  1.000000  0.714306
gld  0.790661 -0.308626  0.714306  1.000000

We may begin to notice the diagonal of ones. This is because each dataset is perfectly positively correlated with itself.

We may also start to notice the symmetry of values mirrored across the diagonal. In other words, the value in column 1, row 4 is the same as the value in column 4, row 1 (for example, 0.823717 for cpi and gld in the Pearson matrix).
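Both properties can be checked programmatically. The sketch below does so on a synthetic dataframe (demo_df is a placeholder, not the indicators dataset), comparing the matrix with its transpose and inspecting its diagonal.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a dataframe of numeric columns
rng = np.random.default_rng(seed=0)
demo_df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["w", "x", "y", "z"])

cor_mat = demo_df.corr(method="pearson")
values = cor_mat.to_numpy()

# symmetric: the matrix equals its own transpose
is_symmetric = np.allclose(values, values.T)

# unit diagonal: each column is perfectly correlated with itself
has_unit_diagonal = np.allclose(np.diag(values), 1.0)

print(is_symmetric, has_unit_diagonal)
```

Because of the symmetry, only the values on one side of the diagonal carry unique information.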

18.4 Plotting Correlation Matrix

It may not be easy to quickly interpret the rest of the values in the correlation matrix, but if we plot this matrix with colors as a “heat map”, then we will be able to use color to more easily interpret the data and tell a story.

18.4.1 Correlation Heatmap with plotly

We can use the imshow function from plotly to create a correlation heatmap:

import plotly.express as px

# https://plotly.com/python/heatmaps/
# https://plotly.com/python-api-reference/generated/plotly.express.imshow.html

def plot_correlation_matrix(df, method="pearson"):
    """Plot a heatmap of the pairwise correlations between numeric columns.

    Params: method (str): "spearman" or "pearson".
    """
    cor_mat = df.corr(method=method, numeric_only=True)

    title = f"{method.title()} Correlation between Economic Indicators"

    fig = px.imshow(cor_mat,
                    height=450,
                    text_auto=".2f", # round to two decimal places
                    color_continuous_scale="Blues",
                    color_continuous_midpoint=0,
                    labels={"x": "Indicator", "y": "Indicator"},
    )
    # set and center the title (h/t: https://stackoverflow.com/questions/64571789/)
    fig.update_layout(title={"text": title, "x": 0.485, "xanchor": "center"})
    fig.show()
plot_correlation_matrix(df, "pearson")
plot_correlation_matrix(df, "spearman")

What stories can we tell with the correlation heatmap? Which indicators are most positively correlated? Which are most negatively correlated?
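To answer these questions without reading the heatmap by eye, we could rank the unique pairs. The sketch below re-enters the Pearson values from the matrix shown earlier in this chapter, so that it runs on its own. With the real dataframe, `cor_mat` would instead come from `df.corr(method="pearson", numeric_only=True)`.

```python
from itertools import combinations

import pandas as pd

# Pearson correlation values copied from the matrix shown earlier
labels = ["cpi", "fed", "spy", "gld"]
cor_mat = pd.DataFrame(
    [
        [1.000000, 0.078102, 0.949065, 0.823717],
        [0.078102, 1.000000, 0.172821, -0.263213],
        [0.949065, 0.172821, 1.000000, 0.719160],
        [0.823717, -0.263213, 0.719160, 1.000000],
    ],
    index=labels,
    columns=labels,
)

# keep one entry per unordered pair, skipping the diagonal and mirrored values
pairs = pd.Series({(a, b): cor_mat.loc[a, b] for a, b in combinations(labels, 2)})
pairs = pairs.sort_values(ascending=False)

print(pairs)
print("MOST POSITIVE:", pairs.idxmax())
print("MOST NEGATIVE:", pairs.idxmin())
```

By this ranking, cpi and spy are the most positively correlated pair, and fed and gld are the most negatively correlated pair.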

Is gold a hedge against inflation?