Auto-Regressive Models

Auto-Regressive Integrated Moving Average (ARIMA) is a “method for forecasting or predicting future outcomes based on a historical time series. It is based on the statistical concept of serial correlation, where past data points influence future data points.” - Source: Investopedia

An ARIMA model has three key components:

- AR (p): the autoregressive component, which models the current value as a function of its own past values (lags)
- I (d): the integrated component, the degree of differencing applied to make the series stationary
- MA (q): the moving average component, which models the current value as a function of past forecast errors

In practice, ARIMA models tend to perform better at short-term forecasting, and may not perform as well over longer horizons.
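As a minimal sketch, these three components map directly to the order=(p, d, q) parameter of the ARIMA class in statsmodels (the same class used later in this lesson). The random-walk series here is hypothetical, for illustration only:

import numpy as np
from pandas import Series
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical random-walk series (illustration only):
np.random.seed(0)
y_demo = Series(np.cumsum(np.random.normal(size=100)))

# order=(p, d, q): p = AR lags, d = degree of differencing, q = MA lags.
# ARIMA(2, 1, 1) uses two autoregressive lags, first differencing (d=1),
# and one moving-average lag:
model = ARIMA(y_demo, order=(2, 1, 1))
results = model.fit()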

Assumption of Stationarity

Remember, ARMA models require the data to be stationary: the mean, variance, and autocorrelation structure should remain fairly constant over time.

For instance, while stock prices are generally non-stationary, ARIMA models can still be used by transforming the data to achieve stationarity. This is done through differencing, which is the “Integrated” (I) component of ARIMA. Stock returns (or the percentage change from the previous period) are typically more stationary and suitable for modeling.
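For a minimal sketch of this transformation, using a small hypothetical price series:

from pandas import Series

prices = Series([100.0, 102.0, 101.0, 105.0, 107.0]) # hypothetical stock prices

diffs = prices.diff()          # first differences (the "I" in ARIMA, with d=1)
returns = prices.pct_change()  # percentage change from the previous period
print(returns)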

Examples

Data Source

These examples of autoregressive models are based on material by Prof. Ram Yamarthy.

Example 1 - Baseball Teams

Data Loading

Let’s consider our previous dataset of baseball team performance, which we saw exhibits some positive autocorrelation over the first two lagging periods:

from pandas import read_excel, to_datetime

repo_url = "https://github.com/prof-rossetti/python-for-finance"
file_url = f"{repo_url}/raw/refs/heads/main/docs/data/baseball_data.xlsx"

# Load one team's annual performance data from the remote Excel file:
df = read_excel(file_url, sheet_name="ny_yankees")

# Use the year as a datetime index:
df.index = to_datetime(df["Year"], format="%Y")
df.drop(columns=["Year"], inplace=True)
df
Tm Lg G W L Ties W-L%
Year
2020-01-01 New York Yankees AL East 60 33 27 0 0.550
2019-01-01 New York Yankees AL East 162 103 59 0 0.636
2018-01-01 New York Yankees AL East 162 100 62 0 0.617
... ... ... ... ... ... ... ...
1905-01-01 New York Highlanders AL 152 71 78 3 0.477
1904-01-01 New York Highlanders AL 155 92 59 4 0.609
1903-01-01 New York Highlanders AL 136 72 62 2 0.537

118 rows × 7 columns

Data Exploration

Sorting data:

df.sort_values(by="Year", ascending=True, inplace=True)

y = df["W-L%"]
print(y.shape)
(118,)

Plotting the time series data:

import plotly.express as px

px.line(x=y.index, y=y, height=450,
    title="Baseball Team (NYY) Annual Win Percentages",
    labels={"x": "Year", "y": "Win Percentage"},
)

Stationarity

Check for stationarity:

from statsmodels.tsa.stattools import adfuller

# Perform the Augmented Dickey-Fuller test for stationarity
result = adfuller(y)
print(f'ADF Statistic: {result[0]}')
print(f'P-value: {result[1]}')

# If p-value > 0.05, the series is not stationary, and differencing is required
ADF Statistic: -5.26693767337364
P-value: 6.40855693131697e-06
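
Here the p-value is far below 0.05, so we can treat the series as stationary and model it without differencing (d = 0). As a minimal sketch, this decision rule can be expressed in code, reusing the result tuple from above:

# Reject the unit-root null hypothesis at the 5% significance level:
if result[1] <= 0.05:
    print("Series appears stationary (no differencing required, d=0).")
else:
    print("Series appears non-stationary (consider differencing, d>=1).")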
Autocorrelation

Examining autocorrelation over ten lagging periods:

from statsmodels.tsa.stattools import acf

n_lags = 10
acf_results = acf(y, nlags=n_lags, fft=True, missing="drop")
print(acf_results)
[1.         0.61148404 0.34890418 0.32078676 0.3459768  0.29498255
 0.2173049  0.18496153 0.11479746 0.07436017 0.06523038]

Plotting the autocorrelation results:

import plotly.express as px

fig = px.line(x=list(range(n_lags + 1)), y=acf_results, markers=True, height=400,
        title="Auto-correlation of Annual Baseball Performance (NYY)",
        labels={"x": "Number of Lags", "y": "Auto-correlation"},
)
fig.show()

We see that moderately high autocorrelation persists for the first two to four lagging periods.
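
As an aside, statsmodels also provides a built-in autocorrelation plot with shaded confidence bands; here is a minimal sketch (note it uses matplotlib rather than plotly):

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Bars extending beyond the shaded band are statistically significant:
plot_acf(y, lags=10)
plt.show()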

Train/Test Split

Splitting the data sequentially (no shuffling), reserving the final 10% of periods for testing:

def sequential_split(y, test_size=0.2):
    cutoff = round(len(y) * (1 - test_size))
    y_train = y.iloc[:cutoff] # all periods before the cutoff
    y_test = y.iloc[cutoff:] # all periods after the cutoff
    return y_train, y_test

y_train, y_test = sequential_split(y, test_size=0.1)
print("Y TRAIN:", y_train.shape)
print("Y TEST:", y_test.shape)
Y TRAIN: (106,)
Y TEST: (12,)

Model Training

To implement an autoregressive model in Python, we can use the ARIMA class from statsmodels, passing an order parameter of (p, d, q). Setting the differencing and moving average terms to zero, as below, yields a pure autoregressive AR(p) model.

from statsmodels.tsa.arima.model import ARIMA

n_periods = 2 # based on earlier autocorrelation analysis
model = ARIMA(y_train, order=(n_periods, 0, 0))
print(type(model))

results = model.fit()
print(type(results))

print(results.summary())
<class 'statsmodels.tsa.arima.model.ARIMA'>
<class 'statsmodels.tsa.arima.model.ARIMAResultsWrapper'>
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                   W-L%   No. Observations:                  106
Model:                 ARIMA(2, 0, 0)   Log Likelihood                 145.945
Date:                Mon, 04 Nov 2024   AIC                           -283.890
Time:                        01:37:55   BIC                           -273.236
Sample:                    01-01-1903   HQIC                          -279.572
                         - 01-01-2008                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5683      0.016     36.149      0.000       0.538       0.599
ar.L1          0.6438      0.091      7.053      0.000       0.465       0.823
ar.L2         -0.0446      0.090     -0.494      0.621      -0.221       0.132
sigma2         0.0037      0.001      7.108      0.000       0.003       0.005
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):                 5.07
Prob(Q):                              0.89   Prob(JB):                         0.08
Heteroskedasticity (H):               0.39   Skew:                            -0.51
Prob(H) (two-sided):                  0.01   Kurtosis:                         3.34
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Note in the summary that the first lag coefficient (ar.L1) is statistically significant (P < 0.001), while the second lag coefficient (ar.L2) is not.

Reconstruct the training set with predictions:

train_set = y_train.copy().to_frame()
train_set["Predicted"] = results.fittedvalues
train_set["Error"] = results.resid # residuals (actual minus predicted)
train_set
W-L% Predicted Error
Year
1903-01-01 0.537 0.568338 -0.031338
1904-01-01 0.609 0.549023 0.059977
1905-01-01 0.477 0.595914 -0.118914
... ... ... ...
2006-01-01 0.599 0.577273 0.021727
2007-01-01 0.580 0.587292 -0.007292
2008-01-01 0.549 0.574480 -0.025480

106 rows × 3 columns

Training metrics:

from sklearn.metrics import r2_score

r2_score(train_set["W-L%"], train_set["Predicted"])
0.3860394111991472
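
An R² of roughly 0.39 means the model explains about 39% of the variance in the training set. For a complementary error metric, here is a minimal sketch using mean absolute error, reusing the train_set frame from above:

from sklearn.metrics import mean_absolute_error

# Average absolute prediction error, in win-percentage points:
mae = mean_absolute_error(train_set["W-L%"], train_set["Predicted"])
print(f"MAE: {mae:.4f}")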

Plotting predictions during the training period:

px.line(train_set, y=["W-L%", "Predicted"], height=350,
    title="Baseball Team (NYY) Performance vs ARMA Predictions (Training Set)",
    labels={"value":""}
)

Evaluation

Reconstructing the test set with predictions:

start = y_test.index[0]
end = y_test.index[-1]
start, end
(Timestamp('2009-01-01 00:00:00'), Timestamp('2020-01-01 00:00:00'))
y_pred = results.predict(start=start, end=end)
print(y_pred.shape)
(12,)
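As an aside, since the test period immediately follows the training period, the forecast method offers an equivalent approach, specifying a number of steps beyond the training sample rather than date bounds. A minimal sketch, reusing the fitted results object:

# Equivalent out-of-sample forecast, by number of steps rather than by date:
y_forecast = results.forecast(steps=len(y_test))
print(y_forecast.shape)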
test_set = y_test.copy().to_frame()
test_set["Predicted"] = y_pred
test_set["Error"] = test_set["W-L%"] - test_set["Predicted"] # actual minus predicted, consistent with the training set
test_set.head()
W-L% Predicted Error
Year
2009-01-01 0.636 0.555368 0.080632
2010-01-01 0.586 0.560850 0.025150
2011-01-01 0.599 0.564095 0.034905
2012-01-01 0.586 0.565940 0.020060
2013-01-01 0.525 0.566983 -0.041983

Testing metrics:

r2_score(test_set["W-L%"], test_set["Predicted"])
-0.12432303358156238

Not so good. A negative R² means the model performs worse on the test set than a naive constant prediction of the mean.


Plotting predictions during the entire period:

from pandas import concat

df_pred = concat([train_set, test_set])
df_pred
W-L% Predicted Error
Year
1903-01-01 0.537 0.568338 -0.031338
1904-01-01 0.609 0.549023 0.059977
1905-01-01 0.477 0.595914 -0.118914
... ... ... ...
2018-01-01 0.617 0.568260 0.048740
2019-01-01 0.636 0.568294 0.067706
2020-01-01 0.550 0.568313 -0.018313

118 rows × 3 columns

px.line(df_pred, y=["W-L%", "Predicted"], height=350,
    title="Baseball Team (NYY) Performance vs ARMA Predictions",
    labels={"value":""}
)

We see the model's predictions quickly stabilize around the long-run mean after two years into the test period, corresponding to the two lagging periods chosen.

Experimenting with different order parameter values may yield different results.
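
For example, here is a minimal sketch of comparing a few candidate orders by AIC (lower is better); this candidate grid is an arbitrary choice for illustration:

from statsmodels.tsa.arima.model import ARIMA

# Fit several candidate models on the training set and compare their AIC scores:
for p in [1, 2, 3]:
    for q in [0, 1]:
        candidate = ARIMA(y_train, order=(p, 0, q)).fit()
        print(f"ARIMA({p},0,{q}) AIC: {candidate.aic:.2f}")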

Example 2 - GDP Growth

TBA