Auto-Regressive Integrated Moving Average (ARIMA) is a “method for forecasting or predicting future outcomes based on a historical time series. It is based on the statistical concept of serial correlation, where past data points influence future data points.” - Source: Investopedia
An ARIMA model has three key components:
Auto-Regressive (AR) part: involves regressing the current value of the series against its past values (lags). The idea is that past observations have an influence on the current value.
Integrated (I) part: refers to the differencing of observations to make the time series stationary (i.e. to remove trends or seasonality). A stationary time series has constant mean and variance over time.
Moving Average (MA) part: involves modeling the relationship between the current value of the series and past forecast errors (residuals). The model adjusts the forecast based on the error terms from previous periods.
In practice, ARIMA models tend to perform better at short-term forecasting than at long-term forecasting.
Assumption of Stationarity
Remember, ARMA models require the data to be stationary: the mean, variance, and autocorrelation should remain fairly constant over time.
For instance, while stock prices are generally non-stationary, ARIMA models can still be used by transforming the data to achieve stationarity. This is done through differencing, which is the “Integrated” (I) component of ARIMA. Stock returns (or the percentage change from the previous period) are typically more stationary and suitable for modeling.
Examples
Data Source
These examples of autoregressive models are based on material by Prof. Ram Yamarthy.
Example 1 - Baseball Teams
Data Loading
Let’s consider this previous dataset of baseball team performance, which, as we saw earlier, exhibits positive autocorrelation at two lagging periods:
```python
import plotly.express as px

px.line(x=y.index, y=y, height=450,
        title="Baseball Team (NYY) Annual Win Percentages",
        labels={"x": "Year", "y": "Win Percentage"},
)
```
Stationarity
Check for stationarity:
```python
from statsmodels.tsa.stattools import adfuller

# Perform the Augmented Dickey-Fuller test for stationarity
result = adfuller(y)
print(f"ADF Statistic: {result[0]}")
print(f"P-value: {result[1]}")

# If p-value > 0.05, the series is not stationary, and differencing is required
```
```python
from sklearn.metrics import r2_score

r2_score(train_set["W-L%"], train_set["Predicted"])
```
0.3860394111991472
Plotting predictions during the training period:
```python
px.line(train_set, y=["W-L%", "Predicted"], height=350,
        title="Baseball Team (NYY) Performance vs ARMA Predictions (Training Set)",
        labels={"value": ""}
)
```
Evaluation
Reconstructing test set with predictions for the test period:
```python
start = y_test.index[0]
end = y_test.index[-1]
start, end
```
```python
#px.line(test_set, y=["W-L%", "Predicted"], height=350,
#        title="Baseball Team (NYY) Performance vs ARMA Predictions (Test Set)",
#        labels={"value": ""}
#)
```
Plotting predictions during the entire period:
```python
from pandas import concat

df_pred = concat([train_set, test_set])
df_pred
```
| Year       | W-L%  | Predicted | Error     |
|------------|-------|-----------|-----------|
| 1903-01-01 | 0.537 | 0.568338  | -0.031338 |
| 1904-01-01 | 0.609 | 0.549023  | 0.059977  |
| 1905-01-01 | 0.477 | 0.595914  | -0.118914 |
| ...        | ...   | ...       | ...       |
| 2018-01-01 | 0.617 | 0.568260  | -0.048740 |
| 2019-01-01 | 0.636 | 0.568294  | -0.067706 |
| 2020-01-01 | 0.550 | 0.568313  | 0.018313  |

118 rows × 3 columns
```python
px.line(df_pred, y=["W-L%", "Predicted"], height=350,
        title="Baseball Team (NYY) Performance vs ARMA Predictions",
        labels={"value": ""}
)
```
We see the model's predictions quickly stabilize two years into the test period, corresponding to the number of lagging periods chosen.
Experimenting with different order parameter values may yield different results.