Regression for Time Series Forecasting (with `sklearn`)

Let’s explore an example of how to use regression to perform trend analysis with time series data.

Data Loading

As an example time series dataset, let’s consider this dataset of U.S. population over time, from the Federal Reserve Economic Data (FRED).

Fetching the data, going back as far as possible:

from pandas_datareader import get_data_fred
from datetime import datetime

DATASET_NAME = "POPTHM"
df = get_data_fred(DATASET_NAME, start=datetime(1900,1,1))
print(len(df))
df

	POPTHM
DATE
1959-01-01	175818.0
1959-02-01	176044.0
1959-03-01	176274.0
...	...
2024-05-01	336687.0
2024-06-01	336839.0
2024-07-01	337005.0

787 rows × 1 columns

Data Source

Here is some more information about the “POPTHM” dataset:

“Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month.” The data is expressed in “Thousands”, and is “Not Seasonally Adjusted”.

Wrangling the data, including renaming columns and converting the date index to be datetime-aware, may make it easier for us to work with this data:

from pandas import to_datetime

df.rename(columns={DATASET_NAME: "population"}, inplace=True)
df.index.name = "date"
df.index = to_datetime(df.index)
df

	population
date
1959-01-01	175818.0
1959-02-01	176044.0
1959-03-01	176274.0
...	...
2024-05-01	336687.0
2024-06-01	336839.0
2024-07-01	337005.0

787 rows × 1 columns

Data Exploration

Exploring trends:

import plotly.express as px

px.scatter(df, y="population", title="US Population (Monthly) vs Trend",
            labels={"population":"US Population (thousands)", "value":""},
            trendline="ols", trendline_color_override="red", height=350,
)

Looks like a possible linear trend. Let’s perform a more formal regression analysis.

Data Encoding

Because we need numeric features to perform a regression, we convert the dates to a linear time step of integers (after sorting the data first for good measure):

df.sort_values(by="date", ascending=True, inplace=True)

df["time_step"] = range(1, len(df) + 1)
df

	population	time_step
date
1959-01-01	175818.0	1
1959-02-01	176044.0	2
1959-03-01	176274.0	3
...	...	...
2024-05-01	336687.0	785
2024-06-01	336839.0	786
2024-07-01	337005.0	787

787 rows × 2 columns

We will use the numeric time step as our input variable (x), to predict the population (y).

Data Splitting

X/Y Split

Identifying dependent and independent variables:

#x = df[["date"]] # we need numbers not strings
x = df[["time_step"]]

y = df["population"]

print("X:", x.shape)
print("Y:", y.shape)

X: (787, 1)
Y: (787,)

Train/Test Split

Splitting into training vs testing datasets:

#from sklearn.model_selection import train_test_split
#
#x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
#print("TRAIN:", x_train.shape, y_train.shape)
#print("TEST:", x_test.shape, y_test.shape)

Splitting data sequentially where earlier data is used in training and recent data is use for testing:

print(len(df))

training_size = round(len(df) * .8)
print(training_size)

x_train = x.iloc[:training_size] # slice all before
y_train = y.iloc[:training_size] # slice all before

x_test = x.iloc[training_size:] # slice all after
y_test = y.iloc[training_size:] # slice all after
print("TRAIN:", x_train.shape)
print("TEST:", x_test.shape)

787
630
TRAIN: (630, 1)
TEST: (157, 1)

Model Selection and Training

Training a linear regression model on the training data:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(x_train, y_train)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

After training, we have access to the learned weights, as well as the line of best fit (i.e. the trend line):

print("COEFS:", model.coef_)
print("INTERCEPT:", model.intercept_)
print("------------------")
print(f"y = {model.coef_[0].round(3)}x + {model.intercept_.round(3)}")

COEFS: [212.81417727]
INTERCEPT: 174589.6302470538
------------------
y = 212.814x + 174589.63

In this case, we interpret the line of best fit to observe how much the population is expected to grow on average per time step, as well as the population trend value at the earliest time step.

Note

Remember in this dataset the population is expressed in thousands.

Model Prediction and Evaluation

We use the trained model to make predictions on the test set, and then calculate regression metrics to see how well the model is doing:

from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse, 4))

r2 = r2_score(y_test, y_pred)
print("R2:", round(r2, 4))

MSE: 9038018.7307
R2: 0.8216

A very high r-squared value represents a strong linear trend.

Plotting predictions (trend line):

#df_preds = df.copy()
#df_preds["prediction"] = model.predict(df_preds[["time_step"]])
#df_preds["error"] = df_preds["population"] - df_preds["prediction"]
#
#px.line(df_preds, y=["population", "prediction"], height=350,
#    title="US Population (Monthly) vs Regression Predictions (Trend)",
#    labels={"value":""}
#)

df["prediction"] = model.predict(df[["time_step"]])
df["error"] = df["population"] - df["prediction"]

px.line(df, y=["population", "prediction"], height=350,
    title="US Population (Monthly) vs Regression Predictions (Trend)",
    labels={"value":""}
)

Forecasting

Assembling a dataset of future dates and time steps (which we can use as inputs to the trained model to make predictions about the future):

from pandas import date_range, DateOffset, DataFrame

last_time_step = df['time_step'].iloc[-1]
last_date = df.index[-1]
next_date = last_date + DateOffset(months=1)

FUTURE_MONTHS = 36
# frequency of "M" for end of month, "MS" for beginning of month
future_dates = date_range(start=next_date, periods=FUTURE_MONTHS, freq='MS')
future_time_steps = range(last_time_step + 1, last_time_step + FUTURE_MONTHS + 1)

df_future = DataFrame({'time_step': future_time_steps}, index=future_dates)
df_future.index.name = "date"
df_future

	time_step
date
2024-08-01	788
2024-09-01	789
2024-10-01	790
...	...
2027-05-01	821
2027-06-01	822
2027-07-01	823

36 rows × 1 columns

Predicting future values:

df_future["prediction"] = model.predict(df_future[["time_step"]])
df_future

	time_step	prediction
date
2024-08-01	788	342287.201933
2024-09-01	789	342500.016110
2024-10-01	790	342712.830288
...	...	...
2027-05-01	821	349310.069783
2027-06-01	822	349522.883960
2027-07-01	823	349735.698137

36 rows × 2 columns

Concatenating historical data with future data:

from pandas import concat

chart_df = concat([df, df_future])
chart_df

	population	time_step	prediction	error
date
1959-01-01	175818.0	1	174802.444424	1015.555576
1959-02-01	176044.0	2	175015.258602	1028.741398
1959-03-01	176274.0	3	175228.072779	1045.927221
...	...	...	...	...
2027-05-01	NaN	821	349310.069783	NaN
2027-06-01	NaN	822	349522.883960	NaN
2027-07-01	NaN	823	349735.698137	NaN

823 rows × 4 columns

Note

The population and error values for future dates are null, because we don’t know them yet. Although we are able to make predictions about these values, based on historical trends.

Plotting trend vs actual, with future predictions:

px.line(chart_df[-180:], y=["population", "prediction"], height=350,
    title="US Population (Monthly) vs Regression Predictions (Trend)",
    labels={"value":""}
)