Data Loading

Regression with Polynomial Features for Time Series Forecasting

To illustrate an example of a non-linear trend in time series data, let’s consider this dataset of U.S. GDP over time, from the Federal Reserve Economic Data (FRED).

Fetching the data, going back as far as possible:

from pandas_datareader import get_data_fred

df = get_data_fred("GDP", start="1900-01-01")
df.index.name = "date"
df.rename(columns={"GDP": "gdp"}, inplace=True)
df.head()
gdp
date
1947-01-01 243.164
1947-04-01 245.968
1947-07-01 249.585
1947-10-01 259.745
1948-01-01 265.742
Data Source

Here is some more information about the “GDP” dataset:

“Gross domestic product (GDP), the featured measure of U.S. output, is the market value of the goods and services produced by labor and property located in the United States.

The data is expressed in “Billions of Dollars”, and is a “Seasonally Adjusted Annual Rate”.

The dataset frequency is “Quarterly”.

Data Exploration

Plotting the data over time with a linear trendline to examine a possible linear relationship:

import plotly.express as px

px.scatter(df, y="gdp", title="US GDP (Quarterly) vs Linear Trend", height=450,
           labels={"gdp": "GDP (in billions of USD)"},
            trendline="ols", trendline_color_override="red"
)

Linear trend might not be the best fit.

Plotting the data over time with a Lowess trendline to examine a possible non-linear relationship:

import plotly.express as px

px.scatter(df, y="gdp", title="US GDP (Quarterly) vs Lowess Trend", height=450,
           labels={"gdp": "GDP (in billions of USD)"},
            trendline="lowess", trendline_color_override="red"
)

In this case, a non-linear trend seems to fit better.

To compare the results of a linear vs non-linear trend, let’s train two different regression models, and compare the results.

Linear Regression

Sorting time series data:

from pandas import to_datetime

df.sort_values(by="date", ascending=True, inplace=True)
df["time_step"] = range(1, len(df)+1)
df.head()
gdp time_step
date
1947-01-01 243.164 1
1947-04-01 245.968 2
1947-07-01 249.585 3
1947-10-01 259.745 4
1948-01-01 265.742 5

Identifying labels and features (x/y split):

x = df[['time_step']]
y = df['gdp']
print(x.shape)
print(y.shape)
(311, 1)
(311,)

Train/Test Split

Test/train split for time-series data:

training_size = round(len(df) * .8)

x_train = x.iloc[:training_size] # all before cutoff
y_train = y.iloc[:training_size] # all before cutoff

x_test = x.iloc[training_size:] # all after cutoff
y_test = y.iloc[training_size:] # all after cutoff

print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (249, 1) (249,)
TEST: (62, 1) (62,)

Model Training

Training a linear regression model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Examining the coefficients and line of best fit:

print("COEF:", model.coef_.tolist())
print("INTERCEPT:", model.intercept_)
print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print("y =", f"{model.coef_[0].round(3)}(x)",
        "+", model.intercept_.round(3),
)
COEF: [55.54545814691021]
INTERCEPT: -2732.7543165565467
--------------
EQUATION FOR LINE OF BEST FIT:
y = 55.545(x) + -2732.754

Examining the training results:

from sklearn.metrics import mean_squared_error, r2_score

y_pred_train = model.predict(x_train)

r2_train = r2_score(y_train, y_pred_train)
print("R^2 (TRAINING):", r2_train)

mse_train = mean_squared_error(y_train, y_pred_train)
print("MSE (TRAINING):", mse_train)
R^2 (TRAINING): 0.8525779291832625
MSE (TRAINING): 2756360.207247715

A strong positive that the linear regression model explains about 85% of the variance in the GDP data during the training period. It suggests that the model fits the training data reasonably well.

These results are promising, however what we really care about is how the model generalizes to the test set.

Model Evaluation

Examining the test results:

y_pred = model.predict(x_test)

r2 = r2_score(y_test, y_pred)
print("R^2 (TEST):", r2.round(3))

mse = mean_squared_error(y_test, y_pred)
print("MSE (TEST):", mse.round(3))
R^2 (TEST): -2.546
MSE (TEST): 63461178.162
Results interpretation

A negative r-squared value suggests the model’s predictions are very poor, and even less accurate than using the average of the target variable as the prediction for all instances. Essentially, the linear regression model is providing predictions that are significantly off from the actual values, indicating a bad fit to the data.

It seems although the model performs well on the training set, it performs very poorly on future data it hasn’t seen yet, and doesn’t generalize beyond the training period.

Storing predictions back in the original dataset, to enable comparisons of predicted vs actual values:

df.loc[x_train.index, "y_pred_train"] = y_pred_train
df.loc[x_test.index, "y_pred_test"] = y_pred
df
gdp time_step y_pred_train y_pred_test
date
1947-01-01 243.164 1 -2677.208858 NaN
1947-04-01 245.968 2 -2621.663400 NaN
1947-07-01 249.585 3 -2566.117942 NaN
... ... ... ... ...
2024-01-01 28624.069 309 NaN 14430.792251
2024-04-01 29016.714 310 NaN 14486.337709
2024-07-01 29349.924 311 NaN 14541.883167

311 rows × 4 columns

Charting the predictions against the actual values:

#import plotly.express as px
#
#fig = px.line(df, y=['gdp', 'y_pred_train', 'y_pred_test'],
#              title='Linear Regression on GDP Ta andime Series Data',
#              labels={'value': 'GDP', 'date': 'Date'},
#)
## update legend:
#fig.update_traces(line=dict(color='blue'), name="Actual GDP", selector=dict(name='gdp'))
#fig.update_traces(line=dict(color='green'), name="Predicted GDP (Train)", selector=dict#(name='y_pred_train'))
#fig.update_traces(line=dict(color='red'), name="Predicted GDP (Test)", selector=dict#(name='y_pred_test'))
#
#fig.show()
Code
import plotly.express as px

def plot_predictions(df, title="Linear Regression on GDP Time Series Data"):
    fig = px.line(df, y=['gdp', 'y_pred_train', 'y_pred_test'],
        labels={'value': 'GDP', 'date': 'Date'}, title=title, height=450
    )

    # update legend:
    series = [
        ("gdp", "Actual GDP", "steelblue"),
        ("y_pred_train", "Predicted GDP (Train)", "green"),
        ("y_pred_test", "Predicted GDP (Test)", "red"),
    ]
    for colname, name, color in series:
        fig.update_traces(name=name,
            line=dict(color=color),
            selector=dict(name=colname)
    )

    fig.show()
plot_predictions(df)

As we see, the linear trend line is not a good fit.

Let’s see if a non-linear trend can do better.

Polynomial Features Regression

One way to help a linear regression model capture a non-linear trend is to train the model on polynomial features, instead of the original features.

By transforming the original features into higher-order terms, polynomial features allow the model to capture non-linear relationships, offering greater flexibility and improving the model’s ability to generalize to more complex patterns in the data.

In a linear equation, the highest power of the \(x\) term is one (a variable raised to the power of one is itself).

\[ y = mx + b \]

Equation 1: Linear equation.

In a polynomial equation, there can be higher powers for the leading terms. For example in a quadratic equation the highest power of the \(x\) term is two:

\[ y = ax^2 + bx + c \]

Equation 2: Quadratic equation.

Creating polynomial features, using the PolynomialFeatures class from sklearn. In this case, setting degree=2 creates terms for \(x^2\), \(𝑥\), and the intercept term:

from sklearn.preprocessing import PolynomialFeatures
from pandas import DataFrame

poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)

x_poly = DataFrame(x_poly, columns=["const", "time_step", "time_step_squared"])
x_poly.index = x.index
x_poly
const time_step time_step_squared
date
1947-01-01 1.0 1.0 1.0
1947-04-01 1.0 2.0 4.0
1947-07-01 1.0 3.0 9.0
... ... ... ...
2024-01-01 1.0 309.0 95481.0
2024-04-01 1.0 310.0 96100.0
2024-07-01 1.0 311.0 96721.0

311 rows × 3 columns

Splitting the new features using sequential split, in the same way as the original features:

x_train = x_poly.iloc[:training_size] # all before cutoff
x_test = x_poly.iloc[training_size:] # all after cutoff

print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (249, 3) (249,)
TEST: (62, 3) (62,)

Training a linear regression model on the polynomial features:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Examining the coefficients and line of best fit:

print("COEF:", model.coef_.round(3).tolist())
print("INTERCEPT:", model.intercept_.round(3))
print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print("y =", f"{model.coef_[2].round(3)}(x^2)",
        "+", f"{model.coef_[1].round(3)}(x)",
        "+", model.intercept_.round(3),
)
COEF: [0.0, -33.245, 0.355]
INTERCEPT: 981.654
--------------
EQUATION FOR LINE OF BEST FIT:
y = 0.355(x^2) + -33.245(x) + 981.654

Examining the training results:

from sklearn.metrics import mean_squared_error, r2_score

y_pred_train = model.predict(x_train)

r2_train = r2_score(y_train, y_pred_train)
print("R^2 (TRAINING):", r2_train.round(3))

mse_train = mean_squared_error(y_train, y_pred_train)
print("MSE (TRAINING):", mse_train.round(3))
R^2 (TRAINING): 0.997
MSE (TRAINING): 62691.652

Examining the test results:

from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(x_test)

r2 = r2_score(y_test, y_pred)
print("R^2:", r2.round(3))

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse.round(3))
R^2: 0.867
MSE: 2377750.298

This model does a lot better generalizing to the test data than the original model.

Storing predictions back in a copy of the original dataset, to enable comparisons of predicted vs actual values:

df_poly = df.copy()
df_poly.loc[x_train.index, "y_pred_train"] = y_pred_train
df_poly.loc[x_test.index, "y_pred_test"] = y_pred
title = "Linear Regression (with Polynomial Features) on GDP Time Series Data"
plot_predictions(df_poly, title=title)
Interactive dataviz

Zoom in on an area of the graph to examine predictions during a specific time range.

Here we see although the trend isn’t perfect, the model trained on polynomial features captures the data much better than the basic linear trend produced by training a model on the original features.