11  Regression for Seasonality Analysis

We’ve explored using regression for time series forecasting, but what if there are seasonal or cyclical patterns in the data?

Let’s explore an example of how to use regression to identify cyclical patterns and perform seasonality analysis with time series data.

11.1 Data Loading

For a time series dataset that exemplifies cyclical patterns, let’s consider this dataset of U.S. employment over time, from the Federal Reserve Economic Data (FRED).

Fetching the data, going back as far as possible:

from pandas_datareader import get_data_fred
from datetime import datetime

DATASET_NAME = "PAYNSA"
df = get_data_fred(DATASET_NAME, start=datetime(1900,1,1))
print(len(df))
df
1047
PAYNSA
DATE
1939-01-01 29296
1939-02-01 29394
1939-03-01 29804
... ...
2026-01-01 156728
2026-02-01 157204
2026-03-01 157775

1047 rows × 1 columns

Tip: Data Source

Here is some more information about the “PAYNSA” dataset:

“All Employees: Total Nonfarm, commonly known as Total Nonfarm Payroll, is a measure of the number of U.S. workers in the economy that excludes proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed.”

“Generally, the U.S. labor force and levels of employment and unemployment are subject to fluctuations due to seasonal changes in weather, major holidays, and the opening and closing of schools.”

“The Bureau of Labor Statistics (BLS) adjusts the data to offset the seasonal effects to show non-seasonal changes: for example, women’s participation in the labor force; or a general decline in the number of employees, a possible indication of a downturn in the economy.

“To closely examine seasonal and non-seasonal changes, the BLS releases two monthly statistical measures: the seasonally adjusted All Employees: Total Nonfarm (PAYEMS) and All Employees: Total Nonfarm (PAYNSA), which is not seasonally adjusted.”

This “PAYNSA” data is expressed in “Thousands of Persons”, and is “Not Seasonally Adjusted”.

The dataset frequency is “Monthly”.

Wrangling the data, including renaming columns and converting the date index to be datetime-aware, will make it easier to work with:

from pandas import to_datetime

df.rename(columns={DATASET_NAME: "employment"}, inplace=True)
df.index.name = "date"
df.index = to_datetime(df.index)
df
employment
date
1939-01-01 29296
1939-02-01 29394
1939-03-01 29804
... ...
2026-01-01 156728
2026-02-01 157204
2026-03-01 157775

1047 rows × 1 columns
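As a quick sanity check after wrangling, we can confirm the index is datetime-aware and that pandas infers a monthly frequency. This minimal sketch uses a tiny stand-in frame (the values and month-start index mimic the real dataset):

```python
from pandas import DataFrame, date_range, infer_freq

# a tiny stand-in for the FRED data, with a month-start DatetimeIndex:
demo = DataFrame(
    {"employment": [29296, 29394, 29804]},
    index=date_range("1939-01-01", periods=3, freq="MS"),
)
demo.index.name = "date"

print(demo.index.dtype)        # datetime64[ns]
print(infer_freq(demo.index))  # MS (i.e. monthly, at the start of the month)
```

The same two checks on our real `df.index` should produce the same results.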

11.2 Data Exploration

Visualizing the data:

import plotly.express as px

px.line(df, y="employment", height=450,
        title="US Employment by month (non-seasonally adjusted)",
        labels={"employment": "Employment (in thousands of persons)"},
)

Cyclical Patterns

Exploring cyclical patterns in the data:

px.line(df[(df.index.year >= 1970) & (df.index.year <= 1980)], y="employment",
        title="US Employment by month (selected years)", height=450,
        labels={"employment": "Employment (in thousands)"},
)
Tip: Interactive dataviz

Hover over the dataviz to see which month(s) typically have higher employment, and which month(s) typically have lower employment.

Trend Analysis

Exploring trends:

import plotly.express as px

px.scatter(df, y="employment",  height=450,
        title="US Employment by month (vs Trend)",
        labels={"employment": "Employment (in thousands)"},
        trendline="ols", trendline_color_override="red"
)

Looks like evidence of a possible linear relationship. Let’s perform a more formal regression analysis.

11.3 Data Encoding

Because we need numeric features to perform a regression, we convert the dates to a linear time step of integers (after sorting the data first for good measure):

df.sort_values(by="date", ascending=True, inplace=True)

df["time_step"] = range(1, len(df) + 1)
df
employment time_step
date
1939-01-01 29296 1
1939-02-01 29394 2
1939-03-01 29804 3
... ... ...
2026-01-01 156728 1045
2026-02-01 157204 1046
2026-03-01 157775 1047

1047 rows × 2 columns

We will use the numeric time step as our input variable (x), to predict the employment (y).

11.4 Data Splitting

X/Y Split

Identifying dependent and independent variables:

x = df[["time_step"]]

y = df["employment"]

print("X:", x.shape)
print("Y:", y.shape)
X: (1047, 1)
Y: (1047,)

Adding Constants

We are going to use statsmodels, so we add a column of constant ones representing the intercept:

import statsmodels.api as sm

# adding in a column of constants, as per the OLS docs
x = sm.add_constant(x)
x.head()
const time_step
date
1939-01-01 1.0 1
1939-02-01 1.0 2
1939-03-01 1.0 3
1939-04-01 1.0 4
1939-05-01 1.0 5

Train/Test Split

Splitting into training vs testing datasets:

#from sklearn.model_selection import train_test_split
#
#x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
#print("TRAIN:", x_train.shape, y_train.shape)
#print("TEST:", x_test.shape, y_test.shape)

Splitting the data sequentially, where earlier data is used for training and more recent data is used for testing:

#print(len(df))
#
#training_size = round(len(df) * .8)
#print(training_size)
#
#x_train = x.iloc[:training_size] # slice all before
#y_train = y.iloc[:training_size] # slice all before
#
#x_test = x.iloc[training_size:] # slice all after
#y_test = y.iloc[training_size:] # slice all after
#print("TRAIN:", x_train.shape)
#print("TEST:", x_test.shape)

For this example, we will not split the data, so we can illustrate predictions over the entire time period.
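The sequential split logic in the commented code above can be sketched on a toy series (a minimal sketch; the list below stands in for our `x` and `y`):

```python
# a toy stand-in for our monthly observations:
observations = list(range(10))

# use the earliest 80% for training, the most recent 20% for testing:
training_size = round(len(observations) * 0.8)
train = observations[:training_size]
test = observations[training_size:]

print(len(train), len(test))  # 8 2
print(test)                   # [8, 9] -- the most recent points
```

Unlike a random split, this preserves temporal order, so the model is always evaluated on data from after the training period.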

11.5 Model Selection and Training

Training a linear regression model on the full dataset:

import statsmodels.api as sm

model = sm.OLS(y, x, missing="drop")
print(type(model))

results = model.fit()
print(type(results))
<class 'statsmodels.regression.linear_model.OLS'>
<class 'statsmodels.regression.linear_model.RegressionResultsWrapper'>

Examining training results:

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             employment   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                 6.586e+04
Date:                Tue, 14 Apr 2026   Prob (F-statistic):               0.00
Time:                        23:11:22   Log-Likelihood:                -10370.
No. Observations:                1047   AIC:                         2.074e+04
Df Residuals:                    1045   BIC:                         2.075e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.692e+04    299.862     89.780      0.000    2.63e+04    2.75e+04
time_step    127.2107      0.496    256.625      0.000     126.238     128.183
==============================================================================
Omnibus:                        5.594   Durbin-Watson:                   0.048
Prob(Omnibus):                  0.061   Jarque-Bera (JB):                5.337
Skew:                           0.136   Prob(JB):                       0.0694
Kurtosis:                       2.781   Cond. No.                     1.21e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.21e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
print(results.params)
print("------------")
print(f"y = {results.params['time_step'].round(3)}x + {results.params['const'].round(3)}")
const        26921.593996
time_step      127.210700
dtype: float64
------------
y = 127.211x + 26921.594
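With the fitted coefficients, we can sketch what the trend line alone would project for the next time step. This is a hypothetical trend-only extrapolation (time step 1048 would be the month after the last observation), ignoring any seasonal component:

```python
# slope and intercept from the fitted line above:
m = 127.2107
b = 26921.594

next_step = 1048  # one step past the end of the observed data
projected = m * next_step + b
print(round(projected))  # 160238 (thousands of persons), trend only
```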
df["prediction"] = results.fittedvalues
df["residual"] = results.resid
#from pandas import DataFrame
#
## get all rows from the original dataset that wound up in the training set:
#training_set = df.loc[x_train.index].copy()
#print(len(training_set))
#
## create a dataset for the predictions and the residuals:
#training_preds = DataFrame({
#    "prediction": results.fittedvalues,
#    "residual": results.resid
#})
## merge the training set with the results:
#training_set = training_set.merge(training_preds,
#    how="inner", left_index=True, right_index=True
#)
#
## calculate error for each datapoint:
#training_set

Regression Trends

Plotting trend line:

px.line(df, y=["employment", "prediction"], height=350,
    title="US Employment (monthly) vs linear trend",
    labels={"value":""}
)

Regression Residuals

Removing the trend, plotting just the residuals:

px.line(df, y="residual",
    title="US Employment (monthly) vs linear trend residuals", height=350
)

There seem to be some periodic movements in the residuals.

11.5.0.1 Seasonality via Means of Periodic Residuals

We can observe cyclical patterns in the residuals by calculating periodic means:

df["year"] = df.index.year
df["quarter"] = df.index.quarter
df["month"] = df.index.month

Here we are grouping the data by quarter and calculating the average residual. This shows us, for each quarter, whether employment tends to be above or below the trend on average:

df.groupby("quarter")["residual"].mean()
quarter
1   -1012.694137
2     157.815054
3      58.140807
4     808.378438
Name: residual, dtype: float64
df.groupby("month")["residual"].mean()
month
1    -1296.551618
2    -1076.410046
3     -665.120746
4     -335.353556
5      158.700111
6      650.098606
7     -174.767266
8      -39.426242
9      388.615931
10     730.232817
11     832.229013
12     862.673485
Name: residual, dtype: float64

11.5.0.2 Seasonality via Regression on Periodic Residuals

Let’s perform a regression using months as the features and the trend residuals as the target. This can help us understand the degree to which employment will be over or under trend for a given month.

# https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
# "one hot encode" the monthly values:
from pandas import get_dummies as one_hot_encode

x_monthly = one_hot_encode(df["month"])
x_monthly.columns=["Jan", "Feb", "Mar", "Apr",
                "May", "Jun", "Jul", "Aug",
                "Sep", "Oct", "Nov", "Dec"]
x_monthly = x_monthly.astype(int)
x_monthly
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
date
1939-01-01 1 0 0 0 0 0 0 0 0 0 0 0
1939-02-01 0 1 0 0 0 0 0 0 0 0 0 0
1939-03-01 0 0 1 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
2026-01-01 1 0 0 0 0 0 0 0 0 0 0 0
2026-02-01 0 1 0 0 0 0 0 0 0 0 0 0
2026-03-01 0 0 1 0 0 0 0 0 0 0 0 0

1047 rows × 12 columns
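Note that we will not add a constant column for this regression: with an intercept plus all twelve month dummies, the design matrix would be perfectly collinear (the “dummy variable trap”), because the dummy columns always sum to a column of ones. A toy sketch with just two hypothetical “months”:

```python
# each row has exactly one dummy set to 1, so the dummy columns
# sum to a column of ones -- the same as an intercept column:
const = [1, 1, 1, 1]
month_a = [1, 0, 1, 0]
month_b = [0, 1, 0, 1]

dummy_sums = [a + b for a, b in zip(month_a, month_b)]
print(dummy_sums == const)  # True -- dummies sum to the intercept column
```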

y_monthly = df["residual"]

ols_monthly = sm.OLS(y_monthly, x_monthly)
print(type(ols_monthly))

results_monthly = ols_monthly.fit()
print(type(results_monthly))

print(results_monthly.summary())
<class 'statsmodels.regression.linear_model.OLS'>
<class 'statsmodels.regression.linear_model.RegressionResultsWrapper'>
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               residual   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     2.055
Date:                Tue, 14 Apr 2026   Prob (F-statistic):             0.0211
Time:                        23:11:22   Log-Likelihood:                -10358.
No. Observations:                1047   AIC:                         2.074e+04
Df Residuals:                    1035   BIC:                         2.080e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Jan        -1296.5516    513.699     -2.524      0.012   -2304.563    -288.541
Feb        -1076.4100    513.699     -2.095      0.036   -2084.421     -68.399
Mar         -665.1207    513.699     -1.295      0.196   -1673.132     342.890
Apr         -335.3536    516.643     -0.649      0.516   -1349.141     678.434
May          158.7001    516.643      0.307      0.759    -855.087    1172.488
Jun          650.0986    516.643      1.258      0.209    -363.689    1663.886
Jul         -174.7673    516.643     -0.338      0.735   -1188.555     839.020
Aug          -39.4262    516.643     -0.076      0.939   -1053.214     974.361
Sep          388.6159    516.643      0.752      0.452    -625.172    1402.403
Oct          730.2328    516.643      1.413      0.158    -283.555    1744.020
Nov          832.2290    516.643      1.611      0.108    -181.559    1846.017
Dec          862.6735    516.643      1.670      0.095    -151.114    1876.461
==============================================================================
Omnibus:                        6.224   Durbin-Watson:                   0.025
Prob(Omnibus):                  0.045   Jarque-Bera (JB):                5.851
Skew:                           0.140   Prob(JB):                       0.0536
Kurtosis:                       2.765   Cond. No.                         1.01
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficients tell us how each month contributes to the regression residuals. In other words, for each month, to what degree does the model predict we will be above or below trend?
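In fact, with a full set of one-hot dummies and no intercept, each OLS coefficient is exactly that month’s mean residual, which is why the coefficients above match the monthly means computed earlier (e.g. January: -1296.55). A minimal check using numpy’s least squares, with two hypothetical groups of two observations each:

```python
import numpy as np

# two "months", two observations each, one-hot encoded with no intercept:
X = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])
y = np.array([-10.0, -6.0, 5.0, 9.0])

# least-squares coefficients equal the per-group means:
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # [-8.  7.]
```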

Monthly Predictions of Residuals

df["prediction_monthly"] = results_monthly.fittedvalues
df["residual_monthly"] = results_monthly.resid

Decomposition of the original data into trend, seasonal component, and residuals:

px.line(df, y=["employment", "prediction"], title="Employment vs trend", height=350)
px.line(df, y="prediction_monthly", title="Employment seasonal component", height=350)
px.line(df, y="residual_monthly", title="Employment de-trended residual", height=350)
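Because the trend residual equals employment minus the trend prediction, and the monthly model splits that residual into `prediction_monthly` plus `residual_monthly`, the three plotted components add back up to the original series. A toy check of this additivity, using hypothetical numbers:

```python
# one hypothetical observation:
employment = 150_000.0
trend = 148_500.0                   # fitted trend value
residual = employment - trend       # 1500.0, what the trend missed

seasonal = -1_200.0                 # monthly model's share of the residual
leftover = residual - seasonal      # 2700.0, what remains unexplained

print(trend + seasonal + leftover)  # 150000.0 -- reconstructs employment
```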