Linear Regression with statsmodels

Data Loading

Loading the data:

from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/grades.csv"
df = read_csv(request_url)
df.head()
Name StudyHours Grade
0 Arun 10.00 50.0
1 Sofia 11.50 50.0
2 Hassan 9.00 47.0
3 Zara 16.00 97.0
4 Liam 9.25 49.0

Data Exploration

Checking for null values:

df["StudyHours"].isna().sum()
np.int64(1)
df.tail()
Name StudyHours Grade
19 Maya 12.0 52.0
20 Yusuf 12.5 63.0
21 Zainab 12.0 64.0
22 Juan 8.0 NaN
23 Ali NaN NaN

Dropping null values:

df.dropna(inplace=True)
df.tail()
Name StudyHours Grade
17 Tariq 6.0 35.0
18 Lakshmi 10.0 48.0
19 Maya 12.0 52.0
20 Yusuf 12.5 63.0
21 Zainab 12.0 64.0

Exploring the relationship between the variables:

import plotly.express as px

px.scatter(df, x="StudyHours", y="Grade", height=350,
            title="Relationship between Study Hours and Grades",
            trendline="ols", trendline_color_override="red",
)

Checking for normality and outliers:

px.violin(df, x="StudyHours", box=True, points="all", height=350,
    title="Distribution of Study Hours",
)
px.violin(df, x="Grade", box=True, points="all", height=350,
            title="Distribution of Grade"
)

Data Splitting

X/Y Split

Identifying the dependent and independent variables:

x = df["StudyHours"]
print(x.shape)

y = df["Grade"]
print(y.shape)
(22,)
(22,)

Adding Constants

Note

When using statsmodels, the documentation instructs us to manually add a column of ones (so the model can estimate the y-intercept, i.e. the constant term):

import statsmodels.api as sm

x = sm.add_constant(x) # adding in a column of constants, as per the OLS docs
x.head()
const StudyHours
0 1.0 10.00
1 1.0 11.50
2 1.0 9.00
3 1.0 16.00
4 1.0 9.25

Train Test Split

Now we split the data into training and test sets:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (16, 2) (16,)
TEST: (6, 2) (6,)
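
By default, train_test_split holds out 25% of the rows for testing, which is where the 16 / 6 split comes from. If we wanted a different split, we could pass the test_size parameter explicitly (a minimal sketch; the 0.25 value below just reproduces the default):

# hypothetical example: setting test_size explicitly (0.25 is the default)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=99
)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)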

Model Selection and Training

Selecting a linear regression (OLS) model, and training it on the training data to learn the ideal weights:

import statsmodels.api as sm

model = sm.OLS(y_train, x_train, missing="drop")
print(type(model))

results = model.fit()
print(type(results))
<class 'statsmodels.regression.linear_model.OLS'>
<class 'statsmodels.regression.linear_model.RegressionResultsWrapper'>

It is possible to access the training results, including some summary statistics:

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Grade   R-squared:                       0.865
Model:                            OLS   Adj. R-squared:                  0.855
Method:                 Least Squares   F-statistic:                     89.55
Date:                Tue, 17 Sep 2024   Prob (F-statistic):           1.84e-07
Time:                        20:51:57   Log-Likelihood:                -56.257
No. Observations:                  16   AIC:                             116.5
Df Residuals:                      14   BIC:                             118.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -17.9242      7.078     -2.532      0.024     -33.105      -2.743
StudyHours     6.3637      0.672      9.463      0.000       4.921       7.806
==============================================================================
Omnibus:                        0.338   Durbin-Watson:                   2.313
Prob(Omnibus):                  0.845   Jarque-Bera (JB):                0.089
Skew:                          -0.162   Prob(JB):                        0.957
Kurtosis:                       2.833   Cond. No.                         34.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Training vs Testing Metrics

The training results contain an r-squared score, however this only measures how well the model fits the training data. To see how well the model generalizes, we will calculate the r-squared score and other metrics on the test set later.
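
For reference, the training r-squared reported in the summary is also available directly as an attribute on the results object (a small sketch using the rsquared and rsquared_adj attributes provided by statsmodels):

# training-set fit statistics (these match the values in the summary above):
print("Training R^2:", round(results.rsquared, 3))
print("Training Adj. R^2:", round(results.rsquared_adj, 3))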

Interpreting P-values

In the OLS training results summary, the column labeled P>|t| represents the p-value for the corresponding t-statistic of each coefficient.

The t-statistic is used to test whether a particular coefficient is significantly different from zero. The p-value (P>|t|) tells you the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true (where the null hypothesis typically posits that the coefficient is zero, meaning the feature has no significant effect on the dependent variable).

  • A low p-value (typically less than 0.05) suggests that you can reject the null hypothesis, meaning the coefficient is statistically significant and likely has an impact on the dependent variable.
  • A high p-value (greater than 0.05) indicates that the coefficient is not statistically significant, implying that the feature may not contribute meaningfully to the model.
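
To illustrate, the t-statistics and p-values from the summary table can also be accessed directly on the results object (a small sketch using the tvalues and pvalues attributes provided by statsmodels):

# t-statistic and p-value for each coefficient (matching the summary table above):
print(results.tvalues)
print(results.pvalues)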

The part of the training results we care about most is the learned weights (i.e. coefficients), which we use to arrive at the line of best fit:

print(results.params)
print("-----------")
print(f"y = {results.params['StudyHours'].round(3)}x + {results.params['const'].round(3)}")
const        -17.924187
StudyHours     6.363725
dtype: float64
-----------
y = 6.364x + -17.924
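
As a sanity check, we can plug a value into this equation ourselves and compare the result against the model's own predict method (a small sketch; the 11-hour value is just an arbitrary example):

# hypothetical example: predict the grade for 11 study hours using the learned weights
m = results.params["StudyHours"]
b = results.params["const"]
print("Manual:", m * 11 + b)

# the same prediction via the fitted model (note the leading 1.0 for the constant column):
print("Model:", results.predict([[1.0, 11.0]]))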

The training results also contain the fittedvalues (predictions), as well as the resid (residuals, or errors). For illustration purposes, we can compare each predicted value against the actual known value, to verify the residuals:

from pandas import DataFrame

# get all rows from the original dataset that wound up in the training set:
training_set = df.loc[x_train.index].copy()

# create a dataset for the predictions and the residuals:
training_preds = DataFrame({
    "Predictions": results.fittedvalues,
    "Residuals": results.resid
})
# merge the training set with the results:
training_set = training_set.merge(training_preds,
    how="inner", left_index=True, right_index=True
)

# calculate error for each datapoint:
training_set["My Error"] = training_set["Grade"] - training_set["Predictions"]
training_set
Name StudyHours Grade Predictions Residuals My Error
15 Anika 8.00 27.0 32.985616 -5.985616 -5.985616
0 Arun 10.00 50.0 45.713067 4.286933 4.286933
18 Lakshmi 10.00 48.0 45.713067 2.286933 2.286933
13 Mei 8.00 15.0 32.985616 -17.985616 -17.985616
12 Priya 9.00 37.0 39.349341 -2.349341 -2.349341
20 Yusuf 12.50 63.0 61.622380 1.377620 1.377620
7 Kwame 9.00 42.0 39.349341 2.650659 2.650659
21 Zainab 12.00 64.0 58.440518 5.559482 5.559482
16 Elif 9.00 36.0 39.349341 -3.349341 -3.349341
5 Xia 1.00 3.0 -11.560462 14.560462 14.560462
4 Liam 9.25 49.0 40.940273 8.059727 8.059727
19 Maya 12.00 52.0 58.440518 -6.440518 -6.440518
9 Takumi 14.50 74.0 74.349831 -0.349831 -0.349831
8 Fatima 8.50 26.0 36.167479 -10.167479 -10.167479
3 Zara 16.00 97.0 83.895419 13.104581 13.104581
1 Sofia 11.50 50.0 55.258655 -5.258655 -5.258655

It is possible to calculate the training metrics ourselves, to verify the regression results summary we saw above:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

r2 = r2_score(training_set["Grade"], training_set["Predictions"])
print("R^2:", round(r2, 3))

mae = mean_absolute_error(training_set["Grade"], training_set["Predictions"])
print("MAE:", round(mae, 3))

mse = mean_squared_error(training_set["Grade"], training_set["Predictions"])
print("MSE:", round(mse,3))

rmse = mse ** .5
print("RMSE:", rmse.round(3))
R^2: 0.865
MAE: 6.486
MSE: 66.301
RMSE: 8.143

Remember, these metrics only tell us about the predictive accuracy on the training data, and we care more about evaluating metrics on the test set.

Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

preds = results.get_prediction(x_test)
print(type(preds))

preds_df = preds.summary_frame(alpha=0.05)
preds_df
<class 'statsmodels.regression._prediction.PredictionResults'>
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
17 20.258165 3.468123 12.819780 27.696550 0.161099 40.355231
14 80.713556 4.282255 71.529034 89.898079 59.906876 101.520237
2 39.349341 2.280844 34.457418 44.241265 20.049253 58.649430
10 80.713556 4.282255 71.529034 89.898079 59.906876 101.520237
6 55.258655 2.394199 50.123609 60.393701 35.895515 74.621795
11 69.577037 3.322979 62.449955 76.704119 49.593099 89.560975

When we use the summary_frame method on prediction results, it returns a DataFrame with several columns. Here’s a breakdown of what they mean:

  • mean: This is the predicted value for each observation based on the model. It’s essentially the point prediction (ŷ) for the corresponding input.

  • mean_se: This stands for the standard error of the predicted mean. It measures the uncertainty associated with the predicted value due to sampling variability. A smaller mean_se indicates higher confidence in the predicted mean.

  • mean_ci_lower: The lower bound of the confidence interval for the predicted mean. The confidence interval quantifies the uncertainty of the predicted mean.

  • mean_ci_upper: The upper bound of the confidence interval for the predicted mean. Together with the lower bound, it forms the confidence interval for the predicted mean.

  • obs_ci_lower: The lower bound of the prediction interval for the observed response. A prediction interval covers the uncertainty in an individual prediction, not just the mean.

  • obs_ci_upper: The upper bound of the prediction interval for the observed response. It is wider than the confidence interval because it includes both the uncertainty in the mean prediction and the variability of the actual observations around the mean.

What’s the difference between confidence and prediction intervals?

  • Confidence Interval (mean_ci_lower and mean_ci_upper): Represents the range in which the true mean prediction is likely to lie. For predicting the average value (e.g., “the average apple weight is between 140 and 160 grams”).
  • Prediction Interval (obs_ci_lower and obs_ci_upper): Represents the range in which an individual new observation is likely to lie. For predicting the range where individual values could fall (e.g., “an individual apple might weigh between 120 and 180 grams”).
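
We can see this in our own prediction results by comparing the widths of the two intervals for each test-set row (a small sketch using the preds_df columns from above); the prediction interval should always be the wider of the two:

# compare interval widths for each test-set row:
ci_width = preds_df["mean_ci_upper"] - preds_df["mean_ci_lower"]
pi_width = preds_df["obs_ci_upper"] - preds_df["obs_ci_lower"]
print((pi_width > ci_width).all())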

Merging the actual values in, so we can compare predicted values vs actual values:

# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()
# merge in the prediction results:
test_set = test_set.merge(preds_df, how="inner", left_index=True, right_index=True)
test_set.rename(columns={"mean": "Prediction", "mean_se": "Standard Error",
                        "mean_ci_upper": "CI Upper", "mean_ci_lower": "CI Lower",
                        "obs_ci_upper": "PI Upper", "obs_ci_lower": "PI Lower",
}, inplace=True)
test_set["My Error"] = test_set["Grade"] - test_set["Prediction"]
test_set
Name StudyHours Grade Prediction Standard Error CI Lower CI Upper PI Lower PI Upper My Error
17 Tariq 6.00 35.0 20.258165 3.468123 12.819780 27.696550 0.161099 40.355231 14.741835
14 Alberto 15.50 70.0 80.713556 4.282255 71.529034 89.898079 59.906876 101.520237 -10.713556
2 Hassan 9.00 47.0 39.349341 2.280844 34.457418 44.241265 20.049253 58.649430 7.650659
10 Leila 15.50 82.0 80.713556 4.282255 71.529034 89.898079 59.906876 101.520237 1.286444
6 Carlos 11.50 53.0 55.258655 2.394199 50.123609 60.393701 35.895515 74.621795 -2.258655
11 Jin 13.75 62.0 69.577037 3.322979 62.449955 76.704119 49.593099 89.560975 -7.577037

Visualizing the predictions:

import plotly.express as px

chart_df = test_set.copy()

# Create the plot
fig = px.scatter(chart_df, x="StudyHours", y="Grade", height=350,
                 title="True vs Predicted Grade (with Pred Interval and Confidence Interval)"
                 )

fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Prediction'],
                mode='lines+markers',
                name='Prediction (with PI)',
                marker=dict(color='mediumpurple'), #size=10, symbol="x"
                error_y=dict(type='data', symmetric=False,
                             array=chart_df['PI Upper'],
                             arrayminus=chart_df['PI Lower']),
                legendrank=1)

fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Prediction'],
                mode='lines+markers',
                name='Prediction (with CI)',
                marker=dict(color='red', size=10, #symbol="x"
                ),
                error_y=dict(type='data', symmetric=False,
                             array=chart_df['CI Upper'],
                             arrayminus=chart_df['CI Lower']),
                legendrank=1)

# Add the true values again as a separate trace, drawn last
# so the actual datapoints appear in front of the predictions:
fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Grade'],
                mode='markers',
                name='True Value',
                marker=dict(color='blue', size=10, symbol="x"),
                legendrank=2)

fig.show()

Evaluating the model using our evaluation metrics from sklearn:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_pred = test_set["Prediction"]

r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))

mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))

rmse = mse ** .5
print("RMSE:", rmse.round(3))
R^2: 0.678
MAE: 7.371
MSE: 75.8
RMSE: 8.706

We see the same results on the test set as we did with the sklearn regression implementation.
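
For reference, here is a minimal sketch of the equivalent sklearn workflow (assuming the same training and test splits as above; sklearn's LinearRegression fits its own intercept, so we drop the const column here):

from sklearn.linear_model import LinearRegression

# fit on the same training data, dropping the constant column:
sk_model = LinearRegression()
sk_model.fit(x_train[["StudyHours"]], y_train)

# evaluate on the same test data:
sk_preds = sk_model.predict(x_test[["StudyHours"]])
print("R^2:", round(r2_score(y_test, sk_preds), 3))
print("Intercept:", round(sk_model.intercept_, 3))
print("Coefficient:", round(sk_model.coef_[0], 3))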