```python
from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/grades.csv"
df = read_csv(request_url)
df.head()
```
|   | Name   | StudyHours | Grade |
|---|--------|------------|-------|
| 0 | Arun   | 10.00      | 50.0  |
| 1 | Sofia  | 11.50      | 50.0  |
| 2 | Hassan | 9.00       | 47.0  |
| 3 | Zara   | 16.00      | 97.0  |
| 4 | Liam   | 9.25       | 49.0  |
Data Exploration
Checking for null values:
df["StudyHours"].isna().sum()
np.int64(1)
```python
df.tail()
```

|    | Name   | StudyHours | Grade |
|----|--------|------------|-------|
| 19 | Maya   | 12.0       | 52.0  |
| 20 | Yusuf  | 12.5       | 63.0  |
| 21 | Zainab | 12.0       | 64.0  |
| 22 | Juan   | 8.0        | NaN   |
| 23 | Ali    | NaN        | NaN   |
Dropping null values:
```python
df.dropna(inplace=True)
df.tail()
```

|    | Name    | StudyHours | Grade |
|----|---------|------------|-------|
| 17 | Tariq   | 6.0        | 35.0  |
| 18 | Lakshmi | 10.0       | 48.0  |
| 19 | Maya    | 12.0       | 52.0  |
| 20 | Yusuf   | 12.5       | 63.0  |
| 21 | Zainab  | 12.0       | 64.0  |
Exploring relationship between variables:
```python
import plotly.express as px

px.scatter(df, x="StudyHours", y="Grade", height=350,
    title="Relationship between Study Hours and Grades",
    trendline="ols", trendline_color_override="red",
)
```
Checking for normality and outliers:
px.violin(df, x="StudyHours", box=True, points="all", height=350, title="Distribution of Study Hours",)
px.violin(df, x="Grade", box=True, points="all", height=350, title="Distribution of Grade")
Data Splitting
X/Y Split
Identifying the dependent and independent variables:
x = df["StudyHours"]print(x.shape)y = df["Grade"]print(y.shape)
(22,)
(22,)
Adding Constants
Note: When using statsmodels, the documentation instructs us to manually add a column of ones (to help the model perform calculations related to the y-intercept):
```python
import statsmodels.api as sm

x = sm.add_constant(x) # adding in a column of constants, as per the OLS docs
x.head()
```

|   | const | StudyHours |
|---|-------|------------|
| 0 | 1.0   | 10.00      |
| 1 | 1.0   | 11.50      |
| 2 | 1.0   | 9.00       |
| 3 | 1.0   | 16.00      |
| 4 | 1.0   | 9.25       |
Train Test Split
Now we split the data into training and test sets:
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```

TRAIN: (16, 2) (16,)
TEST: (6, 2) (6,)
Model Selection and Training
Selecting a linear regression (OLS) model, and training it on the training data to learn the ideal weights:
```python
import statsmodels.api as sm

model = sm.OLS(y_train, x_train, missing="drop")
print(type(model))

results = model.fit()
print(type(results))
```
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py:418: UserWarning:
`kurtosistest` p-value may be inaccurate with fewer than 20 observations; only n=16 observations were given.
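The training results summary referenced in the next sections is presumably displayed at this point; a minimal sketch (the rendered summary table is omitted here):

```python
# display the OLS training results summary, which includes
# the r-squared score, the learned coefficients, and their p-values:
print(results.summary())
```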
Training vs Testing Metrics
The training results contain an r-squared score, however this only reflects performance on the training data. To see how well the model generalizes, we will calculate the r-squared score and other metrics on the test results later.
Interpreting P-values
In the OLS training results summary, the column labeled P>|t| represents the p-value for the corresponding t-statistic of each coefficient.
The t-statistic is used to test whether a particular coefficient is significantly different from zero. The p-value (P>|t|) tells you the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true (where the null hypothesis typically posits that the coefficient is zero, meaning the feature has no significant effect on the dependent variable).
A low p-value (typically less than 0.05) suggests that you can reject the null hypothesis, meaning the coefficient is statistically significant and likely has an impact on the dependent variable.
A high p-value (greater than 0.05) indicates that the coefficient is not statistically significant, implying that the feature may not contribute meaningfully to the model.
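To pull these p-values out programmatically, the results object exposes a pvalues attribute; a quick sketch (the 0.05 threshold is just the conventional choice):

```python
# p-value for each coefficient (including the constant):
print(results.pvalues)

# flag which coefficients are statistically significant at the 5% level:
print(results.pvalues < 0.05)
```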
The part of the training results we care most about is the learned weights (i.e. coefficients), which we use to arrive at the line of best fit:
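A minimal sketch of accessing the learned weights (statsmodels stores them in results.params, indexed by the column names we created earlier):

```python
# learned weights: the y-intercept ("const") and the slope for each feature:
print(results.params)

# the resulting line of best fit, in slope-intercept form:
print(f"Grade = {results.params['StudyHours']:.2f} * StudyHours + {results.params['const']:.2f}")
```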
The training results also contain the fittedvalues (predictions), as well as the resid (residuals, or errors). We can compare each predicted value against the actual known value, to verify the residuals for illustration purposes:
```python
from pandas import DataFrame

# get all rows from the original dataset that wound up in the training set:
training_set = df.loc[x_train.index].copy()

# create a dataset for the predictions and the residuals:
training_preds = DataFrame({
    "Predictions": results.fittedvalues,
    "Residuals": results.resid
})

# merge the training set with the results:
training_set = training_set.merge(training_preds, how="inner",
                                  left_index=True, right_index=True)

# calculate error for each datapoint:
training_set["My Error"] = training_set["Grade"] - training_set["Predictions"]
training_set
```
|    | Name    | StudyHours | Grade | Predictions | Residuals  | My Error   |
|----|---------|------------|-------|-------------|------------|------------|
| 15 | Anika   | 8.00       | 27.0  | 32.985616   | -5.985616  | -5.985616  |
| 0  | Arun    | 10.00      | 50.0  | 45.713067   | 4.286933   | 4.286933   |
| 18 | Lakshmi | 10.00      | 48.0  | 45.713067   | 2.286933   | 2.286933   |
| 13 | Mei     | 8.00       | 15.0  | 32.985616   | -17.985616 | -17.985616 |
| 12 | Priya   | 9.00       | 37.0  | 39.349341   | -2.349341  | -2.349341  |
| 20 | Yusuf   | 12.50      | 63.0  | 61.622380   | 1.377620   | 1.377620   |
| 7  | Kwame   | 9.00       | 42.0  | 39.349341   | 2.650659   | 2.650659   |
| 21 | Zainab  | 12.00      | 64.0  | 58.440518   | 5.559482   | 5.559482   |
| 16 | Elif    | 9.00       | 36.0  | 39.349341   | -3.349341  | -3.349341  |
| 5  | Xia     | 1.00       | 3.0   | -11.560462  | 14.560462  | 14.560462  |
| 4  | Liam    | 9.25       | 49.0  | 40.940273   | 8.059727   | 8.059727   |
| 19 | Maya    | 12.00      | 52.0  | 58.440518   | -6.440518  | -6.440518  |
| 9  | Takumi  | 14.50      | 74.0  | 74.349831   | -0.349831  | -0.349831  |
| 8  | Fatima  | 8.50       | 26.0  | 36.167479   | -10.167479 | -10.167479 |
| 3  | Zara    | 16.00      | 97.0  | 83.895419   | 13.104581  | 13.104581  |
| 1  | Sofia   | 11.50      | 50.0  | 55.258655   | -5.258655  | -5.258655  |
It is possible to calculate the training metrics ourselves, to verify the regression results summary we saw above:
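For example, a minimal sketch using sklearn's metric functions on the training set assembled above (the variable names match those created earlier):

```python
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# compare the actual grades against the model's fitted values:
y_true = training_set["Grade"]
y_pred = training_set["Predictions"]

print("R^2:", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
```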
When we use the summary_frame method on prediction results (see the sketch below for how those prediction results are obtained), it returns a DataFrame with several columns. Here's a breakdown of what they mean:
- mean: This is the predicted value for each observation based on the model. It's essentially the point prediction (ŷ) for the corresponding input.
- mean_se: This stands for the standard error of the predicted mean. It measures the uncertainty associated with the predicted value due to sampling variability. A smaller mean_se indicates higher confidence in the predicted mean.
- mean_ci_lower: The lower bound of the confidence interval for the predicted mean. The confidence interval quantifies the uncertainty of the predicted mean.
- mean_ci_upper: The upper bound of the confidence interval for the predicted mean. Together with the lower bound, it forms the confidence interval for the predicted mean.
- obs_ci_lower: The lower bound of the prediction interval for the observed response. A prediction interval covers the uncertainty in an individual prediction, not just the mean.
- obs_ci_upper: The upper bound of the prediction interval for the observed response. It is wider than the confidence interval because it includes both the uncertainty in the mean prediction and the variability of the actual observations around the mean.
What’s the difference between confidence and prediction intervals?
- Confidence Interval (mean_ci_lower and mean_ci_upper): Represents the range in which the true mean prediction is likely to lie. Use it for predicting the average value (e.g., "the average apple weight is between 140 and 160 grams").
- Prediction Interval (obs_ci_lower and obs_ci_upper): Represents the range in which an individual new observation is likely to lie. Use it for predicting the range where individual values could fall (e.g., "an individual apple might weigh between 120 and 180 grams"). It is wider than the confidence interval.
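The preds_df merged below is not shown being created in this excerpt; presumably it comes from calling get_prediction on the test inputs and converting the result to a summary frame, something like this sketch:

```python
# make predictions on the test inputs, with uncertainty estimates:
preds = results.get_prediction(x_test)

# convert to a DataFrame of point predictions and interval bounds
# (alpha=0.05 gives 95% confidence and prediction intervals):
preds_df = preds.summary_frame(alpha=0.05)
preds_df.head()
```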
Merging the actual values in, so we can compare predicted values vs actual values:
```python
# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()

# merge in the prediction results:
test_set = test_set.merge(preds_df, how="inner",
                          left_index=True, right_index=True)

test_set.rename(columns={
    "mean": "Prediction", "mean_se": "Standard Error",
    "mean_ci_upper": "CI Upper", "mean_ci_lower": "CI Lower",
    "obs_ci_upper": "PI Upper", "obs_ci_lower": "PI Lower",
}, inplace=True)

test_set["My Error"] = test_set["Grade"] - test_set["Prediction"]
test_set
```
|    | Name    | StudyHours | Grade | Prediction | Standard Error | CI Lower  | CI Upper  | PI Lower  | PI Upper   | My Error   |
|----|---------|------------|-------|------------|----------------|-----------|-----------|-----------|------------|------------|
| 17 | Tariq   | 6.00       | 35.0  | 20.258165  | 3.468123       | 12.819780 | 27.696550 | 0.161099  | 40.355231  | 14.741835  |
| 14 | Alberto | 15.50      | 70.0  | 80.713556  | 4.282255       | 71.529034 | 89.898079 | 59.906876 | 101.520237 | -10.713556 |
| 2  | Hassan  | 9.00       | 47.0  | 39.349341  | 2.280844       | 34.457418 | 44.241265 | 20.049253 | 58.649430  | 7.650659   |
| 10 | Leila   | 15.50      | 82.0  | 80.713556  | 4.282255       | 71.529034 | 89.898079 | 59.906876 | 101.520237 | 1.286444   |
| 6  | Carlos  | 11.50      | 53.0  | 55.258655  | 2.394199       | 50.123609 | 60.393701 | 35.895515 | 74.621795  | -2.258655  |
| 11 | Jin     | 13.75      | 62.0  | 69.577037  | 3.322979       | 62.449955 | 76.704119 | 49.593099 | 89.560975  | -7.577037  |
Visualizing the predictions:
```python
import plotly.express as px

chart_df = test_set.copy()

# Create the plot
fig = px.scatter(chart_df, x="StudyHours", y="Grade", height=350,
    title="True vs Predicted Grade (with Pred Interval and Confidence Interval)"
)
# NOTE: plotly error bar arrays are offsets from the point,
# not absolute bounds, so we subtract out the prediction:
fig.add_scatter(x=chart_df["StudyHours"], y=chart_df["Prediction"],
    mode="lines+markers", name="Prediction (with PI)",
    marker=dict(color="mediumpurple"), #size=10, symbol="x"
    error_y=dict(type="data", symmetric=False,
        array=chart_df["PI Upper"] - chart_df["Prediction"],
        arrayminus=chart_df["Prediction"] - chart_df["PI Lower"]
    ),
    legendrank=1
)
fig.add_scatter(x=chart_df["StudyHours"], y=chart_df["Prediction"],
    mode="lines+markers", name="Prediction (with CI)",
    marker=dict(color="red", size=10), #symbol="x"
    error_y=dict(type="data", symmetric=False,
        array=chart_df["CI Upper"] - chart_df["Prediction"],
        arrayminus=chart_df["Prediction"] - chart_df["CI Lower"]
    ),
    legendrank=1
)
# Now add the scatter for the true values again, placing this plot behind
# (hack to get the actual values in front)
fig.add_scatter(x=chart_df["StudyHours"], y=chart_df["Grade"],
    mode="markers", name="True Value",
    marker=dict(color="blue", size=10, symbol="x"),
    legendrank=2
)
fig.show()
```
Evaluating the model using our evaluation metrics from sklearn:
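The evaluation code is cut off in this excerpt; a minimal sketch of what it presumably looks like, reusing sklearn's metric functions on the test set assembled above:

```python
from sklearn.metrics import r2_score, mean_absolute_error

# score the model on data it has never seen before:
y_true = test_set["Grade"]
y_pred = test_set["Prediction"]

print("R^2:", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
```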