Let’s explore linear regression using an example dataset of student grades. Our goal will be to train a model to predict a student’s grade given the number of hours they have studied.
Data Loading
Loading the data:
from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/grades.csv"

df = read_csv(request_url)
print(len(df))
df.head()
24
   Name    StudyHours  Grade
0  Arun         10.00   50.0
1  Sofia        11.50   50.0
2  Hassan        9.00   47.0
3  Zara         16.00   97.0
4  Liam          9.25   49.0
Checking for null values:
df["StudyHours"].isna().sum()
1
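We can also count missing values across all columns at once (a quick sketch):

# count missing values in every column of the dataset:
df.isna().sum()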
df.tail()
    Name    StudyHours  Grade
19  Maya          12.0   52.0
20  Yusuf         12.5   63.0
21  Zainab        12.0   64.0
22  Juan           8.0    NaN
23  Ali            NaN    NaN
For “Ali”, we don’t have a grade or number of study hours, so we should drop that row.
For “Juan”, since there is no label, we can’t use this record to train the model, but we could use the trained model to predict their grade later (given 8 study hours).
Dropping nulls:
df.dropna(inplace=True)
df.tail()
    Name     StudyHours  Grade
17  Tariq           6.0   35.0
18  Lakshmi        10.0   48.0
19  Maya           12.0   52.0
20  Yusuf          12.5   63.0
21  Zainab         12.0   64.0
Exploring relationship between variables:
import plotly.express as px

px.scatter(df, x="StudyHours", y="Grade", height=350,
           title="Relationship between Study Hours and Grades",
           trendline="ols", trendline_color_override="red"
)
Checking for normality and outliers:
px.violin(df, x="StudyHours", box=True, points="all", height=350, title="Distribution of Study Hours",)
px.violin(df, x="Grade", box=True, points="all", height=350, title="Distribution of Grade")
Data Splitting
X/Y Split
Identifying the independent variable (x, the feature) and the dependent variable (y, the label):
#x = df["StudyHours"] # ValueError: Expected 2D array, got 1D array instead
x = df[["StudyHours"]] # model wants x to be a matrix / DataFrame
print(x.shape)

y = df["Grade"]
print(y.shape)
(22, 1)
(22,)
Note
When using sklearn, we must construct the features as a two-dimensional array (even if the data only contains one column).
Train Test Split
Splitting the data randomly into training and test sets. We will train the model on the training set, and evaluate it using the test set. This helps us assess how well the model generalizes to unseen data, and guards against overfitting.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (16, 1) (16,)
TEST: (6, 1) (6,)
Model Selection and Training
Selecting a linear regression model (ordinary least squares, or OLS), and training it on the training data to learn the ideal weights:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
After the model is trained, we have access to the ideal weights (i.e. “coefficients”). There is one coefficient for each feature (in this case only one: number of hours studied).
print("COEFS:", model.coef_.round(3)) # one for each featureprint("Y INTERCEPT:", model.intercept_.round(3))
COEFS: [6.364]
Y INTERCEPT: -17.924
Note
The convention with sklearn models is that any attributes ending with an underscore (_), like coef_ and intercept_, are only available after the model has been trained.
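For example, accessing coef_ before training raises an error, since the attribute does not exist yet (a quick sketch; the untrained_model name is just for illustration):

from sklearn.linear_model import LinearRegression

untrained_model = LinearRegression()
try:
    print(untrained_model.coef_)  # not set until fit() is called
except AttributeError as err:
    print("NOT FITTED:", err)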
When we have multiple coefficients, it will be helpful to wrap them in a Series to see which weights correspond with which features (although in this case there is only one feature):
from pandas import Series

coefs = Series(model.coef_, index=model.feature_names_in_)
print(coefs)
StudyHours 6.363725
dtype: float64
The coefficients and y-intercept tell us the line of best fit:
print("--------------")print(f"EQUATION FOR LINE OF BEST FIT:")print(f"y = ({round(model.coef_[0], 3)} * StudyHours) + {round(model.intercept_, 3)}")
--------------
EQUATION FOR LINE OF BEST FIT:
y = (6.364 * StudyHours) + -17.924
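As a sanity check, we can plug a number of study hours into this equation ourselves and compare against the model's own prediction (a minimal sketch; the 10-hour input is arbitrary):

from pandas import DataFrame

study_hours = 10

# manual prediction, using the learned weights (rounded to 3 decimals):
manual_pred = (6.364 * study_hours) + -17.924
print("MANUAL:", round(manual_pred, 1))

# model prediction (should match, up to rounding):
model_pred = model.predict(DataFrame({"StudyHours": [study_hours]}))
print("MODEL:", model_pred.round(1))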
Model Predictions and Evaluation
Alright, we trained the model, but how well does it do in making predictions?
We use the trained model to make predictions on the unseen (test) data:
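y_pred = model.predict(x_test)
print(y_pred)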
We can then compare each of the predicted values against the actual known values:
# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()

# create a column for the predictions:
test_set["PredictedGrade"] = y_pred.round(1)

# calculate error for each datapoint:
test_set["Error"] = (y_pred - y_test).round(1)

test_set.sort_values(by="StudyHours", ascending=False)
To measure how well the model did across the entire test dataset, we can use any number of regression metrics (r-squared score, mean squared error, mean absolute error, root mean squared error):
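For example, a minimal sketch using sklearn.metrics (assuming the y_pred values from above):

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mse ** 0.5  # root mean squared error

print("R^2:", round(r2, 3))
print("MSE:", round(mse, 3))
print("MAE:", round(mae, 3))
print("RMSE:", round(rmse, 3))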
Now that the model has been trained and deemed to have sufficient performance, we can use it to make predictions on new, unseen data (sometimes called “inference”):
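For example, recall that “Juan” studied for 8 hours, but we don't know his grade. A minimal sketch of predicting it (the x_new name is just for illustration):

from pandas import DataFrame

# "Juan" studied 8 hours, but has no recorded grade (see the dropped rows above):
x_new = DataFrame({"StudyHours": [8]})
model.predict(x_new)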