Linear Regression with sklearn

Let’s explore linear regression using an example dataset of student grades. Our goal will be to train a model to predict a student’s grade given the number of hours they have studied.

Data Loading

Loading the data:

from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/grades.csv"

df = read_csv(request_url)
print(len(df))
df.head()
24
Name StudyHours Grade
0 Arun 10.00 50.0
1 Sofia 11.50 50.0
2 Hassan 9.00 47.0
3 Zara 16.00 97.0
4 Liam 9.25 49.0

Checking for null values:

df["StudyHours"].isna().sum()
1
df.tail()
Name StudyHours Grade
19 Maya 12.0 52.0
20 Yusuf 12.5 63.0
21 Zainab 12.0 64.0
22 Juan 8.0 NaN
23 Ali NaN NaN
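
We can also isolate just the rows that contain nulls. A quick sketch (this step is an addition, using standard pandas):

# show only the rows that have at least one null value:
df[df.isna().any(axis=1)]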

For “Ali”, we don’t have a grade or number of study hours, so we should drop that row.

For “Juan”, since there is no label, we can’t use this record to train the model, but we could use the trained model to predict their grade later (given 8 study hours).
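
One way to act on this (a sketch; this step isn't part of the walkthrough above) is to set such records aside before dropping nulls, so we can revisit them at prediction time:

# hypothetical: set aside rows that have features but no label, to predict later:
to_predict = df[df["Grade"].isna() & df["StudyHours"].notna()].copy()
print(to_predict[["Name", "StudyHours"]])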

Dropping nulls:

df.dropna(inplace=True)
df.tail()
Name StudyHours Grade
17 Tariq 6.0 35.0
18 Lakshmi 10.0 48.0
19 Maya 12.0 52.0
20 Yusuf 12.5 63.0
21 Zainab 12.0 64.0

Data Exploration

Exploring the relationship between variables:

import plotly.express as px

px.scatter(df, x="StudyHours", y="Grade", height=350,
           title="Relationship between Study Hours and Grades",
           trendline="ols", trendline_color_override="red",
)

Checking for normality and outliers:

px.violin(df, x="StudyHours", box=True, points="all", height=350,
          title="Distribution of Study Hours",
)

px.violin(df, x="Grade", box=True, points="all", height=350,
          title="Distribution of Grade",
)

Data Splitting

X/Y Split

Identifying the dependent and independent variables.

#x = df["StudyHours"] # ValueError: Expected 2D array, got 1D array instead
x = df[["StudyHours"]] # model wants x to be a matrix / DataFrame
print(x.shape)

y = df["Grade"]
print(y.shape)
(22, 1)
(22,)
Note

When using sklearn, we must construct the features as a two-dimensional array (even if the data only contains one column).
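
As an aside: if our features started out as a one-dimensional Series or array, one common workaround (shown here as a sketch, not used in this example) is to reshape them into a single-column matrix:

# hypothetical alternative: reshape a 1D array into a 2D single-column array:
x_alt = df["StudyHours"].values.reshape(-1, 1)
print(x_alt.shape) # (22, 1)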

Train Test Split

Splitting the data randomly into test and training sets. We will train the model on the training set, and evaluate it using the test set. This helps us assess how well the model generalizes to unseen data, and guards against overfitting.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (16, 1) (16,)
TEST: (6, 1) (6,)
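
By default, train_test_split reserves 25% of the rows for the test set. If we wanted a different split, we could pass the test_size parameter explicitly (a sketch; the 0.25 here simply mirrors the default):

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=99 # e.g. 25% test, 75% train
)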

Model Selection and Training

Selecting a linear regression model (ordinary least squares), and training it on the training data to learn the ideal weights:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(x_train, y_train)
LinearRegression()

After the model is trained, we have access to the ideal weights (i.e. “coefficients”). There is one coefficient for each feature (in this case only one: number of hours studied).

print("COEFS:", model.coef_.round(3)) # one for each feature
print("Y INTERCEPT:", model.intercept_.round(3))
COEFS: [6.364]
Y INTERCEPT: -17.924
Note

The convention with sklearn models is that any attributes ending with an underscore (_), like coef_ and intercept_, are only available after the model has been trained.
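
For example (an illustration, not part of the walkthrough above), asking an untrained model for its coefficients raises an error:

untrained_model = LinearRegression()
#untrained_model.coef_ # AttributeError: 'LinearRegression' object has no attribute 'coef_'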

When we have multiple coefficients, it will be helpful to wrap them in a Series to see which weights correspond with which features (although in this case there is only one feature):

from pandas import Series

coefs = Series(model.coef_, index=model.feature_names_in_)
print(coefs)
StudyHours    6.363725
dtype: float64

The coefficients and y-intercept tell us the line of best fit (in this case, each additional hour of studying is associated with roughly 6.4 more grade points):

print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print(f"y = ({round(model.coef_[0], 3)} * StudyHours) + {round(model.intercept_, 3)}")
--------------
EQUATION FOR LINE OF BEST FIT:
y = (6.364 * StudyHours) + -17.924
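
We can sanity-check this equation by plugging in a value by hand (the 10 study hours here is an arbitrary choice) and comparing against the model's own prediction:

from pandas import DataFrame

hours = 10 # arbitrary example value
manual_pred = (model.coef_[0] * hours) + model.intercept_
print(manual_pred.round(3)) # 45.713
print(model.predict(DataFrame({"StudyHours": [hours]}))) # should match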

Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

y_pred = model.predict(x_test)
print(y_pred)
[20.25816529 80.71355636 39.34934142 80.71355636 55.25865485 69.57703695]

We can then compare each of the predicted values against the actual known values:

# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()

# create a column for the predictions:
test_set["PredictedGrade"] = y_pred.round(1)

# calculate error for each datapoint:
test_set["Error"] = (y_pred - y_test).round(1)

test_set.sort_values(by="StudyHours", ascending=False)
Name StudyHours Grade PredictedGrade Error
14 Alberto 15.50 70.0 80.7 10.7
10 Leila 15.50 82.0 80.7 -1.3
11 Jin 13.75 62.0 69.6 7.6
6 Carlos 11.50 53.0 55.3 2.3
2 Hassan 9.00 47.0 39.3 -7.7
17 Tariq 6.00 35.0 20.3 -14.7

Plotting the errors on a graph:

px.scatter(test_set, x="StudyHours", y=["Grade", "PredictedGrade"],
           hover_data=["Name"], height=350,
           title="Prediction errors (test set)",
           labels={"value": ""}
)

To measure how well the model performs across the entire test set, we can use any number of regression metrics, such as the r-squared score, mean squared error, mean absolute error, or root mean squared error.

It is possible for us to compute our own metrics:

my_mae = test_set["Error"].abs().mean()
print("MY MAE:", my_mae.round(3))
MY MAE: 7.383
my_mse = (test_set["Error"] ** 2).mean()
print("MY MSE:", my_mse.round(1))
MY MSE: 75.8

However, more commonly we will use metric functions from the sklearn.metrics submodule (the slight difference from our hand-computed MAE is because we rounded each error to one decimal place earlier):

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))

mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))
R^2: 0.678
MAE: 7.371
MSE: 75.8
rmse = mse ** 0.5
print("RMSE:", round(rmse, 3))
RMSE: 8.706
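
If you are on a newer version of sklearn (1.4 or later), there is also a dedicated function for this metric:

from sklearn.metrics import root_mean_squared_error # requires sklearn 1.4+

rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", round(rmse, 3))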

Inference

Now that the model has been trained and deemed to perform sufficiently well, we can use it to make predictions on new, unseen data (a process sometimes called “inference”):

from pandas import DataFrame

x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})

model.predict(x_new)
array([-17.92418697,   7.53071454,  32.98561604,  58.44051754,
        83.89541905, 109.35032055])

Alright, we have trained a model and used it to make predictions. Notice the prediction for 8 study hours (around 33) answers our earlier question about “Juan”.