Linear Regression with sklearn

Let’s explore linear regression using an example dataset of student grades. Our goal will be to train a model to predict a student’s grade given the number of hours they have studied.

Data Loading

Loading the data:

from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/grades.csv"

df = read_csv(request_url)
print(len(df))
df.head()
24
Name StudyHours Grade
0 Arun 10.00 50.0
1 Sofia 11.50 50.0
2 Hassan 9.00 47.0
3 Zara 16.00 97.0
4 Liam 9.25 49.0

Checking for null values:

df["StudyHours"].isna().sum()
1
df.tail()
Name StudyHours Grade
19 Maya 12.0 52.0
20 Yusuf 12.5 63.0
21 Zainab 12.0 64.0
22 Juan 8.0 NaN
23 Ali NaN NaN
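
We can also isolate just the rows that contain nulls. A quick sketch (this step is an addition, using standard pandas):

# show only the rows that have at least one null value:
df[df.isna().any(axis=1)]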

For “Ali”, we don’t have a grade or number of study hours, so we should drop that row.

For “Juan”, since there is no label, we can’t use this record to train the model, but we could use the trained model to predict their grade later (given 8 study hours).
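
One way to act on this (a sketch; this step isn't part of the walkthrough above) is to set such records aside before dropping nulls, so we can revisit them at prediction time:

# hypothetical: set aside rows that have features but no label, to predict later:
to_predict = df[df["Grade"].isna() & df["StudyHours"].notna()].copy()
print(to_predict[["Name", "StudyHours"]])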

Dropping nulls:

df.dropna(inplace=True)
df.tail()
Name StudyHours Grade
17 Tariq 6.0 35.0
18 Lakshmi 10.0 48.0
19 Maya 12.0 52.0
20 Yusuf 12.5 63.0
21 Zainab 12.0 64.0

Data Exploration

Exploring the relationship between variables:

import plotly.express as px

px.scatter(df, x="StudyHours", y="Grade", height=350,
           title="Relationship between Study Hours and Grades",
           trendline="ols", trendline_color_override="red",
)

Checking for normality and outliers:

px.violin(df, x="StudyHours", box=True, points="all", height=350,
          title="Distribution of Study Hours",
)

px.violin(df, x="Grade", box=True, points="all", height=350,
          title="Distribution of Grade",
)

Data Splitting

X/Y Split

Identifying the dependent and independent variables.

#x = df["StudyHours"] # ValueError: Expected 2D array, got 1D array instead
x = df[["StudyHours"]] # model wants x to be a matrix / DataFrame
print(x.shape)

y = df["Grade"]
print(y.shape)
(22, 1)
(22,)
Note

When using sklearn, we must construct the features as a two-dimensional array (even if the data only contains one column).
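
As an aside: if our features started out as a one-dimensional Series or array, one common workaround (shown here as a sketch, not used in this example) is to reshape them into a single-column matrix:

# hypothetical alternative: reshape a 1D array into a 2D single-column array:
x_alt = df["StudyHours"].values.reshape(-1, 1)
print(x_alt.shape) # (22, 1)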

Train Test Split

Splitting the data randomly into test and training sets. We will train the model on the training set, and evaluate it using the test set. This helps us assess how well the model generalizes to unseen data, and guards against overfitting.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (16, 1) (16,)
TEST: (6, 1) (6,)
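
By default, train_test_split reserves 25% of the rows for the test set. If we wanted a different split, we could pass the test_size parameter explicitly (a sketch; the 0.25 here simply mirrors the default):

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=99 # e.g. 25% test, 75% train
)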

Model Selection and Training

Selecting a linear regression model (ordinary least squares), and training it on the training data to learn the ideal weights:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(x_train, y_train)
LinearRegression()

After the model is trained, we have access to the ideal weights (i.e. “coefficients”). There is one coefficient for each feature (in this case only one: number of hours studied).

print("COEFS:", model.coef_.round(3)) # one for each feature
print("Y INTERCEPT:", model.intercept_.round(3))
COEFS: [6.364]
Y INTERCEPT: -17.924
Note

The convention with sklearn models is that any attributes ending with an underscore (_), like coef_ and intercept_, are only available after the model has been trained.
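
For example (an illustration, not part of the walkthrough above), asking an untrained model for its coefficients raises an error:

untrained_model = LinearRegression()
#untrained_model.coef_ # AttributeError: 'LinearRegression' object has no attribute 'coef_'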

When we have multiple coefficients, it will be helpful to wrap them in a Series to see which weights correspond with which features (although in this case there is only one feature):

from pandas import Series

coefs = Series(model.coef_, index=model.feature_names_in_)
print(coefs)
StudyHours    6.363725
dtype: float64

The coefficients and y-intercept tell us the line of best fit (in this case, each additional hour of studying is associated with roughly 6.4 more grade points):

print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print(f"y = ({round(model.coef_[0], 3)} * StudyHours) + {round(model.intercept_, 3)}")
--------------
EQUATION FOR LINE OF BEST FIT:
y = (6.364 * StudyHours) + -17.924
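
We can sanity-check this equation by plugging in a value by hand (the 10 study hours here is an arbitrary choice) and comparing against the model's own prediction:

from pandas import DataFrame

hours = 10 # arbitrary example value
manual_pred = (model.coef_[0] * hours) + model.intercept_
print(manual_pred.round(3)) # 45.713
print(model.predict(DataFrame({"StudyHours": [hours]}))) # should match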

Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

y_pred = model.predict(x_test)
print(y_pred)
[20.25816529 80.71355636 39.34934142 80.71355636 55.25865485 69.57703695]

We can then compare each of the predicted values against the actual known values:

# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()

# create a column for the predictions:
test_set["PredictedGrade"] = y_pred.round(1)

# calculate error for each datapoint:
test_set["Error"] = (y_pred - y_test).round(1)

test_set.sort_values(by="StudyHours", ascending=False)
Name StudyHours Grade PredictedGrade Error
14 Alberto 15.50 70.0 80.7 10.7
10 Leila 15.50 82.0 80.7 -1.3
11 Jin 13.75 62.0 69.6 7.6
6 Carlos 11.50 53.0 55.3 2.3
2 Hassan 9.00 47.0 39.3 -7.7
17 Tariq 6.00 35.0 20.3 -14.7

Plotting the errors on a graph:

px.scatter(test_set, x="StudyHours", y=["Grade", "PredictedGrade"],
           hover_data=["Name"], height=350,
           title="Prediction errors (test set)",
           labels={"value": ""}
)

To measure how well the model performs across the entire test set, we can use any number of regression metrics, such as the r-squared score, mean squared error, mean absolute error, or root mean squared error.

It is possible for us to compute our own metrics:

my_mae = test_set["Error"].abs().mean()
print("MY MAE:", my_mae.round(3))
MY MAE: 7.383
my_mse = (test_set["Error"] ** 2).mean()
print("MY MSE:", my_mse.round(1))
MY MSE: 75.8

However, more commonly we will use metric functions from the sklearn.metrics submodule (the slight difference from our hand-computed MAE is because we rounded each error to one decimal place earlier):

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))

mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))
R^2: 0.678
MAE: 7.371
MSE: 75.8
rmse = mse ** 0.5
print("RMSE:", round(rmse, 3))
RMSE: 8.706
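
If you are on a newer version of sklearn (1.4 or later), there is also a dedicated function for this metric:

from sklearn.metrics import root_mean_squared_error # requires sklearn 1.4+

rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", round(rmse, 3))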

Inference

Now that the model has been trained and deemed to perform sufficiently well, we can use it to make predictions on new, unseen data (a process sometimes called “inference”):

from pandas import DataFrame

x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})

model.predict(x_new)
array([-17.92418697,   7.53071454,  32.98561604,  58.44051754,
        83.89541905, 109.35032055])

Alright, we have trained a model and used it to make predictions. Notice the prediction for 8 study hours (around 33) answers our earlier question about “Juan”.