16 Binary Classification

16.1 Data Loading

To illustrate binary classification, we’ll use an “Occupancy Detection” dataset.

“Experimental data used for binary classification (room occupancy) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.”

Loading the dataset:

```python
from ucimlrepo import fetch_ucirepo

ds = fetch_ucirepo(id=357)
```

Inspecting the variables:

```python
ds.variables
```

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | id | ID | Integer | None | None | None | no |
| 1 | date | Feature | Date | None | None | None | no |
| 2 | Temperature | Feature | Integer | None | None | C | no |
| 3 | Humidity | Feature | Continuous | None | None | % | no |
| 4 | Light | Feature | Integer | None | None | Lux | no |
| 5 | CO2 | Feature | Continuous | None | None | ppm | no |
| 6 | HumidityRatio | Feature | Continuous | None | None | kgwater-vapor/kg-air | no |
| 7 | Occupancy | Target | Binary | None | 0 for not occupied, 1 for occupied status | None | no |
Variable Info (paraphrased from the UCI website):
- Date: time, in the format “year-month-day hour:minute:second”
- Temperature: in Celsius
- Humidity: relative humidity, as a percentage (%)
- Light: in Lux
- CO2: in ppm
- HumidityRatio: derived quantity from temperature and relative humidity, in kg-water-vapor/kg-air
- Occupancy: 0 or 1 (0 for not occupied, 1 for occupied)
Loading the data:

```python
df = ds["data"]["original"].copy()
df.rename(columns={"date": "Date", "Occupancy": "Occupied"}, inplace=True)
df.drop(columns=["id"], inplace=True)
df.head()
```

| | Date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupied |
|---|---|---|---|---|---|---|---|
| 0 | 2015-02-04 17:51:00 | 23.18 | 27.272 | 426 | 721.25 | 0.00479298817650529 | 1.0 |
| 1 | 2015-02-04 17:51:59 | 23.15 | 27.2675 | 429.5 | 714 | 0.00478344094931065 | 1.0 |
| 2 | 2015-02-04 17:53:00 | 23.15 | 27.245 | 426 | 713.5 | 0.00477946352442199 | 1.0 |
| 3 | 2015-02-04 17:54:00 | 23.15 | 27.2 | 426 | 708.25 | 0.00477150882608175 | 1.0 |
| 4 | 2015-02-04 17:55:00 | 23.1 | 27.2 | 426 | 704.5 | 0.00475699293331518 | 1.0 |
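An aside: the Date column arrives as strings (more on datatypes shortly). If we ever wanted time-of-day features, a minimal, non-mutating sketch for parsing it could look like this (illustrative only, since we drop Date before modeling):

```python
from pandas import to_datetime

# Parse the timestamp strings and peek at the hour of day
# (we leave df unchanged here).
to_datetime(df["Date"]).dt.hour.head()
```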
16.2 Data Cleaning
16.2.1 Null Values
Checking for nulls:
```python
df.isnull().sum()
```

```
Date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupied         2
dtype: int64
```
Dropping two rows that have null values for the target variable:
```python
print(len(df))
df.dropna(inplace=True)
print(len(df))
```

```
20562
20560
```
```python
df.isnull().sum()
```

```
Date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupied         0
dtype: int64
```
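Both nulls happened to be in the target column, so a blanket dropna() was safe here. If other columns also had missing values, a more targeted sketch would restrict the drop to the target:

```python
# Drop only rows where the target is missing, keeping rows that are
# merely missing feature values (those could be imputed instead).
df = df.dropna(subset=["Occupied"])
```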
16.2.2 Datatypes
Upon investigation (and after attempting some later analysis), we learn the numerical data is currently stored as strings (represented by pandas as the “object” dtype):
```python
df.dtypes
```

```
Date              object
Temperature       object
Humidity          object
Light             object
CO2               object
HumidityRatio     object
Occupied         float64
dtype: object
```
So we need to clean the data before moving on:
```python
from pandas import to_numeric

numeric_features = ["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]
df[numeric_features] = df[numeric_features].apply(to_numeric)
df.dtypes
```

```
Date              object
Temperature      float64
Humidity         float64
Light            float64
CO2              float64
HumidityRatio    float64
Occupied         float64
dtype: object
```
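Note that to_numeric raises an error if any string fails to parse. For messier datasets, a hedged variant would coerce bad values to NaN so they can be inspected afterwards:

```python
# errors="coerce" turns unparseable strings into NaN instead of raising,
# so stray bad values can be located with isnull() and handled explicitly.
df[numeric_features] = df[numeric_features].apply(to_numeric, errors="coerce")
```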
16.3 Data Exploration
16.3.1 Distribution of the Target
```python
target = "Occupied"
```

As you can see, the target variable is binary, which makes this a binary classification task:

```python
df[target].value_counts()
```

```
Occupied
0.0    15810
1.0     4750
Name: count, dtype: int64
```
```python
import plotly.express as px

px.histogram(df, x=target, nbins=5, height=350, title="Distribution of Occupancy")
```

It doesn’t look like the classes are prohibitively imbalanced.
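To quantify the balance, a quick sketch normalizing the value counts:

```python
# Express the class balance as proportions of the dataset
# (roughly 77% unoccupied vs 23% occupied).
df[target].value_counts(normalize=True)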
16.3.2 Relationships
Investigating the relationships between certain variables of interest, to start developing an intuition for the data.
Plotting the distribution of light, grouped by occupancy status:

```python
px.histogram(df, x="Light", nbins=7, height=350,
             facet_col=target, color=target
)
```

What can we learn about the relationship between light and occupancy?
Plotting the distribution of temperature, grouped by occupancy status:
px.histogram(df, x="Temperature", nbins=7, height=350,
facet_col=target, color=target
)What can we learn about the relationship between temperature and occupancy?
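To put numbers behind these visual impressions, a quick sketch using a pandas groupby:

```python
# Summary statistics for Light and Temperature, grouped by occupancy status,
# to quantify the differences the histograms suggest.
df.groupby(target)[["Light", "Temperature"]].agg(["mean", "median"])
```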
16.3.3 Correlation
Helper function for plotting a correlation matrix as a heatmap:
```python
import plotly.express as px

def plot_correlation_matrix(df, method="pearson", height=450, showscale=True):
    """Params: method (str): "spearman" or "pearson". """
    cor_mat = df.corr(method=method, numeric_only=True)
    title = f"{method.title()} Correlation"
    fig = px.imshow(cor_mat,
                    height=height,
                    text_auto=".2f",  # round to two decimal places
                    color_continuous_scale="Blues",
                    color_continuous_midpoint=0,
                    labels={"x": "Variable", "y": "Variable"},
    )
    # center the title (h/t: https://stackoverflow.com/questions/64571789/)
    fig.update_layout(title={'text': title, 'x': 0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)
    fig.show()
```

Plotting correlation to gain a more formal understanding of the relationships:

```python
plot_correlation_matrix(df, method="spearman", height=450)
```

The feature that is most highly correlated with the target is Light. So we should probably include that feature in our model.
The features that are most highly correlated with each other are Humidity and HumidityRatio. After consulting the data dictionary in more detail, we realize that HumidityRatio was derived from Temperature and Humidity. So for our model, to alleviate collinearity concerns, we should either include one or more of the original features (Temperature and Humidity), or the derived feature (HumidityRatio), but not both.
```python
corr_target = df.corr(numeric_only=True)[target].sort_values(ascending=False)
corr_target
```

```
Occupied         1.000000
Light            0.914850
Temperature      0.555610
CO2              0.501582
HumidityRatio    0.257324
Humidity         0.046240
Name: Occupied, dtype: float64
```
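To check the collinearity concern more formally, a hedged sketch computing variance inflation factors (VIF), assuming statsmodels is installed; values above roughly 5 to 10 are commonly read as problematic:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.api import add_constant

# VIF measures how well each feature is explained by the other features;
# highly collinear features (like Humidity and HumidityRatio) get large values.
features_df = df[["Temperature", "Humidity", "HumidityRatio", "Light", "CO2"]]
exog = add_constant(features_df)  # VIF expects an intercept column
for i, name in enumerate(exog.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(exog.values, i), 2))
```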
16.4 X/Y Split
Choosing target and feature variables:

```python
target = "Occupied"
y = df[target].copy()
x = df.drop(columns=[target, "Date", "Humidity", "Temperature"]).copy()

print("X:", x.shape)
print("Y:", y.shape)
```

```
X: (20560, 3)
Y: (20560,)
```
```python
x.head()
```

| | Light | CO2 | HumidityRatio |
|---|---|---|---|
| 0 | 426.0 | 721.25 | 0.004793 |
| 1 | 429.5 | 714.00 | 0.004783 |
| 2 | 426.0 | 713.50 | 0.004779 |
| 3 | 426.0 | 708.25 | 0.004772 |
| 4 | 426.0 | 704.50 | 0.004757 |
16.5 Feature Scaling
Scaling the features, so that their values are expressed on a comparable scale (z-score standardization):
```python
x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)
x_scaled.head()
```

| | Light | CO2 | HumidityRatio |
|---|---|---|---|
| 0 | 1.403042 | 0.098639 | 0.735378 |
| 1 | 1.419675 | 0.075343 | 0.722945 |
| 2 | 1.403042 | 0.073736 | 0.717765 |
| 3 | 1.403042 | 0.056866 | 0.707405 |
| 4 | 1.403042 | 0.044816 | 0.688501 |
Verifying mean centering and unit variance:
```python
x_scaled.describe().T[["mean", "std"]]
```

| | mean | std |
|---|---|---|
| Light | 1.935330e-17 | 1.0 |
| CO2 | 2.432987e-16 | 1.0 |
| HumidityRatio | 6.082467e-16 | 1.0 |
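An equivalent sketch using scikit-learn, for reference. One caveat: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas .std() uses the sample standard deviation (ddof=1), so the results differ very slightly:

```python
from sklearn.preprocessing import StandardScaler
from pandas import DataFrame

# Fit the scaler on x and return a DataFrame with the original column names.
scaler = StandardScaler()
x_scaled_sk = DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
x_scaled_sk.head()
```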
16.6 Train Test Split
Splitting the data into training and test sets:
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)

print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```

```
TRAIN: (15420, 3) (15420,)
TEST: (5140, 3) (5140,)
```
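A hedged variant worth knowing: passing stratify=y preserves the roughly 77/23 class ratio in both splits, which matters more as classes become imbalanced:

```python
# Stratified split: train and test sets keep the same class proportions as y.
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, random_state=99, stratify=y
)
```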
16.7 Model Training
Training the model on the training data:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=99)
model.fit(x_train, y_train)
```

```
LogisticRegression(random_state=99)
```
Examining coefficients:
```python
from pandas import Series

coef = Series(model.coef_[0], index=x_train.columns)
coef.sort_values(ascending=False)
```

```
Light            4.611785
CO2              0.944159
HumidityRatio    0.231324
dtype: float64
```
What can we learn from these coefficients?
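One way to make the coefficients more interpretable: exponentiating them yields odds ratios, and because the features were standardized, each ratio is per one-standard-deviation increase. A sketch:

```python
from numpy import exp

# A one-standard-deviation increase in a feature multiplies the odds of
# occupancy by this factor, holding the other features constant.
odds_ratios = exp(coef).sort_values(ascending=False)
print(odds_ratios)
```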
16.8 Model Evaluation
Predicting values for the unseen test data:
```python
y_pred = model.predict(x_test)
```

Evaluating the model’s performance using the standard classification metrics:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
```

```
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140
```
Computing the ROC-AUC score as an additional classification metric:
```python
from sklearn.metrics import roc_auc_score

print("ROC-AUC:", roc_auc_score(y_test, y_pred).round(3))
```

```
ROC-AUC: 0.992
```
Alright, it looks like our model is doing really well overall!
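One refinement worth noting: ROC-AUC is conventionally computed from predicted probabilities rather than hard 0/1 predictions, which gives a threshold-independent score. A sketch (output not shown):

```python
from sklearn.metrics import roc_auc_score

# Use the predicted probability of the positive class (column 1)
# instead of the hard class labels.
y_pred_proba = model.predict_proba(x_test)[:, 1]
print("ROC-AUC (probabilities):", roc_auc_score(y_test, y_pred_proba).round(3))
```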
16.8.1 Confusion Matrix
Helper function for plotting a confusion matrix as a heatmap:
```python
from sklearn.metrics import confusion_matrix
import plotly.express as px

def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Confusion matrix whose i-th row and j-th column
    # ... indicates the number of samples with
    # ... true label being i-th class (ROW)
    # ... and predicted label being j-th class (COLUMN)
    class_names = sorted(y_true.unique().tolist())
    cm = confusion_matrix(y_true, y_pred, labels=class_names)

    title = title or "Confusion Matrix"
    if subtitle:
        title += f"<br><sup>{subtitle}</sup>"

    fig = px.imshow(cm, x=class_names, y=class_names, height=height,
                    labels={"x": "Predicted", "y": "Actual"},
                    color_continuous_scale="Blues", text_auto=True,
    )
    fig.update_layout(title={'text': title, 'x': 0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)
    fig.show()
```

Examining predicted vs actual values:
```python
plot_confusion_matrix(y_test, y_pred, height=400)
```

Is one of the classes getting misclassified more than the other?
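One way to answer that question numerically, a sketch using scikit-learn’s normalize option:

```python
from sklearn.metrics import confusion_matrix

# Row-normalize the confusion matrix: each row sums to 1, so the diagonal
# shows per-class recall and the off-diagonal entries show error rates.
cm_normalized = confusion_matrix(y_test, y_pred, normalize="true")
print(cm_normalized.round(3))
```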
16.9 Complexity vs Performance
Now that we have an idea for the performance of a model that uses all available features, let’s try different combinations of features to see if there is a simpler model that performs almost as well.
Helper function for training and evaluating a model given a list of features:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

def train_eval_logistic(df, target="Occupied", features=None):
    if not features:
        # default to all numeric features (the Date column can't be scaled)
        features = df.drop(columns=[target]).select_dtypes("number").columns.tolist()
    print("FEATURES:", features)

    x = df[features].copy()
    print("X:", x.shape)

    y = df[target].copy()
    print("Y:", y.shape)

    # SCALING:
    x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)

    # TRAIN / TEST SPLIT:
    x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)

    # MODEL TRAINING:
    model = LogisticRegression(random_state=99)
    model.fit(x_train, y_train)

    # PREDICTIONS AND EVALUATION:
    y_pred = model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_pred).round(3))
```

Model performance using all original features:
train_eval_logistic(df, features=["Light", "Temperature", "Humidity", "CO2"])FEATURES: ['Light', 'Temperature', 'Humidity', 'CO2']
X: (20560, 4)
Y: (20560,)
precision recall f1-score support
0.0 1.00 0.99 0.99 3933
1.0 0.96 1.00 0.98 1207
accuracy 0.99 5140
macro avg 0.98 0.99 0.99 5140
weighted avg 0.99 0.99 0.99 5140
ROC-AUC: 0.993
Model performance using the derived feature:
train_eval_logistic(df, features=["Light", "HumidityRatio", "CO2"])FEATURES: ['Light', 'HumidityRatio', 'CO2']
X: (20560, 3)
Y: (20560,)
precision recall f1-score support
0.0 1.00 0.99 0.99 3933
1.0 0.96 1.00 0.98 1207
accuracy 0.99 5140
macro avg 0.98 0.99 0.98 5140
weighted avg 0.99 0.99 0.99 5140
ROC-AUC: 0.992
Models using just a single feature:
train_eval_logistic(df, features=["Light"])FEATURES: ['Light']
X: (20560, 1)
Y: (20560,)
precision recall f1-score support
0.0 1.00 0.98 0.99 3933
1.0 0.95 1.00 0.97 1207
accuracy 0.99 5140
macro avg 0.97 0.99 0.98 5140
weighted avg 0.99 0.99 0.99 5140
ROC-AUC: 0.991
train_eval_logistic(df, features=["HumidityRatio"])FEATURES: ['HumidityRatio']
X: (20560, 1)
Y: (20560,)
precision recall f1-score support
0.0 0.77 1.00 0.87 3933
1.0 0.86 0.05 0.10 1207
accuracy 0.78 5140
macro avg 0.82 0.52 0.48 5140
weighted avg 0.79 0.78 0.69 5140
ROC-AUC: 0.524
train_eval_logistic(df, features=["CO2"])FEATURES: ['CO2']
X: (20560, 1)
Y: (20560,)
precision recall f1-score support
0.0 0.81 0.93 0.87 3933
1.0 0.57 0.30 0.40 1207
accuracy 0.78 5140
macro avg 0.69 0.62 0.63 5140
weighted avg 0.76 0.78 0.76 5140
ROC-AUC: 0.616
Wow, it looks like Light is a strong predictor all by itself!
Models using two features (Light and something else):
train_eval_logistic(df, features=["Light", "HumidityRatio"])FEATURES: ['Light', 'HumidityRatio']
X: (20560, 2)
Y: (20560,)
precision recall f1-score support
0.0 1.00 0.98 0.99 3933
1.0 0.95 1.00 0.97 1207
accuracy 0.99 5140
macro avg 0.98 0.99 0.98 5140
weighted avg 0.99 0.99 0.99 5140
ROC-AUC: 0.992
train_eval_logistic(df, features=["Light", "CO2"])FEATURES: ['Light', 'CO2']
X: (20560, 2)
Y: (20560,)
precision recall f1-score support
0.0 1.00 0.99 0.99 3933
1.0 0.96 1.00 0.98 1207
accuracy 0.99 5140
macro avg 0.98 0.99 0.99 5140
weighted avg 0.99 0.99 0.99 5140
ROC-AUC: 0.992
Which features would you choose for your final model?
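If you would rather search systematically than hand-pick, here is a brute-force sketch that reuses the train_eval_logistic helper defined above (it prints a full report per subset, so expect verbose output):

```python
from itertools import combinations

# Enumerate every non-empty subset of the candidate features and evaluate
# each one. With 5 candidates this is only 31 models, so the exhaustive
# search stays cheap.
candidates = ["Light", "Temperature", "Humidity", "CO2", "HumidityRatio"]
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        train_eval_logistic(df, features=list(subset))
```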