16  Binary Classification

16.1 Data Loading

To illustrate binary classification, we’ll use the “Occupancy Detection” dataset from the UCI Machine Learning Repository.

Tip: Data Source

“Experimental data used for binary classification (room occupancy) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.”

Loading the dataset:

from ucimlrepo import fetch_ucirepo

ds = fetch_ucirepo(id=357)

Inspecting the variables:

ds.variables
   name           role     type        demographic  description                                units                 missing_values
0  id             ID       Integer     None         None                                       None                  no
1  date           Feature  Date        None         None                                       None                  no
2  Temperature    Feature  Integer     None         None                                       C                     no
3  Humidity       Feature  Continuous  None         None                                       %                     no
4  Light          Feature  Integer     None         None                                       Lux                   no
5  CO2            Feature  Continuous  None         None                                       ppm                   no
6  HumidityRatio  Feature  Continuous  None         None                                       kgwater-vapor/kg-air  no
7  Occupancy      Target   Binary     None         0 for not occupied, 1 for occupied status  None                  no
Tip: Data Dictionary

Variable Info (paraphrased from the UCI website):

  • Date: time in format of “year-month-day hour:minute:second”
  • Temperature: in Celsius
  • Humidity: relative humidity, as a percentage %
  • Light: in Lux
  • CO2: in ppm
  • HumidityRatio: derived quantity from temperature and relative humidity, in kgwater-vapor/kg-air
  • Occupancy: 0 or 1 (0 for not occupied, 1 for occupied)

Loading the data:

df = ds["data"]["original"].copy()
df.rename(columns={"date": "Date", "Occupancy": "Occupied"}, inplace=True)
df.drop(columns=["id"], inplace=True)
df.head()
                  Date Temperature Humidity Light     CO2        HumidityRatio Occupied
0  2015-02-04 17:51:00       23.18   27.272   426  721.25  0.00479298817650529      1.0
1  2015-02-04 17:51:59       23.15  27.2675 429.5     714  0.00478344094931065      1.0
2  2015-02-04 17:53:00       23.15   27.245   426   713.5  0.00477946352442199      1.0
3  2015-02-04 17:54:00       23.15     27.2   426  708.25  0.00477150882608175      1.0
4  2015-02-04 17:55:00        23.1     27.2   426   704.5  0.00475699293331518      1.0

16.2 Data Cleaning

16.2.1 Null Values

Checking for nulls:

df.isnull().sum()
Date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupied         2
dtype: int64

Dropping the two rows that have null values for the target variable:

print(len(df))

df.dropna(inplace=True)

print(len(df))
20562
20560
df.isnull().sum()
Date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupied         0
dtype: int64

16.2.2 Datatypes

Upon inspection, we learn that the numeric columns are currently stored as strings (represented by pandas as “object”):

df.dtypes
Date              object
Temperature       object
Humidity          object
Light             object
CO2               object
HumidityRatio     object
Occupied         float64
dtype: object

So we need to clean the data before moving on:

from pandas import to_numeric

numeric_features = ["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]
df[numeric_features] = df[numeric_features].apply(to_numeric)

df.dtypes
Date              object
Temperature      float64
Humidity         float64
Light            float64
CO2              float64
HumidityRatio    float64
Occupied         float64
dtype: object
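
Note that the Date column is still stored as an “object” (string). We won’t use it as a feature below, but if we wanted time-based features, we could parse it into a proper datetime (a sketch, assuming the format described in the data dictionary):

from pandas import to_datetime

# parse date strings (format per the data dictionary: "year-month-day hour:minute:second"):
df["Date"] = to_datetime(df["Date"], format="%Y-%m-%d %H:%M:%S")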

16.3 Data Exploration

16.3.1 Distribution of the Target

target = "Occupied"

As you can see, the target variable is binary, which makes this a binary classification task:

df[target].value_counts()
Occupied
0.0    15810
1.0     4750
Name: count, dtype: int64
import plotly.express as px

px.histogram(df, x=target, nbins=5, height=350, title="Distribution of Occupancy")

The classes are imbalanced (roughly three unoccupied samples for every occupied one), but not prohibitively so.
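
To quantify this, we can look at the class proportions (about 77% unoccupied vs 23% occupied, based on the counts above):

# normalized value counts show each class's share of the data:
df[target].value_counts(normalize=True)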

16.3.2 Relationships

Investigating the relationships between certain variables of interest, to start developing an intuition for the data.

Plotting the distribution of light, grouped by occupancy status:

px.histogram(df, x="Light", nbins=7, height=350,
             facet_col=target, color=target
            )

What can we learn about the relationship between light and occupancy?

Plotting the distribution of temperature, grouped by occupancy status:

px.histogram(df, x="Temperature", nbins=7, height=350,
             facet_col=target, color=target
            )

What can we learn about the relationship between temperature and occupancy?

16.3.3 Correlation

Helper function for plotting correlation matrix as a heatmap:

Code
import plotly.express as px

def plot_correlation_matrix(df, method="pearson", height=450, showscale=True):
    """Params: method (str): "spearman" or "pearson". """

    cor_mat = df.corr(method=method, numeric_only=True)

    title= f"{method.title()} Correlation"

    fig = px.imshow(cor_mat,
                    height=height, # title=title,
                    text_auto= ".2f", # round to two decimal places
                    color_continuous_scale="Blues",
                    color_continuous_midpoint=0,
                    labels={"x": "Variable", "y": "Variable"},
    )
    # center title (h/t: https://stackoverflow.com/questions/64571789/)
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()

Plotting correlation to gain a more formal understanding of the relationships:

plot_correlation_matrix(df, method="spearman", height=450)

The feature that is most highly correlated with the target is Light. So we should probably include that feature in our model.

The features that are most highly correlated with each other are Humidity and HumidityRatio. After consulting the data dictionary in more detail, we realize that HumidityRatio was derived from Temperature and Humidity. So for our model, to alleviate collinearity concerns, we should either include one or more of the original features (Temperature and Humidity), or the derived feature (HumidityRatio), but not both.
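
If we want to quantify these collinearity concerns, one common diagnostic is the variance inflation factor (VIF). Here is a minimal sketch, assuming statsmodels is installed (the feature list is just the numeric candidates above):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# add an intercept column, since VIF assumes a model with a constant term:
candidates = df[["Temperature", "Humidity", "HumidityRatio", "Light", "CO2"]].assign(const=1.0)

# a VIF well above ~5-10 suggests a feature is largely explained by the others:
for i, col in enumerate(candidates.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(candidates.values, i), 2))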

corr_target = df.corr(numeric_only=True)[target].sort_values(ascending=False)
corr_target
Occupied         1.000000
Light            0.914850
Temperature      0.555610
CO2              0.501582
HumidityRatio    0.257324
Humidity         0.046240
Name: Occupied, dtype: float64

16.4 X/Y Split

Choosing target and feature variables:

target = "Occupied"
y = df[target].copy()

x = df.drop(columns=[target, "Date", "Humidity", "Temperature"]).copy()
print("X:", x.shape)
print("Y:", y.shape)
X: (20560, 3)
Y: (20560,)
x.head()
   Light     CO2  HumidityRatio
0  426.0  721.25       0.004793
1  429.5  714.00       0.004783
2  426.0  713.50       0.004779
3  426.0  708.25       0.004772
4  426.0  704.50       0.004757

16.5 Feature Scaling

Scaling the features, to express their values on a similar scale:

x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)
x_scaled.head()
      Light       CO2  HumidityRatio
0  1.403042  0.098639       0.735378
1  1.419675  0.075343       0.722945
2  1.403042  0.073736       0.717765
3  1.403042  0.056866       0.707405
4  1.403042  0.044816       0.688501

Verifying mean centering and unit variance:

x_scaled.describe().T[["mean", "std"]]
                       mean  std
Light          1.935330e-17  1.0
CO2            2.432987e-16  1.0
HumidityRatio  6.082467e-16  1.0
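
Equivalently, we could use scikit-learn’s StandardScaler (a sketch; note that StandardScaler divides by the population standard deviation, ddof=0, while the pandas .std() used above defaults to ddof=1, so the values differ very slightly):

from sklearn.preprocessing import StandardScaler
from pandas import DataFrame

# fit and transform, keeping the result as a labeled DataFrame:
scaler = StandardScaler()
x_scaled_skl = DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
x_scaled_skl.head()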

16.6 Train Test Split

Splitting the data into training and test sets:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (15420, 3) (15420,)
TEST: (5140, 3) (5140,)
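
Since the classes are imbalanced, we could optionally stratify the split so the training and test sets preserve the class proportions (a variant of the split above; the results in this chapter use the unstratified split):

# stratify on y so train and test have a similar class balance:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99, stratify=y)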

16.7 Model Training

Training the model on the training data:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=99)
model.fit(x_train, y_train)
LogisticRegression(random_state=99)

Examining coefficients:

from pandas import Series

coef = Series(model.coef_[0], index=x_train.columns)
coef.sort_values(ascending=False)
Light            4.611785
CO2              0.944159
HumidityRatio    0.231324
dtype: float64

What can we learn from these coefficients?
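
Recall that in logistic regression each coefficient represents the change in the log-odds of the positive class per unit increase in that feature (here, per one standard deviation, since the features are scaled), holding the others constant. Exponentiating converts the coefficients to odds ratios (a sketch):

import numpy as np

# convert log-odds coefficients to odds ratios:
print(np.exp(coef).sort_values(ascending=False))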

16.8 Model Evaluation

Predicting values for the unseen test data:

y_pred = model.predict(x_test)

Evaluating the model’s performance using the standard classification metrics:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

Computing the ROC-AUC score as an additional classification metric:

from sklearn.metrics import roc_auc_score

print("ROC-AUC:", roc_auc_score(y_test, y_pred).round(3))
ROC-AUC: 0.992
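
Note that we passed hard class predictions to roc_auc_score. ROC-AUC is more commonly computed from predicted probabilities, which credits the model for how confident it is (a sketch):

# probability of the positive class is the second column of predict_proba:
y_proba = model.predict_proba(x_test)[:, 1]
print("ROC-AUC (from probabilities):", roc_auc_score(y_test, y_proba).round(3))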

Alright, it looks like our model is doing really well overall!

16.8.1 Confusion Matrix

Helper function for plotting a confusion matrix as a heatmap:

Code
from sklearn.metrics import confusion_matrix
import plotly.express as px

def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Confusion matrix whose i-th row and j-th column
    # ... indicates the number of samples with
    # ... true label being i-th class (ROW)
    # ... and predicted label being j-th class (COLUMN)
    # use the function parameters (not a global variable) to compute the matrix:
    class_names = sorted(y_true.unique().tolist())
    cm = confusion_matrix(y_true, y_pred, labels=class_names)

    title = title or "Confusion Matrix"
    if subtitle:
        title += f"<br><sup>{subtitle}</sup>"

    fig = px.imshow(cm, x=class_names, y=class_names, height=height,
                    labels={"x": "Predicted", "y": "Actual"},
                    color_continuous_scale="Blues", text_auto=True,
    )
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()

Examining predicted vs actual values:

plot_confusion_matrix(y_test, y_pred, height=400)

Is one of the classes getting mis-classified more than the other?
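
One way to compare error rates across classes directly is to row-normalize the confusion matrix, so each row shows the fraction of that true class assigned to each predicted label (a sketch using scikit-learn’s normalize parameter):

from sklearn.metrics import confusion_matrix

# each row sums to 1; off-diagonal entries are per-class error rates:
print(confusion_matrix(y_test, y_pred, normalize="true").round(3))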

16.9 Complexity vs Performance

Now that we have an idea for the performance of a model that uses all available features, let’s try different combinations of features to see if there is a simpler model that performs almost as well.

Helper function for training and evaluating a model given a list of features:

Code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from pandas import Series
from sklearn.metrics import classification_report, roc_auc_score

def train_eval_logistic(df, target="Occupied", features=None):
    # default to all columns except the target:
    if not features:
        features = df.drop(columns=[target]).columns.tolist()
    print("FEATURES:", features)

    x = df[features].copy()
    print("X:", x.shape)

    y = df[target].copy()
    print("Y:", y.shape)

    # SCALING:
    x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)

    # TRAIN / TEST SPLIT:
    x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)
    # MODEL TRAINING:
    model = LogisticRegression(random_state=99)
    model.fit(x_train, y_train)

    #print("COEFS:")
    #coef = Series(model.coef_[0], index=x_train.columns)
    #print(coef.sort_values(ascending=False))

    # PREDS AND EVAL:
    y_pred = model.predict(x_test)

    print(classification_report(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_pred).round(3))

Model performance using all original features:

train_eval_logistic(df, features=["Light", "Temperature", "Humidity", "CO2"])
FEATURES: ['Light', 'Temperature', 'Humidity', 'CO2']
X: (20560, 4)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.99      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.993

Model performance using the derived feature:

train_eval_logistic(df, features=["Light", "HumidityRatio", "CO2"])
FEATURES: ['Light', 'HumidityRatio', 'CO2']
X: (20560, 3)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.992

Models using just a single feature:

train_eval_logistic(df, features=["Light"])
FEATURES: ['Light']
X: (20560, 1)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99      3933
         1.0       0.95      1.00      0.97      1207

    accuracy                           0.99      5140
   macro avg       0.97      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.991
train_eval_logistic(df, features=["HumidityRatio"])
FEATURES: ['HumidityRatio']
X: (20560, 1)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       0.77      1.00      0.87      3933
         1.0       0.86      0.05      0.10      1207

    accuracy                           0.78      5140
   macro avg       0.82      0.52      0.48      5140
weighted avg       0.79      0.78      0.69      5140

ROC-AUC: 0.524
train_eval_logistic(df, features=["CO2"])
FEATURES: ['CO2']
X: (20560, 1)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       0.81      0.93      0.87      3933
         1.0       0.57      0.30      0.40      1207

    accuracy                           0.78      5140
   macro avg       0.69      0.62      0.63      5140
weighted avg       0.76      0.78      0.76      5140

ROC-AUC: 0.616

Wow, it looks like Light performs really well by itself!

Models using two features (Light and something else):

train_eval_logistic(df, features=["Light", "HumidityRatio"])
FEATURES: ['Light', 'HumidityRatio']
X: (20560, 2)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99      3933
         1.0       0.95      1.00      0.97      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.992
train_eval_logistic(df, features=["Light", "CO2"])
FEATURES: ['Light', 'CO2']
X: (20560, 2)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.99      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.992

Which features would you choose for your final model?
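
If you wanted to compare candidate feature sets more systematically, you could loop over them with the helper defined above (a sketch; these are just the combinations explored in this section):

feature_sets = [
    ["Light", "Temperature", "Humidity", "CO2"],
    ["Light", "HumidityRatio", "CO2"],
    ["Light", "CO2"],
    ["Light", "HumidityRatio"],
    ["Light"],
]
for features in feature_sets:
    train_eval_logistic(df, features=features)
    print("----------------------")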