16  Binary Classification

16.1 Data Loading

To illustrate binary classification, we’ll use the “Occupancy Detection” dataset from the UCI Machine Learning Repository.

Tip: Data Source

“Experimental data used for binary classification (room occupancy) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time stamped pictures that were taken every minute.”

Loading the dataset:

from ucimlrepo import fetch_ucirepo

ds = fetch_ucirepo(id=357)

Inspecting the variables:

ds.variables
   name           role     type        demographic  description                                units                 missing_values
0  id             ID       Integer     None         None                                       None                  no
1  date           Feature  Date        None         None                                       None                  no
2  Temperature    Feature  Integer     None         None                                       C                     no
3  Humidity       Feature  Continuous  None         None                                       %                     no
4  Light          Feature  Integer     None         None                                       Lux                   no
5  CO2            Feature  Continuous  None         None                                       ppm                   no
6  HumidityRatio  Feature  Continuous  None         None                                       kgwater-vapor/kg-air  no
7  Occupancy      Target   Binary     None         0 for not occupied, 1 for occupied status  None                  no
Tip: Data Dictionary

Variable Info (paraphrased from the UCI website):

  • Date: time in format of “year-month-day hour:minute:second”
  • Temperature: in Celsius
  • Humidity: relative humidity, as a percentage %
  • Light: in Lux
  • CO2: in ppm
  • HumidityRatio: derived quantity from temperature and relative humidity, in kgwater-vapor/kg-air
  • Occupancy: 0 or 1 (0 for not occupied, 1 for occupied)

Loading the data:

df = ds["data"]["original"].copy()
df.rename(columns={"date": "Date", "Occupancy": "Occupied"}, inplace=True)
df.drop(columns=["id"], inplace=True)
df.head()
                  Date Temperature Humidity Light     CO2        HumidityRatio Occupied
0  2015-02-04 17:51:00       23.18   27.272   426  721.25  0.00479298817650529      1.0
1  2015-02-04 17:51:59       23.15  27.2675 429.5     714  0.00478344094931065      1.0
2  2015-02-04 17:53:00       23.15   27.245   426   713.5  0.00477946352442199      1.0
3  2015-02-04 17:54:00       23.15     27.2   426  708.25  0.00477150882608175      1.0
4  2015-02-04 17:55:00        23.1     27.2   426   704.5  0.00475699293331518      1.0

16.2 Data Cleaning

16.2.1 Null Values

Checking for nulls:

df.isnull().sum()
Date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupied         2
dtype: int64

Dropping the two rows that have null values for the target variable:

print(len(df))

df.dropna(inplace=True)

print(len(df))
20562
20560
df.isnull().sum()
Date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupied         0
dtype: int64

16.2.2 Datatypes

Upon inspection, we learn that the numeric columns are currently stored as strings (represented by pandas as “object”):

df.dtypes
Date              object
Temperature       object
Humidity          object
Light             object
CO2               object
HumidityRatio     object
Occupied         float64
dtype: object

So we need to clean the data before moving on:

from pandas import to_numeric

numeric_features = ["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]
df[numeric_features] = df[numeric_features].apply(to_numeric)

df.dtypes
Date              object
Temperature      float64
Humidity         float64
Light            float64
CO2              float64
HumidityRatio    float64
Occupied         float64
dtype: object
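
Note that the Date column is still stored as an “object” (string). We won’t use it as a feature below, but if we wanted time-based features, we could parse it into a proper datetime (a sketch, assuming the format described in the data dictionary):

from pandas import to_datetime

# parse date strings (format per the data dictionary: "year-month-day hour:minute:second"):
df["Date"] = to_datetime(df["Date"], format="%Y-%m-%d %H:%M:%S")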

16.3 Data Exploration

16.3.1 Distribution of the Target

target = "Occupied"

As you can see, the target variable is binary, which makes this a binary classification task:

df[target].value_counts()
Occupied
0.0    15810
1.0     4750
Name: count, dtype: int64
import plotly.express as px

px.histogram(df, x=target, nbins=5, height=350, title="Distribution of Occupancy")

The classes are imbalanced (roughly three unoccupied samples for every occupied one), but not prohibitively so.
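
To quantify this, we can look at the class proportions (about 77% unoccupied vs 23% occupied, based on the counts above):

# normalized value counts show each class's share of the data:
df[target].value_counts(normalize=True)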

16.3.2 Relationships

Investigating the relationships between certain variables of interest, to start developing an intuition for the data.

Plotting the distribution of light, grouped by occupancy status:

px.histogram(df, x="Light", nbins=7, height=350,
             facet_col=target, color=target
            )

What can we learn about the relationship between light and occupancy?

Plotting the distribution of temperature, grouped by occupancy status:

px.histogram(df, x="Temperature", nbins=7, height=350,
             facet_col=target, color=target
            )

What can we learn about the relationship between temperature and occupancy?

16.3.3 Correlation

Helper function for plotting correlation matrix as a heatmap:

Code
import plotly.express as px

def plot_correlation_matrix(df, method="pearson", height=450, showscale=True):
    """Params: method (str): "spearman" or "pearson". """

    cor_mat = df.corr(method=method, numeric_only=True)

    title= f"{method.title()} Correlation"

    fig = px.imshow(cor_mat,
                    height=height, # title=title,
                    text_auto= ".2f", # round to two decimal places
                    color_continuous_scale="Blues",
                    color_continuous_midpoint=0,
                    labels={"x": "Variable", "y": "Variable"},
    )
    # center title (h/t: https://stackoverflow.com/questions/64571789/)
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()

Plotting correlation to gain a more formal understanding of the relationships:

plot_correlation_matrix(df, method="spearman", height=450)

The feature that is most highly correlated with the target is Light. So we should probably include that feature in our model.

The features that are most highly correlated with each other are Humidity and HumidityRatio. After consulting the data dictionary in more detail, we realize that HumidityRatio was derived from Temperature and Humidity. So for our model, to alleviate collinearity concerns, we should either include one or more of the original features (Temperature and Humidity), or the derived feature (HumidityRatio), but not both.
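
If we want to quantify these collinearity concerns, one common diagnostic is the variance inflation factor (VIF). Here is a minimal sketch, assuming statsmodels is installed (the feature list is just the numeric candidates above):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# add an intercept column, since VIF assumes a model with a constant term:
candidates = df[["Temperature", "Humidity", "HumidityRatio", "Light", "CO2"]].assign(const=1.0)

# a VIF well above ~5-10 suggests a feature is largely explained by the others:
for i, col in enumerate(candidates.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(candidates.values, i), 2))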

corr_target = df.corr(numeric_only=True)[target].sort_values(ascending=False)
corr_target
Occupied         1.000000
Light            0.914850
Temperature      0.555610
CO2              0.501582
HumidityRatio    0.257324
Humidity         0.046240
Name: Occupied, dtype: float64

16.4 X/Y Split

Choosing target and feature variables:

target = "Occupied"
y = df[target].copy()

x = df.drop(columns=[target, "Date", "Humidity", "Temperature"]).copy()
print("X:", x.shape)
print("Y:", y.shape)
X: (20560, 3)
Y: (20560,)
x.head()
   Light     CO2  HumidityRatio
0  426.0  721.25       0.004793
1  429.5  714.00       0.004783
2  426.0  713.50       0.004779
3  426.0  708.25       0.004772
4  426.0  704.50       0.004757

16.5 Feature Scaling

Scaling the features, to express their values on a similar scale:

x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)
x_scaled.head()
      Light       CO2  HumidityRatio
0  1.403042  0.098639       0.735378
1  1.419675  0.075343       0.722945
2  1.403042  0.073736       0.717765
3  1.403042  0.056866       0.707405
4  1.403042  0.044816       0.688501

Verifying mean centering and unit variance:

x_scaled.describe().T[["mean", "std"]]
                       mean  std
Light          1.935330e-17  1.0
CO2            2.432987e-16  1.0
HumidityRatio  6.082467e-16  1.0
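
Equivalently, we could use scikit-learn’s StandardScaler (a sketch; note that StandardScaler divides by the population standard deviation, ddof=0, while the pandas .std() used above defaults to ddof=1, so the values differ very slightly):

from sklearn.preprocessing import StandardScaler
from pandas import DataFrame

# fit and transform, keeping the result as a labeled DataFrame:
scaler = StandardScaler()
x_scaled_skl = DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
x_scaled_skl.head()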

16.6 Train Test Split

Splitting the data into training and test sets:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (15420, 3) (15420,)
TEST: (5140, 3) (5140,)
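
Since the classes are imbalanced, we could optionally stratify the split so the training and test sets preserve the class proportions (a variant of the split above; the results in this chapter use the unstratified split):

# stratify on y so train and test have a similar class balance:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99, stratify=y)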

16.7 Model Training

Training the model on the training data:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=99)
model.fit(x_train, y_train)
LogisticRegression(random_state=99)

Examining coefficients:

from pandas import Series

coef = Series(model.coef_[0], index=x_train.columns)
coef.sort_values(ascending=False)
Light            4.611785
CO2              0.944159
HumidityRatio    0.231324
dtype: float64

What can we learn from these coefficients?
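
Recall that in logistic regression each coefficient represents the change in the log-odds of the positive class per unit increase in that feature (here, per one standard deviation, since the features are scaled), holding the others constant. Exponentiating converts the coefficients to odds ratios (a sketch):

import numpy as np

# convert log-odds coefficients to odds ratios:
print(np.exp(coef).sort_values(ascending=False))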

16.8 Model Evaluation

Predicting values for the unseen test data:

y_pred = model.predict(x_test)

Evaluating the model’s performance using the standard classification metrics:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

Computing the ROC-AUC score as an additional classification metric:

from sklearn.metrics import roc_auc_score

print("ROC-AUC:", roc_auc_score(y_test, y_pred).round(3))
ROC-AUC: 0.992
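
Note that we passed hard class predictions to roc_auc_score. ROC-AUC is more commonly computed from predicted probabilities, which credits the model for how confident it is (a sketch):

# probability of the positive class is the second column of predict_proba:
y_proba = model.predict_proba(x_test)[:, 1]
print("ROC-AUC (from probabilities):", roc_auc_score(y_test, y_proba).round(3))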

Alright, it looks like our model is doing really well overall!

16.8.1 Confusion Matrix

Helper function for plotting a confusion matrix as a heatmap:

Code
from sklearn.metrics import confusion_matrix
import plotly.express as px

def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Confusion matrix whose i-th row and j-th column
    # ... indicates the number of samples with
    # ... true label being i-th class (ROW)
    # ... and predicted label being j-th class (COLUMN)
    # use the function parameters (not a global variable) to compute the matrix:
    class_names = sorted(y_true.unique().tolist())
    cm = confusion_matrix(y_true, y_pred, labels=class_names)

    title = title or "Confusion Matrix"
    if subtitle:
        title += f"<br><sup>{subtitle}</sup>"

    fig = px.imshow(cm, x=class_names, y=class_names, height=height,
                    labels={"x": "Predicted", "y": "Actual"},
                    color_continuous_scale="Blues", text_auto=True,
    )
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()

Examining predicted vs actual values:

plot_confusion_matrix(y_test, y_pred, height=400)

Is one of the classes getting mis-classified more than the other?
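
One way to compare error rates across classes directly is to row-normalize the confusion matrix, so each row shows the fraction of that true class assigned to each predicted label (a sketch using scikit-learn’s normalize parameter):

from sklearn.metrics import confusion_matrix

# each row sums to 1; off-diagonal entries are per-class error rates:
print(confusion_matrix(y_test, y_pred, normalize="true").round(3))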

16.9 Complexity vs Performance

Now that we have an idea for the performance of a model that uses all available features, let’s try different combinations of features to see if there is a simpler model that performs almost as well.

Helper function for training and evaluating a model given a list of features:

Code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from pandas import Series
from sklearn.metrics import classification_report, roc_auc_score

def train_eval_logistic(df, target="Occupied", features=None):
    # default to all columns except the target:
    if not features:
        features = df.drop(columns=[target]).columns.tolist()
    print("FEATURES:", features)

    x = df[features].copy()
    print("X:", x.shape)

    y = df[target].copy()
    print("Y:", y.shape)

    # SCALING:
    x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)

    # TRAIN / TEST SPLIT:
    x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)
    # MODEL TRAINING:
    model = LogisticRegression(random_state=99)
    model.fit(x_train, y_train)

    #print("COEFS:")
    #coef = Series(model.coef_[0], index=x_train.columns)
    #print(coef.sort_values(ascending=False))

    # PREDS AND EVAL:
    y_pred = model.predict(x_test)

    print(classification_report(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_pred).round(3))

Model performance using all original features:

train_eval_logistic(df, features=["Light", "Temperature", "Humidity", "CO2"])
FEATURES: ['Light', 'Temperature', 'Humidity', 'CO2']
X: (20560, 4)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.99      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.993

Model performance using the derived feature:

train_eval_logistic(df, features=["Light", "HumidityRatio", "CO2"])
FEATURES: ['Light', 'HumidityRatio', 'CO2']
X: (20560, 3)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.992

Models using just a single feature:

train_eval_logistic(df, features=["Light"])
FEATURES: ['Light']
X: (20560, 1)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99      3933
         1.0       0.95      1.00      0.97      1207

    accuracy                           0.99      5140
   macro avg       0.97      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.991
train_eval_logistic(df, features=["HumidityRatio"])
FEATURES: ['HumidityRatio']
X: (20560, 1)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       0.77      1.00      0.87      3933
         1.0       0.86      0.05      0.10      1207

    accuracy                           0.78      5140
   macro avg       0.82      0.52      0.48      5140
weighted avg       0.79      0.78      0.69      5140

ROC-AUC: 0.524
train_eval_logistic(df, features=["CO2"])
FEATURES: ['CO2']
X: (20560, 1)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       0.81      0.93      0.87      3933
         1.0       0.57      0.30      0.40      1207

    accuracy                           0.78      5140
   macro avg       0.69      0.62      0.63      5140
weighted avg       0.76      0.78      0.76      5140

ROC-AUC: 0.616

Wow, it looks like Light performs really well by itself!

Models using two features (Light and something else):

train_eval_logistic(df, features=["Light", "HumidityRatio"])
FEATURES: ['Light', 'HumidityRatio']
X: (20560, 2)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99      3933
         1.0       0.95      1.00      0.97      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.98      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.992
train_eval_logistic(df, features=["Light", "CO2"])
FEATURES: ['Light', 'CO2']
X: (20560, 2)
Y: (20560,)
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      3933
         1.0       0.96      1.00      0.98      1207

    accuracy                           0.99      5140
   macro avg       0.98      0.99      0.99      5140
weighted avg       0.99      0.99      0.99      5140

ROC-AUC: 0.992

Which features would you choose for your final model?
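
If you wanted to compare candidate feature sets more systematically, you could loop over them with the helper defined above (a sketch; these are just the combinations explored in this section):

feature_sets = [
    ["Light", "Temperature", "Humidity", "CO2"],
    ["Light", "HumidityRatio", "CO2"],
    ["Light", "CO2"],
    ["Light", "HumidityRatio"],
    ["Light"],
]
for features in feature_sets:
    train_eval_logistic(df, features=features)
    print("----------------------")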