17 Multiclass Classification

17.1 Data Loading

In this chapter we will provide an example of multiclass classification, using the “Palmer Penguins” dataset.

Data Source

“Size measurements, clutch observations, and blood isotope ratios for 344 adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research (LTER) Program.”

Our goal will be to predict the species of a penguin, given features such as their weight, size, and gender.

Loading the dataset:

from palmerpenguins import load_penguins

df = load_penguins()
df.rename(columns={"sex": "gender"}, inplace=True)
df["species"] = df["species"].str.upper()
df["island"] = df["island"].str.upper()
df["gender"] = df["gender"].str.upper()
df.drop(columns=["year"], inplace=True)
df.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	gender
0	ADELIE	TORGERSEN	39.1	18.7	181.0	3750.0	MALE
1	ADELIE	TORGERSEN	39.5	17.4	186.0	3800.0	FEMALE
2	ADELIE	TORGERSEN	40.3	18.0	195.0	3250.0	FEMALE
3	ADELIE	TORGERSEN	NaN	NaN	NaN	NaN	NaN
4	ADELIE	TORGERSEN	36.7	19.3	193.0	3450.0	FEMALE

As you can see, the target variable is categorical, which makes this a multiclass classification task:

df["species"].value_counts()

species
ADELIE       152
GENTOO       124
CHINSTRAP     68
Name: count, dtype: int64

17.1.1 Data Cleaning

17.1.2 Null Values

Checking for null values:

df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
gender               11
dtype: int64

Looks like there are some null values. Since the number of nulls is small, we can feel free to drop those rows from our analysis.

Dropping nulls:

print(len(df))

df.dropna(inplace=True)

print(len(df))

344
333

df.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
gender               0
dtype: int64

17.2 Data Exploration

Investigating features, such as the island:

df["island"].value_counts()

island
BISCOE       163
DREAM        123
TORGERSEN     47
Name: count, dtype: int64

What other variables are you interested in exploring?

17.2.1 Relationships

Let’s examine the weight of a penguin, by island, to see if we can notice any patterns in the data:

import plotly.express as px

px.scatter(df, x="body_mass_g", y="island", color="species",
  title="Body Mass by Penguin Species", height=300
)

It looks like island and weight might be suitable features, as they help differentiate between the various species. Specifically we can make the following observations:

Only one species of penguin lives on “Torgersen” island.
Two species of penguin live on “Biscoe” island, but these species are separable by their weight.
Two species of penguin live on “Dream” island, but the data aren’t separable by weight, so we might need additional features.

We are starting to develop an intuition that penguins on the “Dream” island might be the hardest to classify.

17.2.2 Pair Plots

Examining the relationships between each pair of variables:

from seaborn import pairplot

pairplot(df, hue="species", height=1.5) # height in inches

What relationships do you notice between specific pairs of variables? Which features could potentially help us separate or differentiate members of the different classes?

17.2.3 Correlation

Examining the correlation between each pair of variables, as a more formal measure of their relationships:

Code

from pandas import DataFrame
import plotly.express as px

def plot_correlation_matrix(df: DataFrame, method="pearson", height=450, showscale=True):
    """Params: method (str): "spearman" or "pearson".
    """

    cor_mat = df.corr(method=method, numeric_only=True)

    title= f"{method.title()} Correlation"

    fig = px.imshow(cor_mat,
                    height=height, # title=title,
                    text_auto= ".2f", # round to two decimal places
                    color_continuous_scale="Blues",
                    color_continuous_midpoint=0,
                    labels={"x": "Variable", "y": "Variable"},
    )
    # center title (h/t: https://stackoverflow.com/questions/64571789/)
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)
    fig.show()

plot_correlation_matrix(df, method="spearman", height=450)

What lessons are we learning about correlation of certain variables?

17.3 X/Y Split

Choosing the species as our target, and all the other variables as our features:

target = "species"

x = df.drop(columns=[target])
y = df[target]

print("X:", x.shape)
print("Y:", y.shape)

X: (333, 6)
Y: (333,)

17.4 Data Encoding

Since some features are categorical, we must encode them. Here we are using a one-hot-encoding approach:

from pandas import get_dummies as one_hot_encode

# since encoding transforms and removes the original columns,
# only try to encode if we haven't already done so:
if "island" in x.columns:
    x = one_hot_encode(x, columns=["island", "gender"], dtype=int)

print("X:", x.shape)
x.head()

X: (333, 9)

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	island_TORGERSEN	gender_FEMALE	gender_MALE
0	39.1	18.7	181.0	3750.0	1	0	1
1	39.5	17.4	186.0	3800.0	1	1	0
2	40.3	18.0	195.0	3250.0	1	1	0
4	36.7	19.3	193.0	3450.0	1	1	0
5	39.3	20.6	190.0	3650.0	1	0	1

Now that we have encoded the categorical features we can examine their correlation as well:

plot_correlation_matrix(x, method="spearman", height=550, showscale=False)

What lessons are we learning about the correlation of these additional variables?

17.5 Feature Scaling

Since we have multiple features, it may be helpful to scale them, to express their values using similar scales, and prevent some features from artificially overpowering the model. Here we are using a standard scaling approach:

x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)
x_scaled.head()

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	island_BISCOE	island_DREAM	island_TORGERSEN	gender_FEMALE	gender_MALE
0	-0.894695	0.779559	-1.424608	-0.567621	-0.977724	-0.76417	2.463094	-0.989542	0.989542
1	-0.821552	0.119404	-1.067867	-0.505525	-0.977724	-0.76417	2.463094	1.007534	-1.007534
2	-0.675264	0.424091	-0.425733	-1.188572	-0.977724	-0.76417	2.463094	1.007534	-1.007534
4	-1.333559	1.084246	-0.568429	-0.940192	-0.977724	-0.76417	2.463094	1.007534	-1.007534
5	-0.858123	1.744400	-0.782474	-0.691811	-0.977724	-0.76417	2.463094	-0.989542	0.989542

Verifying mean centering and unit variance:

x_scaled.describe().T[["mean", "std"]]

	mean	std
bill_length_mm	-9.388553e-16	1.0
bill_depth_mm	-1.237582e-15	1.0
flipper_length_mm	1.280257e-16	1.0
body_mass_g	-8.535048e-17	1.0
island_BISCOE	0.000000e+00	1.0
island_DREAM	6.401286e-17	1.0
island_TORGERSEN	0.000000e+00	1.0
gender_FEMALE	4.667604e-17	1.0
gender_MALE	6.534646e-17	1.0

17.6 Train/Test Split

Splitting the dataset into training and test sets:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=99)

print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)

TRAIN: (266, 9) (266,)
TEST: (67, 9) (67,)

17.7 Model Selection

Using a LogisticRegression model, which is the standard basic classification model we generally use for benchmarking purposes:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=99)

FYI - this model has some parameters we can tune later, if desired (see the LogisticRegression docs):

model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 99,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

17.8 Model Training

Training the model on the training data:

model.fit(x_train, y_train)

LogisticRegression(random_state=99)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Getting artifacts of the training process, specifically focusing on the coefficients:

model.coef_.shape

(3, 9)

Unlike binary classification in which there are a single set of coefficients, in multiclass classification there are a different set of coefficients for each of the classes present in our target variable. In other words, each feature may contribute differently to predicting members of the different species.

Wrapping the coefficients in a DataFrame, using corresponding column and index labels, allows us sort and compare them:

model.classes_

array(['ADELIE', 'CHINSTRAP', 'GENTOO'], dtype=object)

model.feature_names_in_

array(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm',
       'body_mass_g', 'island_BISCOE', 'island_DREAM', 'island_TORGERSEN',
       'gender_FEMALE', 'gender_MALE'], dtype=object)

from pandas import Series, DataFrame

coefs = DataFrame(model.coef_, columns=x_scaled.columns, index=model.classes_).T
coefs

	ADELIE	CHINSTRAP	GENTOO
bill_length_mm	-2.316749	1.877424	0.439324
bill_depth_mm	0.851392	0.149329	-1.000722
flipper_length_mm	-0.645044	-0.227886	0.872930
body_mass_g	-0.212658	-0.614652	0.827310
island_BISCOE	0.057675	-0.707674	0.649999
island_DREAM	-0.349307	0.905120	-0.555814
island_TORGERSEN	0.401405	-0.238630	-0.162775
gender_FEMALE	-0.394498	0.301477	0.093021
gender_MALE	0.394498	-0.301477	-0.093021

Examining the coefficients for each species:

coefs["ADELIE"].sort_values(ascending=False)

bill_depth_mm        0.851392
island_TORGERSEN     0.401405
gender_MALE          0.394498
island_BISCOE        0.057675
body_mass_g         -0.212658
island_DREAM        -0.349307
gender_FEMALE       -0.394498
flipper_length_mm   -0.645044
bill_length_mm      -2.316749
Name: ADELIE, dtype: float64

coefs["CHINSTRAP"].sort_values(ascending=False)

bill_length_mm       1.877424
island_DREAM         0.905120
gender_FEMALE        0.301477
bill_depth_mm        0.149329
flipper_length_mm   -0.227886
island_TORGERSEN    -0.238630
gender_MALE         -0.301477
body_mass_g         -0.614652
island_BISCOE       -0.707674
Name: CHINSTRAP, dtype: float64

coefs["GENTOO"].sort_values(ascending=False)

flipper_length_mm    0.872930
body_mass_g          0.827310
island_BISCOE        0.649999
bill_length_mm       0.439324
gender_FEMALE        0.093021
gender_MALE         -0.093021
island_TORGERSEN    -0.162775
island_DREAM        -0.555814
bill_depth_mm       -1.000722
Name: GENTOO, dtype: float64

How can we interpret these coefficients?

17.9 Model Evaluation

Now it’s time to evaluate the model.

Predicting values for the unseen test data:

y_pred = model.predict(x_test)

17.9.1 Classification Metrics

We are using our standard classification metrics, focusing on the classification report:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      ADELIE       1.00      0.97      0.99        35
   CHINSTRAP       0.91      1.00      0.95        10
      GENTOO       1.00      1.00      1.00        22

    accuracy                           0.99        67
   macro avg       0.97      0.99      0.98        67
weighted avg       0.99      0.99      0.99        67

How can we interpret these results? How did the model do overall? Are there any species that are easier or harder to classify than others?

17.9.2 Confusion Matrix

Using a confusion matrix to examine false positives and false negatives for each species:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred, labels=model.classes_))

[[34  1  0]
 [ 0 10  0]
 [ 0  0 22]]

Plotting the confusion matrix, for easier interpretation:

Code

from sklearn.metrics import confusion_matrix
import plotly.express as px

def plot_confusion_matrix(y_true, y_pred, height=450, showscale=False, title=None, subtitle=None):
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Confusion matrix whose i-th row and j-th column
    # ... indicates the number of samples with
    # ... true label being i-th class (ROW)
    # ... and predicted label being j-th class (COLUMN)
    cm = confusion_matrix(y_true, y_pred)

    class_names = sorted(y_test.unique().tolist())

    cm = confusion_matrix(y_test, y_pred, labels=class_names)

    title = title or "Confusion Matrix"
    if subtitle:
        title += f"<br><sup>{subtitle}</sup>"

    fig = px.imshow(cm, x=class_names, y=class_names, height=height,
                    labels={"x": "Predicted", "y": "Actual"},
                    color_continuous_scale="Blues", text_auto=True,
    )
    fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
    fig.update_coloraxes(showscale=showscale)

    fig.show()

title = "Penguin Species Classifier - Confusion Matrix"
plot_confusion_matrix(y_test, y_pred, title=title, height=400)

17.9.3 ROC-AUC

To calculate the ROC for multiclass classification, we need to use the underlying predicted probabilities (“logits”) for each class:

y_pred_proba = model.predict_proba(x_test)
print(y_pred_proba.shape)

(67, 3)

Under the hood, the model assigns a different probability for each class. For its final prediction, the model chooses the class that has the highest likelihood (i.e. the “argmax”) for each row:

logits_df = DataFrame(y_pred_proba, columns=model.classes_)
logits_df["FINAL_PREDICTION"] = y_pred
logits_df.head()

	ADELIE	CHINSTRAP	GENTOO	FINAL_PREDICTION
0	0.027770	0.969363	0.002867	CHINSTRAP
1	0.998400	0.000149	0.001451	ADELIE
2	0.004052	0.995307	0.000641	CHINSTRAP
3	0.020223	0.978623	0.001155	CHINSTRAP
4	0.005153	0.000942	0.993904	GENTOO

Calculating the ROC score, using the predicted probabilities:

from sklearn.metrics import roc_auc_score

def compute_roc_auc_score(y_test, y_pred_proba, is_multiclass=True):
    """NOTE: roc_auc_score uses average='macro' by default"""

    if is_multiclass:
        return roc_auc_score(y_true=y_test, y_score=y_pred_proba, multi_class="ovr")
    else:
        y_pred_proba_pos = y_pred_proba[:,1] # positive class (for binary classification)
        return roc_auc_score(y_true=y_test, y_score=y_pred_proba_pos)



compute_roc_auc_score(y_test, y_pred_proba)

1.0

Feel free to apply these same techniques any time you are performing classification on a target that has multiple classes.