Regression with Multiple Features

So far we have covered an example of regression using a single feature variable to predict the target variable.

However, in practice it is common to use multiple features, each of which may contribute to improving the model’s performance.

Considerations

When working with multiple features, there is a trade-off between model performance and model complexity. A model with billions of features, and consequently billions of parameters, can be slower to train and may lead to increased storage and computational costs when deployed. In many cases, a simpler model with fewer features that performs nearly as well can be preferable, especially if it offers faster training, lower deployment costs, and improved interpretability. This trade-off between model complexity and performance should be evaluated based on the specific requirements of the use case, such as the need for speed, scalability, or accuracy.

As previously discussed, one consideration when using multiple features is the potential need to perform data scaling, to standardize the scale of all the features and ensure features with large values aren’t dominating the model. For ordinary (unregularized) linear regression specifically, however, scaling is less critical, since the fitted coefficients simply adjust to each feature’s scale.

Another important factor to keep in mind when using multiple features is the concept of collinearity, which occurs when two or more predictor variables are highly correlated with each other. This can lead to redundancy, where the correlated predictors provide overlapping information about the target variable. Collinearity can cause problems in regression models, such as unstable coefficients, difficulty in interpreting results, and increased sensitivity to small changes in the data. So we should examine the relationships between features before choosing which features to include in the final model.
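
To make this concrete, here is a minimal sketch (using synthetic data, not the housing dataset below) of how a nearly duplicated feature can destabilize the fitted coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=1_000)
x2 = x1 + rng.normal(scale=0.01, size=1_000) # nearly an exact copy of x1 (highly collinear)
y = 3 * x1 + rng.normal(scale=0.5, size=1_000)

model = LinearRegression().fit(np.column_stack([x1, x2]), y)
print(model.coef_) # individual coefficients can drift well away from (3, 0), though their sum stays near 3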

Data Loading

For an example regression dataset that has multiple features, let’s consider this dataset of California housing prices, available from the sklearn.datasets sub-module:

from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing(as_frame=True)
print(type(dataset))
print(dataset.keys())
<class 'sklearn.utils._bunch.Bunch'>
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])
print(dataset.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
Data Source

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). A household is a group of people residing within a home.

After reading the dataset description, we see that features like latitude, longitude, population, and income describe the census block group, whereas age, rooms, bedrooms, occupants, and value describe the homes in that block group.

Our goal is to use the features to predict a target of home value.

Accessing the data, and renaming and reordering the columns for convenience:

df = dataset.frame
# rename columns:
df.rename(columns={
    "MedInc": "income", # median income in block group (in)
    "HouseAge": "age", # median house age in block group
    "AveRooms": "rooms", # average number of rooms per household
    "AveBedrms": "bedrooms", # average number of bedrooms per household
    "Population": "population", # block group population
    "AveOccup": "occupants",  # average number of household members
    "Latitude": "latitude", # block group latitude
    "Longitude": "longitude", # block group longitude
    "MedHouseVal": "value" # median house value (in $100K)
}, inplace=True)
# reorder columns :
df = df[["latitude", "longitude", "population", "income", "age", "rooms", "bedrooms", "occupants", "value"]]
df.head()
latitude longitude population income age rooms bedrooms occupants value
0 37.88 -122.23 322.0 8.3252 41.0 6.984127 1.023810 2.555556 4.526
1 37.86 -122.22 2401.0 8.3014 21.0 6.238137 0.971880 2.109842 3.585
2 37.85 -122.24 496.0 7.2574 52.0 8.288136 1.073446 2.802260 3.521
3 37.85 -122.25 558.0 5.6431 52.0 5.817352 1.073059 2.547945 3.413
4 37.85 -122.25 565.0 3.8462 52.0 6.281853 1.081081 2.181467 3.422

Data Exploration

Distributions

Examining the distribution of the target variable:

import plotly.express as px

px.violin(df, x="value", #points="all",
          box=True, height=350,
          title="Distribution of Housing Prices",
          labels = {"value": "Median Housing Price"}
)
px.histogram(df, x="value", height=350,
            title="Distribution of Housing Prices",
            labels = {"value": "Median Housing Price"}
)

It appears there are some outlier homes at the very expensive end, which we could possibly consider dropping.

df.sort_values(by="value", ascending=False).head(5)
latitude longitude population income age rooms bedrooms occupants value
10667 33.55 -117.88 772.0 10.1275 27.0 7.493421 1.039474 2.539474 5.00001
16916 37.57 -122.34 967.0 7.0422 52.0 7.045455 1.090909 2.585561 5.00001
16946 37.55 -122.33 827.0 6.1349 33.0 6.893417 0.978056 2.592476 5.00001
8877 34.04 -118.49 859.0 8.0257 48.0 7.781046 1.127451 2.807190 5.00001
8878 34.04 -118.50 1143.0 15.0001 52.0 8.000000 0.997333 3.048000 5.00001
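
All of these top values are exactly 5.00001, which suggests the target was capped at $500,000 when the dataset was assembled. If we did decide to drop the capped rows, a simple filter would do it (a sketch; the threshold is a judgment call, and we keep the full dataset for the rest of this example):

df_uncapped = df[df["value"] < 5.0] # drop rows where the target appears to be capped
print(len(df), "->", len(df_uncapped))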

Relationships

Let’s examine the relationships between variables, to start to build an intuition about which features may be related to the target.

Examining the relationship between average income and median house price:

px.scatter(df, x="income", y="value", height=450, #width=650,
    title="Median Housing Price by Average Income",
    trendline="ols", trendline_color_override="green",
    color="value", color_continuous_scale=px.colors.sequential.YlGn,
)

Examining the relationship between geographic area (latitude and longitude) and the median housing price:

fig = px.scatter_mapbox(df, lat="latitude", lon="longitude",
    title="Median Housing Price by Lat/Long",
    mapbox_style="open-street-map",
    zoom=4, height=550, width=650,
    color="value", color_continuous_scale=px.colors.sequential.YlGn,
)
fig.show(config={'scrollZoom': True})
Interactive dataviz

Zoom and pan the map to find the areas with the most expensive homes.

We see the most expensive homes are on the coast. So we can consider using latitude and longitude as features in our model.

Pair Plots

One way to visualize the relationships between each combination of variables is using the pairplot function from seaborn; however, in practice this can take a long time to finish.

from seaborn import pairplot

# using all the data (might take a long time):
#pairplot(df, hue="value")

# taking a sample of rows and fewer columns helps the plot complete faster:
df_sample = df.sample(100, random_state=99)
pairplot(df_sample[["rooms","income", "population", "value"]], height=1.3)

Correlation

Let’s examine the correlation between the target and each of the features, as well as between each pair of features:

cor_mat = df.corr(method="spearman") # numeric_only=True

title = "Spearman Correlation between Variables in Housing Dataset"
fig = px.imshow(cor_mat, height=600, text_auto= ".2f",
                color_continuous_scale="Blues", color_continuous_midpoint=0,
                labels={"x": "Variable", "y": "Variable"}
)
fig.update_layout(title={'text': title, 'x':0.485, 'xanchor': 'center'})
fig.show()

It looks like the strongest correlation with the target (median home value) comes from median income. So we will probably want to keep income as a feature in our model.
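
To read these values directly rather than off the heatmap, we can also sort each feature’s correlation with the target (a quick sketch reusing the cor_mat matrix computed above):

print(cor_mat["value"].drop("value").sort_values(ascending=False))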

There is also high correlation between rooms and income, which makes sense if there are larger houses in areas of higher income. Because these features are highly correlated with each other, we can consider only using one of them in our model, to address collinearity concerns.
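
Pairwise correlations only catch redundancy between two variables at a time. As an optional additional check (a sketch, assuming the statsmodels package is installed), variance inflation factors quantify how well each feature is explained by all of the other features; values above roughly 5 to 10 are a common warning sign of collinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

exog = add_constant(df.drop(columns=["value"])) # add an intercept term for the auxiliary regressions
vifs = {col: round(variance_inflation_factor(exog.values, i), 2)
        for i, col in enumerate(exog.columns) if col != "const"}
print(vifs)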

X/Y Split

x = df.drop(columns=["value"])
y = df["value"]
print("X:", x.shape)
print("Y:", y.shape)
X: (20640, 8)
Y: (20640,)

Data Scaling

x_scaled = (x - x.mean()) / x.std()
print("--------")
print("SCALED MEANS:")
print(x_scaled.mean())
print("--------")
print("SCALED STDEV:")
print(x_scaled.std())
--------
SCALED MEANS:
latitude     -9.914550e-16
longitude    -8.570577e-15
population   -1.927829e-17
income        7.711317e-17
age           3.855658e-17
rooms         7.711317e-17
bedrooms     -1.046536e-16
occupants     6.885104e-19
dtype: float64
--------
SCALED STDEV:
latitude      1.0
longitude     1.0
population    1.0
income        1.0
age           1.0
rooms         1.0
bedrooms      1.0
occupants     1.0
dtype: float64
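
The same standardization can also be done with scikit-learn’s StandardScaler (a sketch; it uses the population standard deviation, so the numbers differ negligibly from the pandas version above):

from pandas import DataFrame
from sklearn.preprocessing import StandardScaler

# note: in a stricter workflow, the scaler would be fit on the training split only, to avoid leakage
scaler = StandardScaler()
x_scaled_sk = DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
print(x_scaled_sk.describe().loc[["mean", "std"]].round(3))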

Train/Test Split

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (15480, 8) (15480,)
TEST: (5160, 8) (5160,)

Model Training

Training a linear regression model on the training data:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()

Examining the coefficients:

from pandas import Series

coefs = Series(model.coef_, index=x.columns)
coefs.name = "Housing Regression Model Coefficients"
coefs.sort_values(ascending=False)
income        0.819154
bedrooms      0.276513
age           0.123777
population   -0.001711
occupants    -0.035253
rooms        -0.240811
longitude    -0.878116
latitude     -0.908597
Name: Housing Regression Model Coefficients, dtype: float64

We see the coefficients with the highest magnitude are income (around positive 0.8), and latitude and longitude (each around negative 0.9). These features contribute the most to explaining the target.
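
Because the features were standardized, the coefficients are on a comparable scale, so we can visualize their relative magnitudes with a quick bar chart (a sketch using the same plotly express conventions as the charts above; coefs_df is just a helper name):

coefs_df = coefs.sort_values().reset_index()
coefs_df.columns = ["feature", "coefficient"]
px.bar(coefs_df, x="coefficient", y="feature", orientation="h", height=350,
       title="Housing Regression Model Coefficients")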

Training metrics:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def regression_report(y_true, y_pred):
    """Displays regression metrics given predicted and actual values."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)

    print("R2:", round(r2, 3))
    print("MAE:", mae.round(3))
    print("MSE:", mse.round(3))
    print("RMSE:", (mse ** 0.5).round(3))
y_pred_train = model.predict(x_train)

print("TRAINING METRICS:")
regression_report(y_train, y_pred_train)
TRAINING METRICS:
R2: 0.604
MAE: 0.538
MSE: 0.533
RMSE: 0.73

Model Evaluation

Test metrics:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(x_test)

print("TEST METRICS:")
regression_report(y_test, y_pred)
TEST METRICS:
R2: 0.613
MAE: 0.519
MSE: 0.499
RMSE: 0.706

We see an r-squared score of around 0.61 for the baseline model using all features.
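
Another way to get a feel for the fit (a sketch): plot the predicted values against the actual values for the test set. A perfect model would place every point on the diagonal:

px.scatter(x=y_test.values, y=y_pred, height=450,
    title="Predicted vs. Actual Median Housing Price (Test Set)",
    labels={"x": "Actual Value", "y": "Predicted Value"},
    trendline="ols", trendline_color_override="green",
)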

Feature Selection

OK, so we’ve trained a model using all available features, and examined the coefficients to see which features are most predictive. But do we need all the features? Let’s consider which features will give us the most “bang for our buck”, as we explore trade-offs between model performance and model complexity.

To perform this experiment without duplicating lots of code, we abstract the logic into a custom function called train_eval, which accepts a list of features as a parameter. This will allow us to test different combinations of features.

from pandas import DataFrame
from sklearn.linear_model import LinearRegression
from pandas import Series
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


def train_eval(df:DataFrame, target="value", features=None, scale=True):
    """Trains a linear regression model on a dataset
        for a given target variable and list of features.
        Uses all features in the dataframe by default.
    """

    # X/Y SPLIT

    if features:
        x = df[features]
    else:
        x = df.drop(columns=[target])

    y = df[target]
    print("FEATURES:", x.columns.tolist())

    # SCALING

    if scale:
        x = (x - x.mean()) / x.std()

    # TRAIN/TEST SPLIT

    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)

    # MODEL TRAINING

    model = LinearRegression()
    model.fit(x_train, y_train)

    print("--------")
    print("COEFS:")
    coefs = Series(model.coef_, index=x.columns)
    print(coefs.sort_values(ascending=False))


    print("--------")
    y_pred_train = model.predict(x_train)
    r2_train = r2_score(y_train, y_pred_train)
    print("R2 (TRAIN):", round(r2_train, 3))

    # EVALUATION

    print("--------")
    y_pred = model.predict(x_test)
    r2 = r2_score(y_test, y_pred)
    print("R2 (TEST):", round(r2, 3))

As we saw earlier, our baseline model (using all the features) gets us an r-squared of around 60%.

train_eval(df)
FEATURES: ['latitude', 'longitude', 'population', 'income', 'age', 'rooms', 'bedrooms', 'occupants']
--------
COEFS:
income        0.819154
bedrooms      0.276513
age           0.123777
population   -0.001711
occupants    -0.035253
rooms        -0.240811
longitude    -0.878116
latitude     -0.908597
dtype: float64
--------
R2 (TRAIN): 0.604
--------
R2 (TEST): 0.613

We can start by using a single feature, and build from there.

We saw earlier how income is most highly correlated with the target, and its coefficient was high in magnitude. This would be a great feature to start with.

train_eval(df, features=["income"])
FEATURES: ['income']
--------
COEFS:
income    0.793869
dtype: float64
--------
R2 (TRAIN): 0.472
--------
R2 (TEST): 0.479

We saw earlier a linear relationship between income and bedrooms, so it’s no surprise adding bedrooms to the model does not provide much “lift” (i.e. help improve performance):

train_eval(df, features=["income", "bedrooms"])
FEATURES: ['income', 'bedrooms']
--------
COEFS:
income      0.793568
bedrooms   -0.004588
dtype: float64
--------
R2 (TRAIN): 0.472
--------
R2 (TEST): 0.479

There was a linear relationship between bedrooms and rooms, so we see similar results adding rooms as a feature:

train_eval(df, features=["income", "rooms"])
FEATURES: ['income', 'rooms']
--------
COEFS:
income    0.820176
rooms    -0.081325
dtype: float64
--------
R2 (TRAIN): 0.477
--------
R2 (TEST): 0.487

Using rooms and bedrooms improves performance a bit, but due to collinearity we probably wouldn’t want to use them both.

train_eval(df, features=["income", "rooms", "bedrooms"])
FEATURES: ['income', 'rooms', 'bedrooms']
--------
COEFS:
income      1.003513
bedrooms    0.472005
rooms      -0.552543
dtype: float64
--------
R2 (TRAIN): 0.504
--------
R2 (TEST): 0.524

What about geographic region only?

train_eval(df, features=["latitude", "longitude"])
FEATURES: ['latitude', 'longitude']
--------
COEFS:
longitude   -1.433114
latitude    -1.491658
dtype: float64
--------
R2 (TRAIN): 0.242
--------
R2 (TEST): 0.244

The features with the highest magnitude coefficients are income, latitude, and longitude, so we can see the results from using just these features.

train_eval(df, features=["income", "latitude", "longitude"])
FEATURES: ['income', 'latitude', 'longitude']
--------
COEFS:
income       0.695895
longitude   -1.004133
latitude    -1.039170
dtype: float64
--------
R2 (TRAIN): 0.582
--------
R2 (TEST): 0.59

Just these three features give us nearly the same predictive ability as using all eight features (a test r-squared of around 0.59 vs. 0.61). So if we take model complexity into account, we might choose these three as the final set of features.
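
Rather than comparing feature subsets by hand, we could also lean on scikit-learn’s automated helpers. Here is a sketch using recursive feature elimination (RFE) to pick the top three features for a linear model, reusing the scaled features from earlier:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(x_scaled, y)
print(x_scaled.columns[selector.support_].tolist())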