Data Scaling

Data scaling is a data processing technique that adjusts the range of each feature (input variable) in a dataset to a common scale, while preserving the relative differences between the values within each feature.

To illustrate the motivations behind data scaling, let’s revisit our familiar dataset of economic indicators:

from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/monthly-indicators.csv"

df = read_csv(request_url)
df.head()
    timestamp      cpi   fed       spy     gld
0  2024-05-01  314.069  5.33  525.6718  215.30
1  2024-04-01  313.548  5.33  500.3636  211.87
2  2024-03-01  312.332  5.33  521.3857  205.72
3  2024-02-01  310.326  5.33  504.8645  189.31
4  2024-01-01  308.417  5.33  479.8240  188.45
print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())
234
2004-12-01 ... 2024-05-01

When we plot this data on a graph, we cannot see the movement of the federal funds rate, because its values are so much smaller than those of the other indicators:

import plotly.express as px

chart_df = df.copy()
chart_df.index = df["timestamp"]
chart_df.drop(columns=["timestamp"], inplace=True)

px.line(chart_df, y=["cpi", "fed", "spy", "gld"],
        title="Financial indicators over time (unscaled)"
)

Let’s fix this by scaling the data.

Scaling the data will make it easier to plot all these different series on one graph, so we can start to get an informal sense of how their movements might correlate.

Data Scaling Techniques

Min-Max Scaling

One scaling approach, called min-max scaling, subtracts each column's minimum value and then divides by that column's range (its maximum minus its minimum), essentially expressing each value as its position between the smallest and greatest values in that column.

scaled_df = df.copy()
scaled_df.index = df["timestamp"]
scaled_df.drop(columns=["timestamp"], inplace=True)

# MIN-MAX SCALING:
# subtracting each column's minimum value,
# then dividing by that column's range (maximum minus minimum)
scaled_df = (scaled_df - scaled_df.min()) / (scaled_df.max() - scaled_df.min())

px.line(scaled_df, y=["cpi", "fed", "spy", "gld"],
        title="Financial indicators over time (min-max scaled)"
)

When we use min-max scaling, resulting values will be expressed on a scale between zero and one.
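As a small worked example on made-up numbers (not the indicators dataset), the following sketch applies the textbook min-max formula, (x - min) / (max - min), and shows that each column's smallest value maps to zero and its largest to one. The `toy_df` values and the `min_max_scale` helper name are illustrative assumptions.

```python
import pandas as pd

# Made-up values for illustration only (not the indicators dataset).
toy_df = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 5.0],
    "large": [100.0, 300.0, 200.0, 500.0],
})

def min_max_scale(frame):
    """Rescale each column to the [0, 1] interval (hypothetical helper)."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Subtract the column minimum, then divide by the column range.
    return (frame - col_min) / col_range

toy_scaled = min_max_scale(toy_df)
print(toy_scaled)
```

Although the two columns start on very different scales, both end up spanning exactly zero to one after scaling.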

Standard Scaling

An alternative scaling approach, called standard scaling or z-score normalization, mean-centers the data and normalizes by the standard deviation, so that each value is expressed as a number of standard deviations above or below its column's mean:

scaled_df = df.copy()
scaled_df.index = df["timestamp"]
scaled_df.drop(columns=["timestamp"], inplace=True)

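# STANDARD SCALING (z-scores):
# subtracting each column's mean, then dividing by that column's standard deviation
# (pandas .std() uses the sample standard deviation, ddof=1, by default)
scaled_df = (scaled_df - scaled_df.mean()) / scaled_df.std()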

px.line(scaled_df, y=["cpi", "fed", "spy", "gld"],
        title="Financial indicators over time (standard/z-score scaled)"
)

When we use standard scaling, each column of resulting values has a mean of zero and a standard deviation of one, so the values are expressed on a scale which is centered around zero.
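To check these properties on a small scale, here is a minimal sketch using made-up numbers (the `toy_df` values are assumptions, not the indicators dataset). It also shows one detail worth knowing: pandas computes the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two can give slightly different z-scores on small datasets.

```python
import pandas as pd

# Made-up values for illustration only.
toy_df = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0, 5.0],
    "large": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Sample-based z-scores: pandas .std() defaults to ddof=1.
toy_scaled = (toy_df - toy_df.mean()) / toy_df.std()

# Population-based z-scores: ddof=0, the convention StandardScaler follows.
toy_scaled_pop = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

print(toy_scaled.mean().round(6))  # column means after scaling
print(toy_scaled.std().round(6))   # column standard deviations after scaling
```

Both versions center each column at zero; they differ only in which standard deviation ends up equal to one.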

Now that we have scaled the data, we can more easily compare the movements of all the series. Which indicators have been moving up or down at a time when another indicator has been moving up or down? Are there any time periods where we might start to suspect a positive or negative correlation?
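If we want to move from eyeballing a chart to a number, one option is the Pearson correlation coefficient, available through the pandas `DataFrame.corr` method. The sketch below uses made-up series (the `toy_df` values and column names are assumptions) and illustrates a useful fact: standard scaling makes the chart easier to read, but it does not change the correlations themselves.

```python
import pandas as pd

# Made-up series for illustration only.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 9.0, 7.0, 4.0, 2.0],            # falls as "a" rises
    "c": [200.0, 210.0, 250.0, 240.0, 300.0],   # mostly rises with "a"
})

# Standard scaling, as in the section above.
toy_scaled = (toy_df - toy_df.mean()) / toy_df.std()

# Pairwise Pearson correlations, before and after scaling.
corr_raw = toy_df.corr()
corr_scaled = toy_scaled.corr()

print(corr_raw.round(3))
print(corr_scaled.round(3))
```

The two correlation matrices should match up to floating-point error, because shifting and rescaling a series by positive constants leaves its Pearson correlation with other series unchanged.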

Importance for Machine Learning

Data scaling is relevant in machine learning when using multiple input features that have different scales. In these situations, scaling ensures different features contribute proportionally to the model during training. If features are not scaled, those features with larger ranges may disproportionately influence the model, leading to biased predictions or slower convergence during training.

Some models, such as ordinary least squares linear regression, are less sensitive to the range of input data, and with these models, scaling the data may not make a noticeable difference in performance.
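To see why, here is a minimal sketch that fits an ordinary least squares model with an intercept on synthetic data, once with raw features and once with standard-scaled features. The data, the random seed, and the `ols_predict` helper are all assumptions made for illustration; the fit uses NumPy's `numpy.linalg.lstsq` rather than any particular modeling library.

```python
import numpy as np

# Synthetic data: two features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 50), rng.normal(0, 1000, 50)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 50)

def ols_predict(features, target):
    """Fit OLS with an intercept and return in-sample predictions (hypothetical helper)."""
    design = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coefs

# Standard-scale each feature column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

preds_raw = ols_predict(X, y)
preds_scaled = ols_predict(X_scaled, y)

# Largest difference between the two sets of predictions.
print(np.abs(preds_raw - preds_scaled).max())
```

The fitted coefficients differ between the two versions, since they absorb the change of units, but the predictions should agree up to floating-point error.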

However, other algorithms, especially those that rely on distance-based calculations (e.g. Support Vector Machines or K-Nearest Neighbors) or regularization methods (e.g. Ridge or Lasso regression), are particularly sensitive to the range of the input data and tend to perform better when all features are on a similar scale.
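The effect on distance-based methods can be sketched with a few made-up points that have one small-range feature (an interest rate) and one large-range feature (an index level). The values and the `distances_from_first` helper below are assumptions for illustration. On the raw data, Euclidean distance is driven almost entirely by the index level, so the nearest neighbor of the first point changes once both features are scaled.

```python
import numpy as np

# Columns: [interest rate in percent, index level]. Made-up values.
points = np.array([
    [1.0, 300.0],   # query point
    [1.2, 340.0],   # similar rate, index level differs by 40
    [4.8, 305.0],   # very different rate, index level differs by 5
    [0.2, 100.0],
    [5.0, 500.0],
    [2.5, 200.0],
])

def distances_from_first(data):
    """Euclidean distance from the first row to every other row (hypothetical helper)."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

# Distances on the raw features.
raw = distances_from_first(points)

# Distances after standard scaling each column.
scaled_points = (points - points.mean(axis=0)) / points.std(axis=0)
scaled = distances_from_first(scaled_points)

# Position (within rows 1..5) of the nearest neighbor to the query point.
print(np.argmin(raw), np.argmin(scaled))
```

Before scaling, the nearest neighbor is the point with a very different interest rate but a similar index level; after scaling, it is the point that is reasonably close on both features.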

By scaling data, using techniques like min-max scaling or standard scaling, we put the features on a comparable footing, which can improve model performance, training efficiency, and the accuracy of predictions.