“Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month.”
The data is expressed in “Thousands”, and is “Not Seasonally Adjusted”.
Wrangling the data, including renaming columns and converting the date index to be datetime-aware, may make it easier for us to work with this data:
from pandas import to_datetime

df.rename(columns={DATASET_NAME: "population"}, inplace=True)
df.index.name = "date"
df.index = to_datetime(df.index)
df
date        population
1959-01-01    175818.0
1959-02-01    176044.0
1959-03-01    176274.0
...                ...
2024-07-01    337005.0
2024-08-01    337185.0
2024-09-01    337362.0
789 rows × 1 columns
9.2 Data Exploration
Exploring trends:
import plotly.express as px

px.scatter(df, y="population", title="US Population (Monthly) vs Trend",
    labels={"population": "US Population (thousands)", "value": ""},
    trendline="ols", trendline_color_override="red", height=350,
)
Looks like a possible linear trend. Let’s perform a more formal regression analysis.
9.3 Data Encoding
Because we need numeric features to perform a regression, we convert the dates to a linear time step of integers (after sorting the data first for good measure):
We will use the numeric time step as our input variable (x), to predict the population (y).
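The encoding cell itself does not appear in this text, so here is a minimal sketch of what it might look like. It assumes the time step simply numbers the sorted rows starting at 1 (consistent with the `time_step` values in the combined table later in the chapter), and it uses a three-row stand-in frame rather than the full dataset:

```python
from pandas import DataFrame, to_datetime

# Stand-in frame with a few of the monthly values shown earlier,
# deliberately out of order to show why sorting comes first.
df = DataFrame(
    {"population": [176274.0, 175818.0, 176044.0]},
    index=to_datetime(["1959-03-01", "1959-01-01", "1959-02-01"]),
)
df.index.name = "date"

# Sort chronologically, then number the rows 1, 2, 3, ...
df.sort_index(inplace=True)
df["time_step"] = list(range(1, len(df) + 1))

print(df)
```

Sorting before numbering matters: if the rows were out of order, the time step would no longer increase with the date, and the fitted slope would not describe growth over time.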
9.4 Data Splitting
9.4.1 X/Y Split
Identifying dependent and independent variables:
#x = df[["date"]] # we need numbers not strings
x = df[["time_step"]]
y = df["population"]

print("X:", x.shape)
print("Y:", y.shape)
X: (789, 1)
Y: (789,)
9.4.2 Train/Test Split
Splitting data sequentially, where earlier data is used in training, and recent data is used for testing:
print(len(df))
training_size = round(len(df) * .8)
print(training_size)

x_train = x.iloc[:training_size] # slice all before
y_train = y.iloc[:training_size] # slice all before

x_test = x.iloc[training_size:] # slice all after
y_test = y.iloc[training_size:] # slice all after

print("TRAIN:", x_train.shape)
print("TEST:", x_test.shape)
789
631
TRAIN: (631, 1)
TEST: (158, 1)
9.5 Model Selection and Training
Training a linear regression model on the training data:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
After training, we have access to the learned weights, as well as the line of best fit (i.e. the trend line):
COEFS: [212.8667855]
INTERCEPT: 174578.54744547582
------------------
y = 212.867x + 174578.547
In this case, we interpret the line of best fit as follows: the slope tells us how much the population is expected to grow on average per time step (one month), and the intercept gives the trend value at time step zero, one step before the earliest observation.
Note
Remember in this dataset the population is expressed in thousands.
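To make the units concrete, the following sketch converts the learned parameters into numbers of people. The slope and intercept are copied from the printed output above; the `trend` helper is a hypothetical name introduced only for this illustration:

```python
# Values copied from the printed output above.
slope = 212.8667855             # thousands of people per time step (month)
intercept = 174578.54744547582  # trend value (thousands) at time step 0


def trend(time_step):
    """Evaluate the line of best fit, in thousands of people."""
    return slope * time_step + intercept


# Convert from thousands to people.
growth_per_month = slope * 1_000
growth_per_year = growth_per_month * 12

print(f"Average growth per month: {growth_per_month:,.0f} people")
print(f"Average growth per year: {growth_per_year:,.0f} people")
print(f"Trend at time step 1: {trend(1):,.3f} thousand")
```

So the trend line suggests growth of roughly 213 thousand people per month, or about 2.55 million per year, over the training period.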
9.6 Model Prediction and Evaluation
We use the trained model to make predictions on the test set, and then calculate regression metrics to see how well the model is doing:
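The evaluation cell is not shown in this text, so the sketch below only illustrates the general shape of this step. It assumes R-squared and mean absolute error as the metrics (the original may have used others), and it runs on synthetic stand-in data rather than the population series:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in data (NOT the population series): a straight line
# plus a gentle curve, so that the linear trend fits imperfectly.
time_step = np.arange(1, 101)
x = time_step.reshape(-1, 1)
y = 200.0 * time_step + 0.5 * time_step**2

# Sequential 80/20 split, as in the chapter.
training_size = round(len(x) * 0.8)
x_train, y_train = x[:training_size], y[:training_size]
x_test, y_test = x[training_size:], y[training_size:]

model = LinearRegression()
model.fit(x_train, y_train)

# Predict on the held-out (most recent) observations only.
y_pred = model.predict(x_test)

print("R^2:", round(r2_score(y_test, y_pred), 3))
print("MAE:", round(mean_absolute_error(y_test, y_pred), 3))
```

Because the test set holds the most recent observations, these metrics describe how well the historical trend extrapolates forward, which is a harder test than scoring on a random sample.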
from pandas import concat

chart_df = concat([df, df_future])
chart_df
date        population  time_step     prediction        error
1959-01-01    175818.0          1  174791.414231  1026.585769
1959-02-01    176044.0          2  175004.281016  1039.718984
1959-03-01    176274.0          3  175217.147802  1056.852198
...                ...        ...            ...          ...
2027-07-01         NaN        823  349767.911913          NaN
2027-08-01         NaN        824  349980.778699          NaN
2027-09-01         NaN        825  350193.645484          NaN
825 rows × 4 columns
Note
The population and error values for future dates are null because we don’t know them yet. We can still make predictions for these dates, however, based on historical trends.
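The cell that builds `df_future` is not shown in this text. One way the future rows might be constructed is sketched below, assuming 36 monthly dates after the last observation and time steps that continue from 789 (both inferred from the table above, which ends at time step 825 on 2027-09-01). The slope and intercept are copied from the training output:

```python
from pandas import DataFrame, date_range

# Learned parameters copied from the training output earlier in the chapter.
slope, intercept = 212.8667855, 174578.54744547582

# Assumption: 36 month-start dates following the last observation
# (2024-09-01), continuing the time step count from 789.
last_time_step = 789
future_dates = date_range(start="2024-10-01", periods=36, freq="MS")

first_step = last_time_step + 1
df_future = DataFrame(
    {"time_step": list(range(first_step, first_step + len(future_dates)))},
    index=future_dates,
)
df_future.index.name = "date"

# Evaluate the trend line directly; model.predict would do the same job.
df_future["prediction"] = slope * df_future["time_step"] + intercept

print(df_future.tail(3))
```

Because this frame has no population or error columns, concatenating it with the historical data leaves those columns null for the future dates.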
Plotting trend vs actual, with future predictions:
px.line(chart_df[-180:], y=["population", "prediction"], height=350, title="US Population (Monthly) vs Regression Predictions (Trend)", labels={"value":""})