Summary Statistics

The pandas package makes it easy to calculate basic summary statistics.

Let’s consider this example dataset of monthly financial and economic indicators:

from pandas import read_csv

df = read_csv("https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/monthly-indicators.csv")
df.head()
    timestamp      cpi   fed       spy     gld
0  2024-05-01  314.069  5.33  525.6718  215.30
1  2024-04-01  313.548  5.33  500.3636  211.87
2  2024-03-01  312.332  5.33  521.3857  205.72
3  2024-02-01  310.326  5.33  504.8645  189.31
4  2024-01-01  308.417  5.33  479.8240  188.45

It contains over 200 months, spanning a time period from 2004 to 2024:

print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())
234
2004-12-01 ... 2024-05-01

We can use the dataframe’s describe method to quickly see the basic summary statistics for each numeric column in the dataset:

df.describe()
              cpi         fed         spy         gld
count  234.000000  234.000000  234.000000  234.000000
mean   239.904709    1.588761  199.248881  124.362344
std     30.299220    1.879657  122.947398   40.170234
min    190.300000    0.050000   55.148800   41.650000
25%    216.963500    0.120000   98.790475  101.575000
50%    236.409000    0.390000  164.878300  123.360000
75%    256.327500    2.480000  270.226075  159.787500
max    314.069000    5.330000  525.671800  215.300000

This shows us, for each column, the count of non-null values, the mean and standard deviation, the min and max, and the quartiles (the 25th, 50th, and 75th percentiles).
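As a minimal sketch of how the describe method can be adjusted, the example below uses a small made-up dataframe (the `demo` name and its values are hypothetical, not the chapter's dataset). It shows that non-numeric columns are left out by default, and that the `percentiles` parameter lets us ask for cut points other than the quartiles.

```python
import pandas as pd

# Made-up values for illustration only; not the chapter's dataset.
demo = pd.DataFrame({
    "rate": [0.05, 0.12, 0.39, 2.48, 5.33],
    "label": ["a", "b", "c", "d", "e"],
})

# By default, describe() summarizes only the numeric columns.
default_summary = demo.describe()

# The percentiles parameter requests other cut points (here the 10th and 90th).
custom_summary = demo.describe(percentiles=[0.1, 0.9])

print(default_summary)
print(custom_summary)
```

The exact set of rows returned alongside the requested percentiles may vary between pandas versions, so it is worth printing the result rather than assuming its shape.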

As you may be aware, we can alternatively calculate these metrics ourselves, using Series aggregations:

# https://pandas.pydata.org/docs/reference/api/pandas.Series.html
# https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html

series = df["fed"]

print("COUNT:", series.count())  # non-null values, matching what describe reports
print("MEAN:", series.mean().round(6))
print("STD:", series.std().round(6))
print("-------------")
print("MIN:", series.min())
print("25TH:", series.quantile(.25))
print("MED:", series.median())
print("75TH:", series.quantile(.75))
print("MAX:", series.max())
COUNT: 234
MEAN: 1.588761
STD: 1.879657
-------------
MIN: 0.05
25TH: 0.12
MED: 0.39
75TH: 2.48
MAX: 5.33

For comparison, the describe method of the same series returns matching values:

series.describe()
count    234.000000
mean       1.588761
std        1.879657
min        0.050000
25%        0.120000
50%        0.390000
75%        2.480000
max        5.330000
Name: fed, dtype: float64
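Two details can matter when reproducing the describe output by hand: the count excludes missing values, and the standard deviation pandas reports is the sample standard deviation (dividing by n - 1). The sketch below illustrates both on a small made-up series with one missing value; the values are hypothetical and chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry; not the chapter's dataset.
s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

n_rows = len(s)       # counts every row, including the missing one
n_valid = s.count()   # counts only non-missing values, like describe()

sample_std = s.std()            # pandas default: ddof=1 (divides by n - 1)
population_std = s.std(ddof=0)  # divides by n instead

print("len:", n_rows, "count:", n_valid)
print("sample std:", round(sample_std, 6))
print("population std:", round(population_std, 6))
print("describe std:", round(s.describe()["std"], 6))
```

In the chapter's dataset the two counts happen to agree (both are 234), which suggests the "fed" column has no missing values.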

Distribution Plots

To learn more about how this data is distributed, we can create distribution plots, which give a visual counterpart to the summary statistics.

A box plot:

import plotly.express as px

# https://plotly.com/python-api-reference/generated/plotly.express.box.html
px.box(df, x="fed", orientation="h", points="all",
        title="Distribution of Federal Funds Rate (Monthly)",
        hover_data=["timestamp"]
)
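A box plot is built from the quartiles, and a common convention (assumed here) flags values more than 1.5 times the interquartile range beyond the box as potential outliers. As a rough numeric counterpart to the chart, the sketch below applies that rule to the "fed" quartiles reported by the describe method earlier. Plotly may compute quartiles slightly differently, so the chart's whiskers need not match these fences exactly.

```python
# Quartiles and extremes for "fed", copied from the describe() output above.
q1, q3 = 0.12, 2.48
observed_min, observed_max = 0.05, 5.33

iqr = q3 - q1                 # interquartile range
lower_fence = q1 - 1.5 * iqr  # assumed 1.5 x IQR rule
upper_fence = q3 + 1.5 * iqr

any_outside = observed_min < lower_fence or observed_max > upper_fence

print("IQR:", round(iqr, 2))
print("fences:", round(lower_fence, 2), "to", round(upper_fence, 2))
print("any value outside fences:", any_outside)
```

Under this rule, none of the monthly values would be flagged, because the maximum of 5.33 sits below the upper fence of about 6.02.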

A violin plot:

# https://plotly.com/python-api-reference/generated/plotly.express.violin.html
px.violin(df, x="fed", orientation="h", points="all", box=True,
        title="Distribution of Federal Funds Rate (Monthly)",
        hover_data=["timestamp"]
)

A histogram:

# https://plotly.com/python-api-reference/generated/plotly.express.histogram.html
px.histogram(df, x="fed", #nbins=12,
        title="Distribution of Federal Funds Rate (Monthly)", height=350)

When we make a histogram, we can specify the number of bins using the nbins parameter (commented out in the example above, in which case plotly chooses the bins automatically).
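To see why the bin count matters without rendering a chart, the sketch below bins a small made-up set of values with numpy, which is used here as a stand-in and is not how plotly bins internally. The values are hypothetical. As far as I know, plotly may treat nbins as an upper bound rather than an exact count, so the rendered figure is the final word.

```python
import numpy as np

# Made-up values for illustration only; not the chapter's dataset.
values = np.array([0.05, 0.1, 0.1, 0.12, 0.2, 0.39, 0.5, 1.0, 2.5, 5.33])

results = {}
for n_bins in (2, 5):
    # np.histogram splits the min-to-max range into n_bins equal-width bins.
    counts, edges = np.histogram(values, bins=n_bins)
    results[n_bins] = counts.tolist()
    print(n_bins, "bins:", results[n_bins])
```

With few bins, most values collapse into a single bar; with more bins, gaps and clusters in the data become visible.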

These charts help us visually assess the shape of the distribution.

Based on these views, it is hard to say for sure whether this data is normally distributed, multi-modal, or heavily skewed by extreme values. In the next chapter, we will perform formal statistical tests to determine if this data is normally distributed.
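Before any formal test, one rough hint about skew is the gap between the mean and the median. In the describe output for "fed", the mean (1.588761) is well above the median (0.39), which is consistent with a long right tail. The sketch below shows the same comparison, plus the pandas `skew` method, on made-up values that are hypothetical and not the chapter's dataset. It is an informal check, not a substitute for the tests in the next chapter.

```python
import pandas as pd

# Made-up, right-skewed values for illustration only; not the chapter's dataset.
s = pd.Series([0.05, 0.1, 0.1, 0.12, 0.2, 0.39, 0.5, 1.0, 2.5, 5.33])

mean, median = s.mean(), s.median()
skewness = s.skew()  # sample skewness; positive suggests a longer right tail

print("mean:", round(mean, 3), "median:", round(median, 3))
print("skew:", round(skewness, 3))
```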