The pandas package makes it easy to calculate basic summary statistics.
Let’s consider this example dataset of monthly financial and economic indicators:
from pandas import read_csv
repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/monthly-indicators.csv"
df = read_csv(request_url)
df.head()
0 | 2024-05-01 | 314.069 | 5.33 | 525.6718 | 215.30
1 | 2024-04-01 | 313.548 | 5.33 | 500.3636 | 211.87
2 | 2024-03-01 | 312.332 | 5.33 | 521.3857 | 205.72
3 | 2024-02-01 | 310.326 | 5.33 | 504.8645 | 189.31
4 | 2024-01-01 | 308.417 | 5.33 | 479.8240 | 188.45
It contains over 200 months, spanning a time period from 2004 to 2024:
print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())
234
2004-12-01 ... 2024-05-01
We can use the dataframe’s describe method to quickly see the basic summary statistics for each numeric column in the dataset:
df.describe()
count | 234.000000 | 234.000000 | 234.000000 | 234.000000
mean | 239.904709 | 1.588761 | 199.248881 | 124.362344
std | 30.299220 | 1.879657 | 122.947398 | 40.170234
min | 190.300000 | 0.050000 | 55.148800 | 41.650000
25% | 216.963500 | 0.120000 | 98.790475 | 101.575000
50% | 236.409000 | 0.390000 | 164.878300 | 123.360000
75% | 256.327500 | 2.480000 | 270.226075 | 159.787500
max | 314.069000 | 5.330000 | 525.671800 | 215.300000
This shows the count of non-missing values, the mean and standard deviation, the min and max, and the quartiles (25th, 50th, and 75th percentiles) for each numeric column.
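If the default quartiles are not the cut points we care about, the describe method accepts a percentiles parameter. The sketch below uses a small made-up dataframe (toy_df and its column names are hypothetical stand-ins, not the indicators dataset) to show the idea:

```python
import pandas as pd

# Hypothetical stand-in data (not the monthly indicators dataset)
toy_df = pd.DataFrame({
    "rate": [0.05, 0.12, 0.39, 2.48, 5.33],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric, so skipped by default
})

# Request the 10th, 50th, and 90th percentiles instead of the default quartiles
summary = toy_df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

Only the numeric column appears in the result, and the percentile rows are labeled "10%", "50%", and "90%".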
As you may be aware, we can alternatively calculate these metrics ourselves, using Series aggregations:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.html
# https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html
series = df["fed"]
print("COUNT:", series.count())  # count() excludes missing values, like describe() does
print("MEAN:", series.mean().round(6))
print("STD:", series.std().round(6))
print("-------------")
print("MIN:", series.min())
print("25TH:", series.quantile(.25))
print("MED:", series.median())
print("75TH:", series.quantile(.75))
print("MAX:", series.max())
COUNT: 234
MEAN: 1.588761
STD: 1.879657
-------------
MIN: 0.05
25TH: 0.12
MED: 0.39
75TH: 2.48
MAX: 5.33
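One detail worth knowing about the quantile method: when the requested quantile falls between two observations, pandas interpolates linearly between them by default. A minimal sketch, using made-up values chosen for easy arithmetic (not the fed column):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])  # hypothetical values, not from the dataset

# Default: linear interpolation between the two nearest data points
q25_linear = values.quantile(0.25)

# Alternative: snap down to the nearest actual observation
q25_lower = values.quantile(0.25, interpolation="lower")

print("LINEAR:", q25_linear)
print("LOWER:", q25_lower)
```

With four values, the 25th percentile sits three quarters of the way from 1 to 2, so the default gives 1.75 while the "lower" option gives 1.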
series.describe() # for comparison
count 234.000000
mean 1.588761
std 1.879657
min 0.050000
25% 0.120000
50% 0.390000
75% 2.480000
max 5.330000
Name: fed, dtype: float64
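As a middle ground between describe and individual method calls, the agg method can compute a hand-picked list of statistics in one step. A minimal sketch, assuming a small made-up series (rates and its values are hypothetical):

```python
import pandas as pd

# Hypothetical values, standing in for a numeric column
rates = pd.Series([0.05, 0.12, 0.39, 2.48, 5.33], name="rate")

# Pass a list of aggregation names; the result is a Series indexed by those names
stats = rates.agg(["count", "mean", "std", "min", "median", "max"])
print(stats)
```

This can be convenient when we only want a few of the metrics, or want them in a particular order.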
Distribution Plots
To learn more about the distribution of this data, we can create distribution plots, which tell a visual story about the summary statistics.
A box plot:
import plotly.express as px
# https://plotly.com/python-api-reference/generated/plotly.express.box.html
px.box(df, x="fed", orientation="h", points="all",
    title="Distribution of Federal Funds Rate (Monthly)",
    hover_data=["timestamp"], height=400
)
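The box in a box plot is drawn from the quartiles we calculated earlier. The sketch below computes those ingredients by hand on made-up values, assuming the common convention of flagging points beyond 1.5 times the interquartile range as outliers (the plotting library's own quartile and whisker settings may differ slightly):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50])  # hypothetical values with one extreme point

q1 = values.quantile(0.25)   # left edge of the box
q3 = values.quantile(0.75)   # right edge of the box
iqr = q3 - q1                # interquartile range (width of the box)

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("FENCES:", lower_fence, upper_fence)
print("OUTLIERS:", outliers.tolist())
```

Here the box spans 3 to 7, the fences fall at -3 and 13, and the value 50 is the only point outside them.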
A violin plot:
# https://plotly.com/python-api-reference/generated/plotly.express.violin.html
px.violin(df, x="fed", orientation="h", points="all", box=True,
    title="Distribution of Federal Funds Rate (Monthly)",
    hover_data=["timestamp"], height=400
)
A histogram:
# https://plotly.com/python-api-reference/generated/plotly.express.histogram.html
px.histogram(df, x="fed",  # nbins=12,
    title="Distribution of Federal Funds Rate (Monthly)", height=400
)
When we make a histogram, we can specify the number of bins using the nbins parameter (commented out in the example above).
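To see what binning does to the data, we can count the observations per bin ourselves. This sketch uses NumPy's histogram function on made-up values; it illustrates the idea of equal-width bins and is not a reproduction of the plotting library's own binning algorithm, which may choose different bin edges:

```python
import numpy as np

# Hypothetical values, not from the dataset
values = np.array([0.05, 0.12, 0.39, 2.48, 5.33, 0.07, 0.09, 1.5])

# Split the range from min to max into 4 equal-width bins and count values in each
counts, edges = np.histogram(values, bins=4)

print("EDGES:", edges)
print("COUNTS:", counts)
```

Four bins produce five edges, and the counts always add up to the number of observations. Fewer bins give a smoother but coarser picture, while more bins show finer detail at the cost of more noise.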
These charts help us see the shape of the distribution: where the values cluster, how widely they spread, and whether there are outliers.