14  Datatype Conversions

14.1 Converting to Numeric

For this example, we will use a dataset of publicly traded companies, from the Nasdaq:

Code
from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/applied-data-science-python-book/refs/heads/main/docs/data/nasdaq_screener_1735923271750.csv"

df = read_csv(request_url)
df.rename(columns={"% Change": "Pct Change", "Last Sale": "Latest Close"}, inplace=True)
df.drop(columns=["Market Cap", "Country", "IPO Year", "Volume", "Industry", "Sector"], inplace=True)
df.head()
Symbol Name Latest Close Net Change Pct Change
0 A Agilent Technologies Inc. Common Stock $134.125 0.695 0.521%
1 AA Alcoa Corporation Common Stock $35.77 -2.220 -5.844%
2 AACG ATA Creativity Global American Depositary Shares $0.901 0.021 2.386%
3 AACT Ares Acquisition Corporation II Class A Ordina... $10.97 0.000 0.00%
4 AADI Aadi Bioscience Inc. Common Stock $3.16 0.020 0.637%

In this case we see the “Latest Close” column contains values like “$134.125” (including the dollar sign), and the “Pct Change” column contains values like “2.386%” (including the percent sign). These values are likely to be string datatypes. This unfortunately would prevent us from performing numeric calculations with the values in these columns.

We can more formally inspect the datatypes of each column, using the dtypes property:

df.dtypes
Symbol           object
Name             object
Latest Close     object
Net Change      float64
Pct Change       object
dtype: object

Here we see the datatype for the “Latest Close” and “Pct Change” columns are both a generic “object”, which is often used to represent strings. However we need to convert them to numeric datatypes instead.

Note

With pandas datatypes, the “object” datatype “can hold any Python object, including strings.”

To disambiguate, we can always ask for the datatype of one of the values itself:

val = df["Pct Change"].values[0]
print(val)
print(type(val))
0.521%
<class 'str'>

Here we see these values are indeed string datatypes.

We can use the to_numeric function from pandas to convert a string datatype to numeric:

from pandas import to_numeric

df["Latest Close Numeric"] = to_numeric(df["Latest Close"].str.lstrip("$"))

df["Pct Change Numeric"] = (to_numeric(df["Pct Change"].str.rstrip("%")) / 100)

df[["Symbol", "Latest Close", "Latest Close Numeric", "Pct Change", "Pct Change Numeric"]].head()
Symbol Latest Close Latest Close Numeric Pct Change Pct Change Numeric
0 A $134.125 134.125 0.521% 0.00521
1 AA $35.77 35.770 -5.844% -0.05844
2 AACG $0.901 0.901 2.386% 0.02386
3 AACT $10.97 10.970 0.00% 0.00000
4 AADI $3.16 3.160 0.637% 0.00637
Tip

In this example we are using string methods such as lstrip and rstrip to remove a character from the beginning or end of a string, respectively. However there are many other helpful string manipulation methods in pandas. For more information about string column operations, see Working with Text Data.

After converting to numeric datatypes, we see the new “Latest Close Numeric” column contains float values like 134.125, and the new “Pct Change Numeric” column contains float values like 0.02386.

We can now use these numeric values to perform calculations, for example calculating the average return, and determining whether there was a gain or loss:

df["Pct Change Numeric"].mean().round(4)
np.float64(0.0095)
df["Gain"] = df["Pct Change Numeric"] > 0
df.head()
Symbol Name Latest Close Net Change Pct Change Latest Close Numeric Pct Change Numeric Gain
0 A Agilent Technologies Inc. Common Stock $134.125 0.695 0.521% 134.125 0.00521 True
1 AA Alcoa Corporation Common Stock $35.77 -2.220 -5.844% 35.770 -0.05844 False
2 AACG ATA Creativity Global American Depositary Shares $0.901 0.021 2.386% 0.901 0.02386 True
3 AACT Ares Acquisition Corporation II Class A Ordina... $10.97 0.000 0.00% 10.970 0.00000 False
4 AADI Aadi Bioscience Inc. Common Stock $3.16 0.020 0.637% 3.160 0.00637 True

14.2 Datatype Casting

In the previous example, we used the to_numeric function to convert strings to numbers, however we can alternatively perform a wider variety of datatype casting using the astype method.

For example, casting between strings and numbers:

df["Latest Close Numeric"] = df["Latest Close"].str.lstrip("$").astype(float)

df["Latest Close Reconstructed"] = "$" + df["Latest Close Numeric"].astype(str)

df[["Symbol", "Latest Close", "Latest Close Numeric", "Latest Close Reconstructed"]].head()
Symbol Latest Close Latest Close Numeric Latest Close Reconstructed
0 A $134.125 134.125 $134.125
1 AA $35.77 35.770 $35.77
2 AACG $0.901 0.901 $0.901
3 AACT $10.97 10.970 $10.97
4 AADI $3.16 3.160 $3.16
df["Pct Change Numeric"] = df["Pct Change"].str.rstrip("%").astype(float) / 100

df["Pct Change Reconstructed"] = (df["Pct Change Numeric"] * 100).astype(str) + "%"

df[["Symbol", "Pct Change", "Pct Change Numeric", "Pct Change Reconstructed"]].head()
Symbol Pct Change Pct Change Numeric Pct Change Reconstructed
0 A 0.521% 0.00521 0.521%
1 AA -5.844% -0.05844 -5.844%
2 AACG 2.386% 0.02386 2.386%
3 AACT 0.00% 0.00000 0.0%
4 AADI 0.637% 0.00637 0.637%

And casting between booleans and integers:

df["Gain Binary"] = df["Gain"].astype(int)
df["Gain Reconstructed"] = df["Gain Binary"].astype(bool)

df[["Symbol", "Pct Change", "Gain", "Gain Binary", "Gain Reconstructed"]].head()
Symbol Pct Change Gain Gain Binary Gain Reconstructed
0 A 0.521% True 1 True
1 AA -5.844% False 0 False
2 AACG 2.386% True 1 True
3 AACT 0.00% False 0 False
4 AADI 0.637% True 1 True

After all these conversations, we can confirm the datatypes for good measure:

df.dtypes.sort_index()
Gain                             bool
Gain Binary                     int64
Gain Reconstructed               bool
Latest Close                   object
Latest Close Numeric          float64
Latest Close Reconstructed     object
Name                           object
Net Change                    float64
Pct Change                     object
Pct Change Numeric            float64
Pct Change Reconstructed       object
Symbol                         object
dtype: object