zundertj zundertj - 2 months ago 54
Python Question

Pandas dataframe converting specific columns from string to float

I am trying to do some simple analyses on the Kenneth French industry portfolios (first time with Pandas/Python), data is in txt format (see link in the code). Before I can do computations, first want to load it in a Pandas dataframe properly, but I've been struggling with this for hours:

import urllib.request
import os.path
import zipfile
import pandas as pd
import numpy as np

# paths
url = 'http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/48_Industry_Portfolios_CSV.zip'
csv_name = '48_Industry_Portfolios.CSV'
local_zipfile = '{0}/data.zip'.format(os.getcwd())
local_file = '{0}/{1}'.format(os.getcwd(), csv_name)

# download data
if not os.path.isfile(local_file):
print('Downloading and unzipping file!')
urllib.request.urlretrieve(url, local_zipfile)
zipfile.ZipFile(local_zipfile).extract(csv_name, os.path.dirname(local_file))

# read from file
df = pd.read_csv(local_file,skiprows=11)
df.rename(columns={'Unnamed: 0' : 'dates'}, inplace=True)

# build new dataframe
first_stop = df['dates'][df['dates']=='201412'].index[0]
df2 = df[:first_stop]

# convert date to datetime object
pd.to_datetime(df2['dates'], format = '%Y%m')
df2.index = df2.dates


All the columns, except dates, represent financial returns. However, due to the file formatting, these are now strings. According to Pandas docs, this should do the trick:

df2.convert_objects(convert_numeric=True)


But the columns remain strings. Other suggestions are to loop over the columns (see for example pandas convert strings to float for multiple columns in dataframe):

for d in df2.columns:
if d is not 'dates':
df2[d] = df2[d].map(lambda x: float(x)/100)


But this gives me the following warning:

home/<xxxx>/Downloads/pycharm-community-4.5/helpers/pydev/pydevconsole.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
try:


I have read the documentation on views vs copies, but having difficulty to understand why it is a problem in my case, but not in the code snippets in the question I linked to. Thanks

Edit:

df2=df2.convert_objects(convert_numeric=True)


Does the trick, although I receive a depreciation warning (strangely enough that is not in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html)

Some of df2:

dates Agric Food Soda Beer Smoke Toys Fun \
dates
192607 192607 2.37 0.12 -99.99 -5.19 1.29 8.65 2.50
192608 192608 2.23 2.68 -99.99 27.03 6.50 16.81 -0.76
192609 192609 -0.57 1.58 -99.99 4.02 1.26 8.33 6.42
192610 192610 -0.46 -3.68 -99.99 -3.31 1.06 -1.40 -5.09
192611 192611 6.75 6.26 -99.99 7.29 4.55 0.00 1.82


Edit2: the solution is actually more simple than I thought:

df2.index = pd.to_datetime(df2['dates'], format = '%Y%m')
df2 = df2.astype(float)/100

Answer Source

You need to assign the result of convert_objects as there is no inplace param:

df2=df2.convert_objects(convert_numeric=True)

you refer to the rename method but that one has an inplace param which you set to True.

Most operations in pandas return a copy and some have inplace param, convert_objects is one that does not. This is probably because if the conversion fails then you don't want to blat over your data with NaNs.

Also the deprecation warning is to split out the different conversion routines, presumably so you can specialise the params e.g. format string for datetime etc..