helloB helloB - 29 days ago 11
Python Question

Warning: multiple data types in column of very large dataframe

I have a fairly large pandas DataFrame read in from a CSV file (~3 million rows and 72 columns), and I am getting a warning that some of the columns contain mixed data types:

DtypeWarning: Columns (1,2,3,15,16,17,18,19,20,21,22,23,31,32,33,35,37,38,39,40,41,42,43,44,45,46,47,48,50,51,52,55,57,58,60,71) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)


What's the best way to deal with this given that I can't just eyeball the csv? In particular, is there a way to get a list of all the data types that occur in a given column and what their corresponding row numbers are?

Answer

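Consider the following df:

import numpy as np
import pandas as pd

# two object columns that deliberately mix Python types
df = pd.DataFrame(dict(col1=[1, '1', False, np.nan, ['hello']],
                       col2=[2, 3.14, 'hello', (1, 2, 3), True]))
# repeat the rows once so that every type occurs twice
df = pd.concat([df for _ in range(2)], ignore_index=True)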

df


You can list the types that occur in a column, and how many values of each type there are, with:

df.col1.apply(type).value_counts()

<type 'float'>    2
<type 'int'>      2
<type 'list'>     2
<type 'bool'>     2
<type 'str'>      2
Name: col1, dtype: int64
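
To survey several columns at once, the same idea can be applied column by column. This is only a minimal sketch: it rebuilds the example df so that it runs on its own, restricts itself to object-dtype columns (an assumption: mixed-type columns normally end up with that dtype), and type_counts is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(col1=[1, '1', False, np.nan, ['hello']],
                       col2=[2, 3.14, 'hello', (1, 2, 3), True]))
df = pd.concat([df for _ in range(2)], ignore_index=True)

# Only object-dtype columns can hold a mix of Python types.
object_cols = df.columns[df.dtypes == object]

# Map each column name to a plain dict of {type: number of values}.
type_counts = {
    col: df[col].apply(type).value_counts().to_dict()
    for col in object_cols
}

for col, counts in type_counts.items():
    print(col, counts)
```

Converting each result to a plain dict avoids indexing a Series with a type object, which pandas would treat as a callable. On a frame with millions of rows this calls type once per value, so it may take a while.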

You can then select the rows of col1 that hold a float like this:

df[df.col1.apply(type) == float]

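
To get every type in a column together with the row labels where it occurs, which is what the question asks for, one option is a single pass over the column. This is a sketch, not the only way to do it; rows_by_type is an illustrative name, and the df is rebuilt here so the block runs on its own.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(col1=[1, '1', False, np.nan, ['hello']],
                       col2=[2, 3.14, 'hello', (1, 2, 3), True]))
df = pd.concat([df for _ in range(2)], ignore_index=True)

# Collect the index labels of col1 under the Python type of each value.
rows_by_type = defaultdict(list)
for label, value in df['col1'].items():
    rows_by_type[type(value)].append(label)

for value_type, labels in rows_by_type.items():
    print(value_type.__name__, labels)
```

The labels are index labels of the DataFrame. With the default integer index produced by read_csv, label 0 is the first data row after the header, so the position in the file has to be offset accordingly.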