Christian Staudt - 2 months ago

pandas: error on DataFrame.unstack

I wrote the following function to convert several columns of a dataframe into numeric values:

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    data[columns] = data[columns].stack().rank(method='dense').unstack()
    return data


Calling it like this

trainDataPre = factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])


gives me an error. I don't know where to look for the cause, possibly wrong input?

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-14-357f8a4b2ef8> in <module>()
1 #trainDataPre = trainDataMerged.drop(["people_id", "activity_id", "date"], axis=1)
2 #trainDataPre = trainDataMerged.fillna(0)
----> 3 trainDataPre = mininggear.factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])

/Users/cls/Dropbox/Datengräber/Kaggle/RedHat/mininggear.py in factorizeMany(data, columns)
15 def factorizeMany(data, columns):
16 """ Factorize a bunch of columns in a data frame"""
---> 17 data[columns] = data[columns].stack().rank(method='dense').unstack()
18 return data
19

/usr/local/lib/python3.5/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
2041 """
2042 from pandas.core.reshape import unstack
-> 2043 return unstack(self, level, fill_value)
2044
2045 # ----------------------------------------------------------------------

/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level, fill_value)
405 else:
406 unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 407 fill_value=fill_value)
408 return unstacker.get_result()
409

/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns, fill_value)
90
91 # when index includes `nan`, need to lift levels/strides by 1
---> 92 self.lift = 1 if -1 in self.index.labels[self.level] else 0
93
94 self.new_index_levels = list(index.levels)

AttributeError: 'Index' object has no attribute 'labels'

Answer

The error occurs because, once the NaNs are filled with 0, the selected columns hold a mix of numeric and string values, and you then call rank on that mixed subset.

Consider this case:

import numpy as np
import pandas as pd

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})
df

    char_1  char_2   char_3
0      cat   mouse  giraffe
1      dog   tiger      NaN
2  buffalo    lion      cat
3      cat   mouse      NaN

df = df.fillna(0)
df[['char_3']].stack().rank()
Series([], dtype: float64)

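So you are effectively calling unstack on an empty Series. An empty Series has a plain Index rather than the MultiIndex that stack normally produces, which is why unstack fails with AttributeError: 'Index' object has no attribute 'labels'.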
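
Why unstack fails on the empty Series can be checked directly. The following is a minimal sketch, not code from the question. It assumes a recent pandas release, where unstack on a flat index is expected to raise ValueError instead of the AttributeError in the traceback, so both are caught. The names `stacked` and `empty` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'char_3': ['giraffe', np.nan, 'cat', np.nan]}).fillna(0)

# stack() returns a Series indexed by (row label, column name) pairs
stacked = df[['char_3']].stack()
has_multiindex = isinstance(stacked.index, pd.MultiIndex)

# An empty Series, like the one rank() returned above, has only a flat index
empty = pd.Series([], dtype='float64')
empty_has_multiindex = isinstance(empty.index, pd.MultiIndex)

# unstack() needs a MultiIndex, so it fails on the empty Series
# (assumption: the exception type depends on the pandas version)
try:
    empty.unstack()
    unstack_failed = False
except (AttributeError, ValueError) as exc:
    unstack_failed = True
    print(type(exc).__name__, exc)
```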

A better approach, which avoids rank altogether, is to factorize the stacked values directly:

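import pandas

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    # Stack the selected columns into one long Series so the codes are shared across columns
    stacked = data[columns].stack()
    # factorize() returns (codes, uniques); keep the integer codes and restore the original shape
    data[columns] = pandas.Series(stacked.factorize()[0], index=stacked.index).unstack()
    return data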
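
As a quick check, the function can be applied to the small example frame from above. This is only a sketch: the function is repeated so the snippet runs on its own, and the variable names `cols` and `result` are illustrative. The actual integer codes depend on the order in which values first appear, so only their relationships are checked here.

```python
import numpy as np
import pandas as pd

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    stacked = data[columns].stack()
    data[columns] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
    return data

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})

cols = ['char_1', 'char_2', 'char_3']
# fillna(0) returns a new frame, so df itself is left untouched
result = factorizeMany(df.fillna(0), cols)
print(result)

# Equal values share one code across all columns ('cat' appears in char_1 and char_3)
same_cat_code = result.loc[0, 'char_1'] == result.loc[2, 'char_3']
# The filled-in 0 is treated as just another category
same_fill_code = result.loc[1, 'char_3'] == result.loc[3, 'char_3']
n_codes = len(set(result[cols].values.ravel()))
```

Because the columns are stacked before factorizing, the codes form one shared vocabulary for all selected columns. If each column should get its own independent codes, factorizing column by column would be the alternative.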