Christian Staudt - 11 months ago
Python Question

pandas: error on DataFrame.unstack

I wrote the following function to convert several columns of a dataframe into numeric values:

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    data[columns] = data[columns].stack().rank(method='dense').unstack()
    return data

Calling it like this

trainDataPre = factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])

gives me an error. I don't know where to look for the cause. Is the input possibly wrong?

AttributeError Traceback (most recent call last)
<ipython-input-14-357f8a4b2ef8> in <module>()
1 #trainDataPre = trainDataMerged.drop(["people_id", "activity_id", "date"], axis=1)
2 #trainDataPre = trainDataMerged.fillna(0)
----> 3 trainDataPre = mininggear.factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)])

/Users/cls/Dropbox/Datengräber/Kaggle/RedHat/ in factorizeMany(data, columns)
15 def factorizeMany(data, columns):
16 """ Factorize a bunch of columns in a data frame"""
---> 17 data[columns] = data[columns].stack().rank(method='dense').unstack()
18 return data

/usr/local/lib/python3.5/site-packages/pandas/core/ in unstack(self, level, fill_value)
2041 """
2042 from pandas.core.reshape import unstack
-> 2043 return unstack(self, level, fill_value)
2045 # ----------------------------------------------------------------------

/usr/local/lib/python3.5/site-packages/pandas/core/ in unstack(obj, level, fill_value)
405 else:
406 unstacker = _Unstacker(obj.values, obj.index, level=level,
--> 407 fill_value=fill_value)
408 return unstacker.get_result()

/usr/local/lib/python3.5/site-packages/pandas/core/ in __init__(self, values, index, level, value_columns, fill_value)
91 # when index includes `nan`, need to lift levels/strides by 1
---> 92 self.lift = 1 if -1 in self.index.labels[self.level] else 0
94 self.new_index_levels = list(index.levels)

AttributeError: 'Index' object has no attribute 'labels'

Answer Source

The error arises because fillna(0) replaces the NaNs with the number 0, so the subset of the dataframe you rank contains both numeric and categorical/string values.

Consider this case:

import numpy as np
import pandas as pd

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})

df = df.fillna(0)
df.stack().rank(method='dense')
Series([], dtype: float64)

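So you are effectively calling unstack on an empty Series, which is not what you intended. That empty Series has a plain Index rather than a MultiIndex, which is why the traceback ends with 'Index' object has no attribute 'labels'.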
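One way to check this diagnosis before ranking is to see whether the stacked columns hold more than one Python type. This is only a minimal sketch: has_mixed_types is a hypothetical helper name, not part of pandas, and the two small frames are made up for illustration.

```python
import pandas as pd

def has_mixed_types(frame, columns):
    """Return True if the selected columns hold values of more than one Python type.

    Hypothetical helper for illustration; not a pandas function.
    """
    stacked = frame[columns].stack()
    # Collect the Python type of every stacked value
    types = {type(value) for value in stacked}
    return len(types) > 1

# Strings plus the integer 0, as left behind by fillna(0)
mixed = pd.DataFrame({'char_1': ['cat', 'dog'],
                      'char_3': ['giraffe', 0]})
# Strings only
clean = pd.DataFrame({'char_1': ['cat', 'dog'],
                      'char_3': ['giraffe', 'cat']})

mixed_result = has_mixed_types(mixed, ['char_1', 'char_3'])
clean_result = has_mixed_types(clean, ['char_1', 'char_3'])
```

If the check returns True, ranking the stacked values is unlikely to be meaningful, and factorizing them is the safer route.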

A better approach, which avoids these complications, is to factorize the stacked values directly:

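import pandas as pd

def factorizeMany(data, columns):
    """ Factorize a bunch of columns in a data frame"""
    # Stack the columns into one Series so all values share a single set of codes
    stacked = data[columns].stack()
    # factorize() returns (codes, uniques); rebuild the frame from the integer codes
    data[columns] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
    return data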
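To see what the factorize approach produces, here is a small self-contained run on the example frame from above. It is a sketch under a few assumptions: factorize_many is a hypothetical variant that works on a copy and also returns the unique values, and the 0 entries are written directly into char_3 to stand in for the result of fillna(0).

```python
import pandas as pd

def factorize_many(data, columns):
    """Encode all values in `columns` with one shared set of integer codes.

    Hypothetical variant of the function above: it leaves `data` untouched
    and also returns the unique values so codes can be mapped back.
    """
    stacked = data[columns].stack()
    # Codes are assigned in order of first appearance, row by row
    codes, uniques = pd.factorize(stacked)
    out = data.copy()
    out[columns] = pd.Series(codes, index=stacked.index).unstack()
    return out, uniques

# Example frame with the NaNs already replaced by 0
df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', 0, 'cat', 0]})

cols = ['char_1', 'char_2', 'char_3']
encoded, uniques = factorize_many(df, cols)

print(encoded)
# 'cat' gets code 0 in both char_1 and char_3, and the filled-in 0 gets code 5
```

Note that the codes follow order of first appearance rather than sorted order, so they differ from what a dense rank would give on purely numeric data; the same value still always maps to the same code across all the columns.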