Duesentrieb - 2 months ago
Python Question

get_dummies python memory error

I'm relatively new to Python and I have a small problem with a data set. The data set has 400,000 rows and 300 variables. I have to get dummy variables for a categorical variable with 3000+ different items. In the end I want a data set with 3,300 variables or features so that I can train a RandomForest model.

Here is what I've tried to do:

df = pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_')], axis=1)


When I do that, I always get a memory error. Is there a limit to the number of variables I can have?

If I do it with only the first 1000 rows (which contain 374 different categories), it works fine.

Does anyone have a solution for my problem?

The machine I'm using is an Intel i7 with 8 GB of RAM.

Thank you

Answer

Update: it looks like get_dummies will return integers by default starting with version 0.19.0 (https://github.com/pydata/pandas/issues/8725).

Here are a couple of possibilities to try. Both will reduce the memory footprint of the dataframe substantially, but you could still run into memory issues later. It's hard to predict; you'll just have to try.

(note that I am simplifying the output of info() below)

df = pd.DataFrame({ 'itemID': np.random.randint(1,4,100) })

pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_')], axis=1).info()

itemID       100 non-null int32
itemID__1    100 non-null float64
itemID__2    100 non-null float64
itemID__3    100 non-null float64

memory usage: 3.5 KB

Here's our baseline. Each dummy column takes up 800 bytes, because the sample data has 100 rows and get_dummies appears to default to float64 (8 bytes per value). This seems like an unnecessarily inefficient way to store dummies, since a single bit per value would suffice, but there may be some reason for it that I'm not aware of.
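Just to illustrate the bit-per-value idea (pandas doesn't offer this, and the names below are only a sketch), NumPy's packbits can squeeze eight 0/1 values into a single byte:

```python
import numpy as np

# Illustration only: pandas does not store dummies as bits; packbits just
# shows how small one-bit dummies could be in principle.
np.random.seed(0)
codes = np.random.randint(0, 3, 100)        # 100 rows, 3 categories
dummies = np.eye(3, dtype=np.uint8)[codes]  # dense 0/1 matrix: 300 bytes
packed = np.packbits(dummies, axis=0)       # 8 rows per byte, per column
print(dummies.nbytes, packed.nbytes)        # 300 vs 39 bytes
```

Of course you'd have to unpack before feeding the data to most estimators, so this is more a demonstration of the theoretical floor than a practical fix.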

So, first attempt: just change to a one-byte integer. This doesn't seem to be an option for get_dummies itself, so it has to be done as a conversion with astype(np.int8).

pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_').astype(np.int8)], 
                              axis=1).info()

itemID       100 non-null int32
itemID__1    100 non-null int8
itemID__2    100 non-null int8
itemID__3    100 non-null int8

memory usage: 1.5 KB

Each dummy column now takes up 1/8 the memory as before.
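You can confirm the per-column figure directly with memory_usage (the exact byte counts assume the 100-row sample above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})
dummies = pd.get_dummies(df['itemID'], prefix='itemID_').astype(np.int8)
# int8 stores one byte per row, so each dummy column is 100 bytes
# instead of the 800 a float64 column would need.
print(dummies.memory_usage(index=False))
```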

Alternatively, you can use the sparse option of get_dummies.

pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_',sparse=True)], 
                              axis=1).info()

itemID       100 non-null int32
itemID__1    100 non-null float64
itemID__2    100 non-null float64
itemID__3    100 non-null float64

memory usage: 2.0 KB

Fairly comparable savings. The info() output somewhat hides how the savings occur, but you can look at the memory usage value to see the total savings.

It's hard to say which of these will work better in practice (if either is a practical solution at all), so you'll just need to give each a try. In theory you could even combine the two approaches, but I wouldn't try that unless each approach turns out to work on its own.
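If you do want to combine them, newer pandas versions add a dtype argument to get_dummies, so a single call can request sparse storage of one-byte integers (a sketch, assuming a pandas version that supports both parameters):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})
# Sparse one-byte dummies in one call; each column gets a Sparse[int8] dtype.
combined = pd.get_dummies(df['itemID'], prefix='itemID_',
                          sparse=True, dtype=np.int8)
print(combined.dtypes.iloc[0])  # Sparse[int8, 0]
```

Note that with only three categories the sparse representation isn't actually smaller than dense int8 (the sparse indices have overhead), but with 3000+ mostly-zero columns the savings become substantial.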
