
How to apply one-hot encoding to big files?

I have a 3 GB train file, a 350 MB test file, and 6 GB of RAM. The files don't seem too big to hold in memory, but I can't even append one to the other (a memory error occurs).

My current approach looks like this:

# reading the files
import pandas as pd

dtypes = {'Semana': 'int32',
          'Agencia_ID': 'int32',
          'Canal_ID': 'int32',
          'Ruta_SAK': 'int32',
          'Cliente_ID': 'int32',
          'Producto_ID': 'int32',
          'Venta_hoy': 'float32',
          'Venta_uni_hoy': 'int32',
          'Dev_uni_proxima': 'int32',
          'Dev_proxima': 'float32',
          'Demanda_uni_equil': 'int32'}

train = pd.read_csv('train.csv', dtype = dtypes, usecols=["Semana", "Agencia_ID", "Canal_ID", 'Ruta_SAK', 'Cliente_ID', 'Producto_ID','Demanda_uni_equil'])
test = pd.read_csv('test.csv',dtype = dtypes)
ids = test['id']
test.drop(['id'], axis =1, inplace = True)

shapeTrain = train.shape[0]
shapeTest = test.shape[0]

train = train.append(test) # raises memory error

#applying one-hot encoding
train = pd.concat([train, pd.get_dummies(train['Semana'],sparse=True)], axis=1, join_axes=[train.index])
train.drop([11,'Semana'],axis=1, inplace = True)

train = pd.concat([train, pd.get_dummies(train['Producto_ID'],sparse=True)], axis=1, join_axes=[train.index])
train.drop([123,'Producto_ID'],axis=1, inplace = True)


train = pd.concat([train, pd.get_dummies(train['Canal_ID'],sparse=True)], axis=1, join_axes=[train.index])
train.drop([11,'Canal_ID'],axis=1, inplace = True)

# separating back to train and test
test = train[shapeTrain:shapeTrain+shapeTest]
train = train[0:shapeTrain]

test['id'] = ids

train.to_csv('train_1.csv', index=False)
test.to_csv('test_1.csv', index=False)


I see a few possible ways to solve this:

1) Somehow iterate through the files:

df_train = pd.read_csv('train.csv', chunksize=1500)
for chunk in df_train:
    # apply one-hot encoding to each chunk
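
A rough sketch of that route, assuming the possible category values are known up front (the semana_values list and the chunk size below are just placeholders); reindexing against a fixed column list makes every chunk produce the same dummy columns, and each encoded chunk is appended to disk instead of being kept in memory:

import pandas as pd

# assumed example: category values collected beforehand (e.g. from a cheap
# first pass that reads only this column)
semana_values = [3, 4, 5, 6, 7, 8, 9]
dummy_cols = ['Semana_%d' % v for v in semana_values]

first = True
for chunk in pd.read_csv('train.csv', dtype=dtypes, chunksize=100000):
    dummies = pd.get_dummies(chunk['Semana'], prefix='Semana')
    # force every chunk onto the same columns, in the same order
    dummies = dummies.reindex(columns=dummy_cols, fill_value=0)
    out = pd.concat([chunk.drop('Semana', axis=1), dummies], axis=1)
    out.to_csv('train_1.csv', mode='w' if first else 'a', header=first, index=False)
    first = False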


2) Decrease the number of columns by applying one-hot encoding only to the categories seen in train, then encode test with the same columns, setting every dummy column to 0 for values that never appear in train (a sketch follows).
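
A minimal sketch of that idea, assuming the dummy columns are built from train only and test is then reindexed to those same columns (the column names follow pandas' default prefix_value pattern):

import pandas as pd

# one-hot encode train and remember which dummy columns were produced
train_dummies = pd.get_dummies(train['Producto_ID'], prefix='Producto_ID')
train = pd.concat([train.drop('Producto_ID', axis=1), train_dummies], axis=1)

# encode test, then force it onto the train columns; products that never
# appeared in train end up with 0 in every dummy column
test_dummies = pd.get_dummies(test['Producto_ID'], prefix='Producto_ID')
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
test = pd.concat([test.drop('Producto_ID', axis=1), test_dummies], axis=1)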

3) Use the hashing trick!
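
And a minimal sketch of the hashing trick using scikit-learn's FeatureHasher (the n_features value is an arbitrary example): it maps every category into a fixed number of columns, so the output width no longer depends on how many distinct Producto_ID values exist.

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=256, input_type='string')
# each row is hashed as the string 'Producto_ID=<value>' into 256 columns
hashed = hasher.transform([['Producto_ID=%d' % v] for v in train['Producto_ID']])
print(hashed.shape)  # (n_rows, 256), a scipy.sparse matrix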

How can I solve the problem?

Answer

How about specifying the data types when loading, so pandas doesn't fall back to memory-hungry 64-bit (or object) defaults:

import numpy as np
import pandas as pd

# pick the narrowest dtype that fits each column, e.g. np.int32 / np.float32
types = {'col1': np.dtype(np.int32),
         'col2': np.dtype(np.int32),
         'col3': np.dtype(np.int32),
         'col4': np.dtype(np.float32),
         'col5': np.dtype(np.float32)}

train = pd.read_csv('train.csv', dtype=types)

Run train.info() if you are able to load it, and check the memory usage.
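
For instance (memory_usage='deep' also counts the real size of object/string columns):

train.info(memory_usage='deep')
print(train.memory_usage(deep=True).sum() / 1024 ** 2, 'MB')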