valentjedi valentjedi - 1 month ago 14
Python Question

RandomForestClassifier.fit uses different amounts of RAM on different machines

For some reason, RandomForestClassifier.fit from sklearn.ensemble uses only 2.5 GB of RAM on my local machine but almost 7 GB on my server with exactly the same training set.

The code without imports is pretty much this:

y_train = data_train['train_column']
x_train = data_train.drop('train_column', axis=1)

# The difference in memory consumption starts here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(x_train, y_train)
preds = clf.predict(data_test)


My local machine is a MacBook Pro with 16 GB of memory and a 4-core CPU.
My server is an Ubuntu server on the DigitalOcean cloud with 8 GB of memory and also a 4-core CPU.

The sklearn version is 0.18 and the Python version is 3.5.2.

I can't even imagine the possible reasons; any help would be appreciated.
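To compare memory use between the two machines on equal terms, it can help to record the peak resident set size of the process around the call to fit. The following is a minimal sketch using only the standard-library resource module (Unix only). The helper names peak_rss_mb, measure_peak and allocate are hypothetical, and allocate is just a stand-in workload rather than the real training step. Note that ru_maxrss is reported in bytes on macOS but in kilobytes on Linux, so raw values from a Mac and an Ubuntu server are not directly comparable.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS but in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_peak(func, *args, **kwargs):
    """Call func and return (result, peak RSS in MB read after the call)."""
    result = func(*args, **kwargs)
    return result, peak_rss_mb()


def allocate(n_bytes):
    # Hypothetical stand-in workload; in the real script this would be
    # something like clf.fit(x_train, y_train).
    return bytearray(n_bytes)


if __name__ == "__main__":
    before = peak_rss_mb()
    _, after = measure_peak(allocate, 50 * 1024 * 1024)
    print("peak RSS before: %.1f MB, after: %.1f MB" % (before, after))
```

The peak value only ever grows during the life of a process, so each measurement is best taken in a fresh Python process on each machine.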

UPDATE

The MemoryError is raised in this code inside the fit method:

# Parallel loop: we use the threading backend as the Cython code
# for fitting the trees is internally releasing the Python GIL
# making threading always more efficient than multiprocessing in
# that case.
trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                 backend="threading")(
    delayed(_parallel_build_trees)(
        t, self, X, y, sample_weight, i, len(trees),
        verbose=self.verbose, class_weight=self.class_weight)
    for i, t in enumerate(trees))


UPDATE 2

Information about my systems:

# local
Darwin-16.1.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 05:05:28)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18

# server
Linux-3.13.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
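A report like the one above can be gathered with a short script. This is a minimal sketch, not necessarily the exact commands used for the listing. The helper name collect_versions and its packages parameter are hypothetical, and packages that are not installed are reported as such instead of raising.

```python
import importlib
import platform
import sys

# (label, module name) pairs to report; the defaults mirror the listing above.
DEFAULT_PACKAGES = (
    ("NumPy", "numpy"),
    ("SciPy", "scipy"),
    ("Scikit-Learn", "sklearn"),
)


def collect_versions(packages=DEFAULT_PACKAGES):
    """Return a list of lines describing the platform and package versions."""
    lines = [platform.platform(), "Python " + sys.version]
    for label, module_name in packages:
        try:
            module = importlib.import_module(module_name)
            version = getattr(module, "__version__", "unknown")
        except ImportError:
            version = "not installed"
        lines.append("%s %s" % (label, version))
    return lines


if __name__ == "__main__":
    print("\n".join(collect_versions()))
```

Running the same script on both machines makes differences such as the Python patch version or the kernel release easy to spot side by side.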


Also, my NumPy configs:

# server
>>> np.__config__.show()
blas_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
blas_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c


# local
>>> np.__config__.show()
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blas_mkl_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3']
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE


The repr of the clf object is the same on both machines:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

Answer

Well, the issue magically went away after I updated the kernel from 3.13.0-57 to 4.4.0-28. Now it uses even less memory than on my local Mac laptop.
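Since the kernel release turned out to matter here, it may be worth logging it next to any memory measurement. Below is a minimal sketch that parses a release string such as the one returned by platform.release() into a tuple that can be compared. The helper name parse_kernel_release is hypothetical, and the comparison is only meant for recording which kernel a run used, not as a claim about which exact version resolves the problem.

```python
import platform
import re


def parse_kernel_release(release):
    """Extract the leading numeric parts of a kernel release string.

    For example, '3.13.0-57-generic' becomes (3, 13, 0, 57).
    Returns an empty tuple if the string does not start with x.y.z.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if match is None:
        return ()
    return tuple(int(part) for part in match.groups() if part is not None)


if __name__ == "__main__":
    release = platform.release()
    print("kernel release:", release, "parsed:", parse_kernel_release(release))
```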