I have a fairly large training matrix (over 1 billion rows, two features per row). There are two classes (0 and 1).
This is too large for a single machine, but fortunately I have about 200 MPI hosts at my disposal. Each is a modest dual-core workstation.
Feature generation is already successfully distributed.
The answers to Multiprocessing scikit-learn suggest it is possible to distribute the work of an SGDClassifier:
You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial_fit again.
Here is what I came up with using mpi4py:
from mpi4py import MPI
import numpy as np
from sklearn import linear_model

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank loads only its own shard of the data (user-defined helpers).
local_matrix = get_local_matrix()
local_vector = get_local_vector()

estimator = linear_model.SGDClassifier()
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

average_coefs = None
avg_intercept = None

# First averaging round: workers send their weights, rank 0 averages them.
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])

# Broadcast the averaged weights back to every rank and do another pass.
estimator.coef_ = comm.bcast(average_coefs, root=0)
estimator.intercept_ = np.array([comm.bcast(avg_intercept, root=0)])
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

# Second averaging round: rank 0 keeps the final averaged model.
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])
    estimator.coef_ = average_coefs
    estimator.intercept_ = np.array([avg_intercept])
    print("The estimator at rank 0 should now be working")
Training a linear model on a dataset with 1e9 samples and only 2 features is very likely to underfit, or to waste CPU / IO time if the data is actually linearly separable. Don't waste time thinking about parallelizing such a problem with a linear model:
either switch to a more complex class of models (e.g. train random forests on smaller partitions of the data that fit in memory and aggregate them; see the first sketch below),
or select random subsamples of your dataset of increasing size and train linear models on them. Measure the predictive accuracy on a held-out test set and stop when you see diminishing returns, probably after a few tens of thousands of samples of the minority class (see the second sketch below).
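For the first option, here is a minimal sketch of training forests on in-memory partitions and aggregating them by averaging their predicted class probabilities. The partition count, the load_partition() helper, and the aggregation-by-averaging choice are illustrative assumptions, not something prescribed above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_partitions = 200                      # assumed: e.g. one partition per host
forests = []
for partition_id in range(n_partitions):
    # load_partition() is a hypothetical helper returning one in-memory shard
    X_part, y_part = load_partition(partition_id)
    rf = RandomForestClassifier(n_estimators=50)
    rf.fit(X_part, y_part)
    forests.append(rf)

def predict_aggregated(X):
    # Aggregate the forests by averaging their predicted class probabilities
    probas = np.mean([rf.predict_proba(X) for rf in forests], axis=0)
    return forests[0].classes_[np.argmax(probas, axis=1)]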
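For the second option, a minimal sketch of the learning-curve approach, assuming a hypothetical draw_random_subsample(n) helper that materializes n random rows and labels, and a fixed held-out set X_test / y_test drawn once up front; the subsample sizes and the stopping threshold are illustrative:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

prev_score = 0.0
for n in [10_000, 30_000, 100_000, 300_000, 1_000_000]:
    X_sub, y_sub = draw_random_subsample(n)    # hypothetical helper
    clf = SGDClassifier()
    clf.fit(X_sub, y_sub)
    score = accuracy_score(y_test, clf.predict(X_test))
    print("n=%d: held-out accuracy=%.4f" % (n, score))
    if score - prev_score < 1e-3:              # diminishing returns threshold (arbitrary)
        break
    prev_score = score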