
How to convert a PythonRDD with sparse data into a dense PythonRDD

I want to use StandardScaler to scale the data. I've loaded the data into a PythonRDD, and it seems the data is sparse. To apply StandardScaler, we should first convert it into a dense type.

from pyspark.mllib.util import MLUtils
from pyspark.mllib.feature import StandardScaler

trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)
trainLabel = trainData.map(lambda x: x.label)
trainFeatures = trainData.map(lambda x: x.features)
valLabel = valData.map(lambda x: x.label)
valFeatures = valData.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)

# apply the scaler to the data. Here, trainFeatures is a sparse PythonRDD; we first need to convert it into a dense type
trainFeatures_scaled = scaler.transform(trainFeatures)
valFeatures_scaled = scaler.transform(valFeatures)

# merge `trainLabel` and `trainFeatures_scaled` into a new PythonRDD
trainData1 = ...
valData1 = ...

# use the scaled data, i.e., trainData1 and valData1, to train a model
...


The above code has errors. I have two questions:


  1. How to convert the sparse PythonRDD trainFeatures into a dense type that can be used as the input to StandardScaler?

  2. How to merge trainLabel and trainFeatures_scaled into a new RDD of LabeledPoint that can be used to train a classifier (e.g. a random forest)?



I still haven't found any documents or references about this.

Answer

To convert to dense, map using toArray:

from pyspark.mllib.linalg import DenseVector

dense = valFeatures.map(lambda v: DenseVector(v.toArray()))
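
The same conversion applies to trainFeatures before fitting, since with withMean=True the transform centers the data and (depending on the Spark version) may reject sparse input outright. A minimal sketch reusing the question's variables:

# Densify the training features as well before fitting the scaler
trainDense = trainFeatures.map(lambda v: DenseVector(v.toArray()))
scaler = StandardScaler(withMean=True, withStd=True).fit(trainDense)
trainFeatures_scaled = scaler.transform(trainDense)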

To merge, zip the two RDDs and map each pair to a LabeledPoint:

from pyspark.mllib.regression import LabeledPoint

valLabel.zip(dense).map(lambda lf: LabeledPoint(lf[0], lf[1]))
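
Putting both steps together with a classifier, a minimal end-to-end sketch might look like the following. The random forest parameters (numClasses=2, numTrees=10) are illustrative assumptions, not values from the question:

from pyspark.mllib.util import MLUtils
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)

# Densify the feature vectors so mean-centering is possible
trainFeatures = trainData.map(lambda x: DenseVector(x.features.toArray()))
valFeatures = valData.map(lambda x: DenseVector(x.features.toArray()))

scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)

# zip() pairs elements positionally; it is safe here because both RDDs
# derive from the same parent via one-to-one map() transformations
trainData1 = (trainData.map(lambda x: x.label)
              .zip(scaler.transform(trainFeatures))
              .map(lambda lf: LabeledPoint(lf[0], lf[1])))
valData1 = (valData.map(lambda x: x.label)
            .zip(scaler.transform(valFeatures))
            .map(lambda lf: LabeledPoint(lf[0], lf[1])))

# Train a random forest on the scaled data; numClasses and numTrees
# are illustrative placeholders
model = RandomForest.trainClassifier(trainData1, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=10)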