mining mining - 5 months ago 10
Python Question

How to convert a PythonRDD with sparse data into a dense PythonRDD

I want to use StandardScaler to scale the data. I've loaded the data into a PythonRDD, and the data appears to be sparse. To apply StandardScaler, we should first convert it into dense types.

trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)
trainLabel = trainData.map(lambda x: x.label)
trainFeatures = trainData.map(lambda x: x.features)
valLabel = valData.map(lambda x: x.label)
valFeatures = valData.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)

# apply the scaler to the data. Here, trainFeatures is a sparse PythonRDD, so it should first be converted into a dense type
trainFeatures_scaled = scaler.transform(trainFeatures)
valFeatures_scaled = scaler.transform(valFeatures)

# merge `trainLabel` and `trainFeatures_scaled` into a new PythonRDD
trainData1 = ...
valData1 = ...

# using the scaled data, i.e., trainData1 and valData1 to train a model

The above code has errors. I have two questions:

  1. How to convert the sparse PythonRDD trainFeatures into a dense type that can be used as the input of StandardScaler?

  2. How to merge trainLabel and trainFeatures_scaled into a new LabeledPoint that can be used to train a classifier (e.g. random forest)?

I still haven't found any documentation or references about this.


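To convert to dense, map using toArray:

dense = trainFeatures.map(lambda v: DenseVector(v.toArray()))

To merge, zip the labels with the features and map the pairs to LabeledPoint:

trainLabel.zip(dense).map(lambda lf: LabeledPoint(lf[0], lf[1]))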