I have a data set that contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third is a boolean flag.
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possibility is in Holden's answer; here is another one:
You can use the sampleByKeyExact transformation from the PairRDDFunctions class, which is still experimental for now (Spark 1.4.1).
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
::Experimental:: Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
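To make the size guarantee in that description concrete, here is a small sketch in plain Scala (no Spark needed) that applies the quoted formula, math.ceil(numItems * samplingRate), to two strata. The helper name exactSampleSize and the user labels and counts are hypothetical, chosen only for illustration.

```scala
// Per-stratum sample size as described in the documentation quoted above:
// math.ceil(numItems * samplingRate) for each key.
def exactSampleSize(numItems: Long, samplingRate: Double): Long =
  math.ceil(numItems * samplingRate).toLong

// Hypothetical strata: one user with 10 purchases, another with 2.
val itemsPerUser = Map("userA" -> 10L, "userB" -> 2L)

// Apply the same 0.8 fraction to every user.
val sizes = itemsPerUser.map { case (user, n) => user -> exactSampleSize(n, 0.8) }

// Total number of rows expected in the sample across all users.
val total = sizes.values.sum
```

Because of the ceiling, small strata are rounded up: a user with 2 rows and a fraction of 0.8 keeps both rows, not 1.6.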
Here is how I would do it.
Consider the following list:
val list = List((2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1))
I would create a pair RDD, using the user ID as the key:
val data = sc.parallelize(list.toSeq).map(x => (x._1,(x._2,x._3)))
The fractions argument of sampleByKeyExact takes a Map with one fraction per key, so I set up the fractions as follows:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
Here I map over the keys to find the distinct ones, associate each key with a fraction of 0.8, and collect the result as a Map.
To sample, all I have to do is:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
Or, equivalently, with named arguments:
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 2L)
You can check the counts on your keys, on the data, or on the sample:
scala> data.count
[...]
res10: Long = 12
scala> sampleData.count
[...]
res11: Long = 10
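The total count alone does not show whether each user was sampled at the requested rate. The sketch below is one way to check the counts per key with countByKey, and to keep the rows that were not sampled as a held-out set. It is an illustration under assumptions, not part of the approach above: the object name PerUserSplit, the split helper, and the small synthetic data set are hypothetical, and using subtract for the remainder assumes that every row is unique (duplicates of a sampled row would be removed as well).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerUserSplit {
  type Row = (Int, (Int, Int)) // (userId, (productId, flag))

  // Samples the same fraction of rows for every key and returns
  // (sampled rows, remaining rows). Assumes rows are unique.
  def split(data: RDD[Row], fraction: Double, seed: Long): (RDD[Row], RDD[Row]) = {
    val fractions = data.keys.distinct().map(k => (k, fraction)).collectAsMap()
    val sampled = data
      .sampleByKeyExact(withReplacement = false, fractions = fractions, seed = seed)
      .cache() // reused below, so avoid recomputing the sample
    (sampled, data.subtract(sampled))
  }

  def main(args: Array[String]): Unit = {
    // Local context for illustration; in spark-shell, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("per-user-split").setMaster("local[2]"))

    // Synthetic data: user 1 has 10 rows, user 2 has 2 rows.
    val rows = (1 to 10).map(p => (1, (p, 1))) ++ (11 to 12).map(p => (2, (p, 1)))
    val data = sc.parallelize(rows)

    val (train, test) = split(data, 0.8, 2L)

    // Rows per user before and after sampling.
    println(s"all rows per user:     ${data.countByKey()}")
    println(s"sampled rows per user: ${train.countByKey()}")
    println(s"held-out rows:         ${test.count()}")

    sc.stop()
  }
}
```

Given the documented math.ceil(numItems * samplingRate) guarantee, the sample should contain 8 rows for the user with 10 rows and 2 rows for the user with 2 rows, which is consistent with the total of 10 out of 12 shown above.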