ulrich ulrich - 2 months ago 21
Scala Question

UDF to randomly assign values based on different probabilities

I would like to create a UDF to randomly assign values based on different probabilities.

In the following example, the assigned value depends on the number returned by rand:


  • 0 to 0.5 the value should be A (50% probability)

  • 0.8 to 1 the value should be B (20% probability)

  • anything else the value should be C (30% probability)



import org.apache.spark.sql.functions.{abs, rand, round, udf}
import sqlContext.implicits._

val names = Array("A", "B", "C")

val allocate = udf((p: Double) => {
  if (p < 0.5) names(0)
  else if (p > 0.8) names(1)
  else names(2)
})

val test = sqlContext.range(0, 100).select(
  $"id",
  round(abs(rand), 2).alias("val"),
  allocate(abs(rand)).alias("name")
)


However, when I print the result, the names do not follow the rules defined in the UDF:

+---+----+----+
| id| val|name|
+---+----+----+
| 0|0.17| C| => should be A
| 1|0.12| A|
| 2|0.36| A|
| 3|0.56| B|
| 4|0.82| A| => should be B

Answer

There is nothing unexpected going on here. You call the rand function twice, so you get two different random values: the one displayed in the val column is not the one passed to allocate.

Either provide the same seed for both calls:

sqlContext.range(0, 100)
  .select(
    $"id", 
    abs(rand(1)).alias("val"),
    allocate(abs(rand(1))).alias("name") 
  )

or reuse the value:

sqlContext.range(0, 100)
  .withColumn("val", abs(rand))
  .withColumn("name", allocate($"val"))
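
The effect can also be reproduced without Spark. The following is only a plain-Scala sketch of the idea, not the Spark code itself: scala.util.Random stands in for rand, and AllocateDemo, drawTwice and drawOnce are hypothetical names introduced for illustration. The thresholds are copied from the UDF in the question.

```scala
import scala.util.Random

object AllocateDemo {
  val names = Array("A", "B", "C")

  // Same thresholds as the UDF in the question.
  def allocate(p: Double): String =
    if (p < 0.5) names(0)
    else if (p > 0.8) names(1)
    else names(2)

  // Mirrors the question: one draw is displayed, a second,
  // independent draw is the one that actually gets classified.
  def drawTwice(rng: Random): (Double, String) = {
    val shown = rng.nextDouble()
    val classified = rng.nextDouble()
    (shown, allocate(classified))
  }

  // Mirrors the "reuse the value" fix: the displayed draw
  // is the same one that gets classified.
  def drawOnce(rng: Random): (Double, String) = {
    val p = rng.nextDouble()
    (p, allocate(p))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1L)
    // Count rows whose name disagrees with the displayed value.
    val mismatched = Seq.fill(100)(drawTwice(rng))
      .count { case (p, name) => allocate(p) != name }
    // With a single draw per row, every name follows the rules.
    val consistent = Seq.fill(100)(drawOnce(rng))
      .forall { case (p, name) => allocate(p) == name }
    println(s"rows where name disagrees with val (two draws): $mismatched")
    println(s"all rows consistent (one draw): $consistent")
  }
}
```

With two draws per row, some names will usually disagree with the displayed value, just as in the table in the question. With one draw per row they always agree.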