ulrich ulrich - 1 month ago 8x
Scala Question

UDF to randomly assign values based on different probabilities

I would like to create a UDF to randomly assign values based on different probabilities.

In the following example, depending on the value returned by rand:

  • 0 to 0.5 the value should be A (50% probability)

  • 0.8 to 1 the value should be B (20% probability)

  • anything else the value should be C (30% probability)

val names = Array("A", "B", "C")

val allocate = udf((p: Double) => {
  if (p < 0.5) names(0)
  else if (p > 0.8) names(1)
  else names(2)
})

val test = sqlContext.range(0, 100).select(($"id"),(round(abs(rand),2)).alias("val"), allocate(abs(rand)).alias("name"))

However, when I print the result, the names are not assigned according to the rules defined in the UDF.

| id| val|name|
| 0|0.17| C| => should be A
| 1|0.12| A|
| 2|0.36| A|
| 3|0.56| B|
| 4|0.82| A| => should be B


There is nothing unexpected going on here. You call the rand function twice, so you get two different random values: the one displayed in the val column is not the one passed to allocate.
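To make the effect concrete, here is a minimal plain-Scala sketch that runs without Spark. It assumes scala.util.Random as a stand-in for Spark's rand; the object name TwoDrawsDemo and the helper rowWithTwoDraws are hypothetical, made up for illustration only.

```scala
import scala.util.Random

object TwoDrawsDemo {
  val names = Array("A", "B", "C")

  // Same rules as the UDF in the question.
  def allocate(p: Double): String =
    if (p < 0.5) names(0)
    else if (p > 0.8) names(1)
    else names(2)

  // Mimics the original select: one draw is displayed,
  // while a second, independent draw is fed to allocate.
  def rowWithTwoDraws(rng: Random): (Double, String) = {
    val shown = rng.nextDouble()
    val used = rng.nextDouble()
    (shown, allocate(used))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(0L)
    val rows = Seq.fill(1000)(rowWithTwoDraws(rng))
    // Count rows where the name does not match the displayed value.
    val mismatches = rows.count { case (shown, name) => allocate(shown) != name }
    println(s"rows where name disagrees with the displayed value: $mismatches of ${rows.size}")
  }
}
```

Because the two draws are independent, many rows end up with a name that does not match the value shown next to it, which is the pattern in the question's output.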

Either provide the same seed for both calls:

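// The seed (1 here) is an arbitrary illustrative choice; any fixed value works as long as both calls share it.
sqlContext.range(0, 100).select($"id", round(abs(rand(1)), 2).alias("val"), allocate(abs(rand(1))).alias("name"))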

or reuse the value:

sqlContext.range(0, 100)
  .withColumn("val", abs(rand))
  .withColumn("name", allocate($"val"))
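
As a sanity check on the "draw once, reuse" idea, here is a small plain-Scala sketch that runs without Spark. It assumes scala.util.Random in place of Spark's rand; ReuseDrawDemo and rowWithOneDraw are hypothetical names used for illustration only.

```scala
import scala.util.Random

object ReuseDrawDemo {
  val names = Array("A", "B", "C")

  // Same rules as the UDF in the question.
  def allocate(p: Double): String =
    if (p < 0.5) names(0)
    else if (p > 0.8) names(1)
    else names(2)

  // Draw once and derive the name from that same value, analogous to
  // withColumn("val", ...) followed by withColumn("name", allocate($"val")).
  def rowWithOneDraw(rng: Random): (Double, String) = {
    val p = rng.nextDouble()
    (p, allocate(p))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(0L)
    val rows = Seq.fill(10000)(rowWithOneDraw(rng))
    // Observed share of each name; should land near 0.5 / 0.2 / 0.3.
    val shares = rows.groupBy(_._2).map { case (n, rs) => n -> rs.size.toDouble / rows.size }
    names.foreach(n => println(f"$n: ${shares.getOrElse(n, 0.0)}%.3f"))
  }
}
```

Every row is consistent with the rules by construction, and over many rows the shares should approach the intended 50% / 20% / 30% split.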