João_testeSW - 1 year ago 82
Scala Question

Spark MLlib - <console>:37: error: value filter is not a member of Long


I have a dataset that groups all the products by Transaction_ID, and I want to exclude the Transaction_IDs that have fewer than two products. For that I'm using this:

val edges = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID"))).count().filter("count >= 2")

But when I execute this I'm getting this error:

<console>:37: error: value filter is not a member of Long

How can I solve this problem?

Many thanks!

Answer

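The error occurs because count() is called on the DataFrame returned by withColumn. On a DataFrame, count() returns the total number of rows as a Long, and a Long has no filter method. To get a count per Transaction_ID, call count() directly on the result of groupBy. You can try it like below.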

val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"),("tx-4", "ccc")).toDF("Transaction_ID", "Product_ID")

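+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|       aaa|
|          tx-2|       bbb|
|          tx-1|       ccc|
|          tx-4|       ccc|
+--------------+----------+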

If you only need the Transaction_ID values, you can use:

val df4 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
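To make the difference between the two count() calls concrete, here is a small sketch with explicit type annotations. It assumes a local SparkSession and reuses the sample data from this answer; the object name CountTypes and the helper perGroupCounts are only illustrative names, not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object CountTypes {

  // count() on grouped data returns a DataFrame with one row per
  // Transaction_ID and a column named "count".
  def perGroupCounts(df: DataFrame): DataFrame = {
    val grouped: RelationalGroupedDataset = df.groupBy(col("Transaction_ID"))
    grouped.count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-types")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc"))
      .toDF("Transaction_ID", "Product_ID")

    // A DataFrame, so filter is available here.
    val perGroup: DataFrame = perGroupCounts(df)
    val atLeastTwo: DataFrame = perGroup.filter(col("count") >= 2)
    atLeastTwo.show()

    // count() on a DataFrame is an action that returns a Long (number of rows),
    // which is why chaining .filter after it does not compile.
    val totalRows: Long = perGroup.count()
    println(totalRows) // 3, one row per distinct Transaction_ID

    spark.stop()
  }
}
```

In the code from the question, count() was applied after agg and withColumn, so it was the second kind (Long), not the per-group kind.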

If you want both Transaction_ID and Product_ID, compute the counts and the product lists separately and then join them:

val df1 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
val df2 = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID")))
val df3 = df1.join(df2, df1("Transaction_ID") === df2("Transaction_ID"), "inner").select(df2("Transaction_ID"),df2("Product_ID"))

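+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|   aaa,ccc|
+--------------+----------+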
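As a possible alternative to the join, the same result can probably be obtained in a single aggregation by filtering on the size of the collected list before joining it into a string. This is only a sketch under the same sample data; the object name SingleAggregation, the helper multiProductTransactions and the intermediate column name "products" are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, size}

// Hypothetical object name, used only for this illustration.
object SingleAggregation {

  // Keeps only transactions with at least two products, without a join.
  def multiProductTransactions(df: DataFrame): DataFrame =
    df.groupBy(col("Transaction_ID"))
      // Collect the products of each transaction into an array column.
      .agg(collect_list(col("Product_ID")) as "products")
      // size() gives the number of elements in the array.
      .filter(size(col("products")) >= 2)
      // Turn the array into a comma-separated string, as in the question.
      .select(col("Transaction_ID"), concat_ws(",", col("products")) as "Product_ID")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("single-aggregation")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc"))
      .toDF("Transaction_ID", "Product_ID")

    multiProductTransactions(df).show()

    spark.stop()
  }
}
```

Note that collect_list does not guarantee the order of the collected products, so the order inside the comma-separated string may vary between runs.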