João_testeSW - 2 months ago
Scala Question

Spark MLlib - <console>:37: error: value filter is not a member of Long

Hi,

I have a dataset in which I group all the products by Transaction_ID, and I want to exclude any Transaction_ID that has fewer than two products. For that I'm using this:


val edges = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID"))).count().filter("count >= 2")


But when I execute this I'm getting this error:

<console>:37: error: value filter is not a member of Long


How can I solve this problem?

Many thanks!

Answer

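The error occurs because count() called on a DataFrame is an action that returns a Long (the number of rows), so there is nothing left to call filter on. Called directly on the result of groupBy, count() instead returns a DataFrame with a "count" column, which can be filtered. You can try it like below.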

val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc")).toDF("Transaction_ID", "Product_ID")
df.show

+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|       aaa|
|          tx-2|       bbb|
|          tx-1|       ccc|
|          tx-4|       ccc|
+--------------+----------+

If you want only the Transaction_ID, you can use:

val df4 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
df4.show

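If you want both Transaction_ID and Product_ID, join the filtered IDs back to the aggregated product lists:

// Transactions that have at least two products
val df1 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
// Comma-separated product list for every transaction
val df2 = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID")))
// Keep only the product lists whose transaction appears in df1
val df3 = df1.join(df2, df1("Transaction_ID") === df2("Transaction_ID"), "inner").select(df2("Transaction_ID"), df2("Product_ID"))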
df3.show

+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|   aaa,ccc|
+--------------+----------+
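
The same result can probably be had without the join, by computing the product list and the product count in one agg call and filtering on the count. The following is only a sketch under some assumptions: it runs as a standalone program with a local SparkSession rather than in spark-shell, and the names MultiProductTransactions, keepMultiProduct, minProducts and n_products are made up for illustration and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, count}

// Hypothetical helper object; the names here are illustrative only.
object MultiProductTransactions {

  // Groups by Transaction_ID once, building the comma-separated product
  // list and a per-group product count in the same aggregation, then
  // keeps the groups with at least `minProducts` products.
  def keepMultiProduct(df: DataFrame, minProducts: Int): DataFrame =
    df.groupBy(col("Transaction_ID"))
      .agg(
        concat_ws(",", collect_list(col("Product_ID"))) as "Product_ID",
        count(col("Product_ID")) as "n_products" // helper column, dropped below
      )
      .filter(col("n_products") >= minProducts)
      .drop("n_products")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-product-transactions")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc"))
      .toDF("Transaction_ID", "Product_ID")

    keepMultiProduct(df, 2).show()

    spark.stop()
  }
}
```

With the sample data only tx-1 should remain, since it is the only transaction with two products. Note that collect_list does not guarantee the order of the collected values, so the products inside the string may not always appear in the same order.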