Maximilian Kohl - 3 months ago
R Question

Why does countDistinct/n_distinct on SparkR column not work?

I want to count distinct elements of a SparkR column (of a SparkR dataframe):

df$col1
1
2
2
5
6
5


distinct elements: 1,2,5,6

When I try countDistinct on my SparkR Column, I only get this result:

> countDistinct(df$col1)
Column count(col1)


Do I have to use the agg function? I tried but failed because it doesn't seem to work on Columns.

Answer

This is the expected result. A SparkR Column is not a data container; it is just a representation of a logical operation in the execution plan. To get a result, you have to evaluate it in a specific context:

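# 2.0.0+ syntax
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(col1 = c(1, 2, 2, 5, 6, 5)))

# Evaluate the Column expression by selecting it and collecting the result
collect(select(df, countDistinct(df$col1)))
##   count(DISTINCT col1)
## 1                    4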
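
As for agg: it should work too, as long as you call it on the SparkDataFrame and pass the Column expression as an argument, rather than calling it on the Column itself. A minimal sketch, assuming SparkR 2.0.0+ and a local Spark installation; the names n, n_agg, n_alias and n_rows are only illustrative:

```r
library(SparkR)
sparkR.session()  # assumption: a local Spark installation is available

df <- createDataFrame(data.frame(col1 = c(1, 2, 2, 5, 6, 5)))

# agg() takes the SparkDataFrame first; the named argument becomes the
# name of the result column
n_agg <- collect(agg(df, n = countDistinct(df$col1)))$n

# n_distinct is an alias for countDistinct in SparkR
n_alias <- collect(agg(df, n = n_distinct(df$col1)))$n

# Alternative: deduplicate the rows first, then count them
n_rows <- count(distinct(select(df, "col1")))
```

For the sample data all three give 4. Note that the two approaches can differ when the column contains missing values: countDistinct ignores NULLs, while distinct keeps a single NULL row that count then includes.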