I want to rewrite some of my code that uses RDDs to use DataFrames instead. It was going quite smoothly until I ran into this:
.keyBy(row => (row.getServiceId, row.getClientCreateTimestamp, row.getClientId))
.reduceByKey((e1, e2) => if (e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
With DataFrames I got as far as:

.groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))
GroupedData (RelationalGroupedDataset in Spark 2.x) cannot be used directly. The data is not physically grouped; groupBy is just a logical operation. You have to apply some variant of the agg method, for example:
events
  .groupBy($"service_id", $"client_create_timestamp", $"client_id")
  .min("client_send_timestamp")

or, equivalently:

events
  .groupBy($"service_id", $"client_create_timestamp", $"client_id")
  .agg(min($"client_send_timestamp"))
where client_send_timestamp is the column you want to aggregate.
If you want to keep the remaining columns rather than just the aggregate, you can either join the aggregated result back to the original DataFrame or use window functions - see Find maximum row per group in Spark DataFrame
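As a sketch of the window-function route (column names assumed from the question; not tested against your actual schema), you can rank rows within each group by client_send_timestamp and keep only the first:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Partition by the same three columns the RDD version keyed on,
// ordering each partition by the timestamp we want the minimum of.
val w = Window
  .partitionBy("service_id", "client_create_timestamp", "client_id")
  .orderBy(col("client_send_timestamp"))

// Keep the entire earliest row per group, then drop the helper column.
val earliestPerGroup = events
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
```

Unlike groupBy(...).min(...), this preserves every column of the winning row, which matches what the RDD reduceByKey was doing.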
Spark also supports User Defined Aggregate Functions - see How can I define and use a User-Defined Aggregate Function in Spark SQL?
If you need typed, RDD-like semantics, you could use
Dataset.groupByKey, which (through mapGroups / flatMapGroups) exposes each group's rows as an iterator.
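A rough sketch of that typed alternative, assuming a hypothetical Event case class that mirrors the question's rows (reduceGroups plays the same role reduceByKey played on the RDD):

```scala
// Assumed shape of the rows; adjust field types to your actual data.
case class Event(
  serviceId: String,
  clientCreateTimestamp: Long,
  clientId: String,
  clientSendTimestamp: Long)

import spark.implicits._  // spark is your SparkSession

val earliest = events.as[Event]
  .groupByKey(e => (e.serviceId, e.clientCreateTimestamp, e.clientId))
  // Same reduction as the RDD version: keep the event sent earliest.
  .reduceGroups((e1, e2) =>
    if (e1.clientSendTimestamp <= e2.clientSendTimestamp) e1 else e2)
  .map(_._2)  // drop the key, keep the winning Event
```

Note that groupByKey shuffles full objects, so the DataFrame/window variants usually optimize better; reach for this only when you genuinely need arbitrary per-group logic.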