user6678274 user6678274 - 3 months ago 10
R Question

How to group by user and find the first action in R

In R I have a data.frame data, where head(data) gives

user action information
12 2012-01-01 12323
11 2014-03-02 24445
12 2012-02-05 32234
....


I want to create a new dataset that contains only each user and their birth, i.e. their first action. For user 12 it's 2012-01-01, for example.

In SparkR I know how to do this, but I was wondering how to do it in plain R.
In SparkR I simply did this:

new=groupBy(data, data$user)
new_data=agg(new, birth=first(data$action))
# Making it local (from a DataFrame to a data.frame)
local_new_data=collect(new_data)


Now this list can be saved as a CSV file with write.csv("...").

Thanks.

Update

I had a data set in SparkR on which I ran the SparkR code to get a list of users and their births. My problem is that I got a new computer and haven't installed SparkR on it yet (I'm still working hard on this). I simply need someone to run my code in SparkR so I can get the list. I have both the dataset and the code ready to execute. I really hope somebody can help me.

My answer

I tried to solve it a different way, and for some reason it runs very fast. Since the action column is sorted, I simply did this:

s=data[!duplicated(data$user),]


Now s contains one row per user, where action is their birth. To get only those two columns I simply do this:

ss=cbind(as.character(s$user), as.character(s$action))


This runs very fast in R for some reason.
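The duplicated() approach only picks the birth if each user's earliest action comes first in the data. As a minimal sketch, assuming action holds ISO-formatted (yyyy-mm-dd) date strings, the rows can be sorted explicitly before dropping duplicates. The sample values and the names sorted, births and out_file are made up for illustration.

```r
# Hypothetical sample data with the same columns as in the question,
# deliberately left unsorted.
data <- data.frame(
  user = c(12, 11, 12, 11),
  action = c("2012-02-05", "2014-03-02", "2012-01-01", "2015-06-07"),
  information = c(32234, 24445, 12323, 55555),
  stringsAsFactors = FALSE
)

# Sort by user, then by action date. ISO yyyy-mm-dd strings sort
# chronologically, so no date conversion is needed under that assumption.
sorted <- data[order(data$user, data$action), ]

# Keep the first row seen for each user, i.e. the earliest action.
births <- sorted[!duplicated(sorted$user), c("user", "action")]
names(births)[2] <- "birth"
print(births)

# Save the result; a temporary file stands in for the real path.
out_file <- tempfile(fileext = ".csv")
write.csv(births, file = out_file, row.names = FALSE)
```

If the data are already sorted by action, the order() step changes nothing and can be skipped.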

My question is not a duplicate; it differs considerably from the two other questions that some claim it duplicates.

Answer

In R, dplyr offers almost the same syntax: it also has a first function, used together with group_by (in place of groupBy).

library(dplyr)
data %>%
     group_by(user) %>%
     summarise(birth = first(action))

Another option is data.table:

library(data.table)
setDT(data)[, .(birth = action[1L]) , by = user]
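Both snippets take the first action per user in the current row order, so they give the birth only when the data are sorted by action. A base R sketch that does not depend on row order could take the minimum per user instead. It assumes action is stored as ISO-formatted (yyyy-mm-dd) strings, and the sample values and the names earliest and births are illustrative.

```r
# Hypothetical sample data, deliberately not sorted by action.
data <- data.frame(
  user = c(12, 11, 12),
  action = c("2012-02-05", "2014-03-02", "2012-01-01"),
  stringsAsFactors = FALSE
)

# Split the action values by user and take the minimum of each group.
# For ISO date strings the smallest string is the earliest date.
earliest <- vapply(split(data$action, data$user), min, character(1))

# Rebuild a two-column data.frame; the group labels come back as names.
births <- data.frame(
  user = as.numeric(names(earliest)),
  birth = unname(earliest),
  stringsAsFactors = FALSE
)
print(births)
```

The same idea should carry over to the dplyr version by using min(action) in place of first(action) inside summarise.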