Asep Bagja Priandana Asep Bagja Priandana - 4 months ago 14
R Question

How To Remove Duplicate Rows By Column's Value in R?

I have the data frame below



Email | Action | Time
-------------- | -------- | -------------------
a@mydomain.com | checkin | 2016-07-11 08:50:59
b@mydomain.com | checkin | 2016-07-11 08:51:42
a@mydomain.com | checkin | 2016-07-11 08:55:13 -> this is duplicate data
a@mydomain.com | checkout | 2016-07-11 19:15:20
c@mydomain.com | checkin | 2016-07-12 08:58:20
a@mydomain.com | checkin | 2016-07-12 09:15:00
c@mydomain.com | checkin | 2016-07-12 09:20:13 -> this is duplicate data
a@mydomain.com | checkout | 2016-07-12 20:20:10



I need to remove the duplicate rows with the same action value (for this case are
checkin
and
checkout
) within the same date (
YYYY-MM-DD
). It should be like this:



Email | Action | Time
-------------- | -------- | -------------------
a@mydomain.com | checkin | 2016-07-11 08:50:59
b@mydomain.com | checkin | 2016-07-11 08:51:42
a@mydomain.com | checkout | 2016-07-11 19:15:20
c@mydomain.com | checkin | 2016-07-12 08:58:20
a@mydomain.com | checkin | 2016-07-12 09:15:00
a@mydomain.com | checkout | 2016-07-12 20:20:10


I know I can
substr
the Time column to get the
YYYY-MM-DD
, but I don't have any idea how to remove the duplicate rows. Thanks in advance for your help :)

Answer

First create a column called Days to keep dates in the format YYYY-d-m. Then use the combination of Email, Action and Days to find the duplicates and remove them.

##Create the column
flights_info$Days <- as.POSIXct(strftime(flights_info$Time, format = "%Y-%d-%m %H:%M:%S"), format="%Y-%d-%m")

##Remove the duplicates
flights_info[!duplicated(flights_info[,c("Email","Action","Days")]),]

Desired output:

Email          Action   Time                Days      
 a@mydomain.com checkin  2016-07-11 08:50:59 2016-07-11
 b@mydomain.com checkin  2016-07-11 08:51:42 2016-07-11
 a@mydomain.com checkout 2016-07-11 19:15:20 2016-07-11
 c@mydomain.com checkin  2016-07-12 08:58:20 2016-07-12
 a@mydomain.com checkin  2016-07-12 09:15:00 2016-07-12
 a@mydomain.com checkout 2016-07-12 20:20:10 2016-07-12

If you wish not to have the Days column anymore, you can remove it with:

flights_info[!duplicated(flights_info[,c("Email","Action","Days")]),-ncol(flights_info)]

I hope this helps.

Comments