Syzorr - 1 month ago
R Question

Removing Duplicates From a Dataframe in R

My situation is that I am trying to clean up a data set of student results for processing, and I'm having some issues with completely removing duplicates: I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data for one of the duplicates is:

id period desc
632 1507 1101 90714 Research a contemporary biological issue
633 1507 1101 6317 Explain the process of speciation
634 1507 1101 8931 Describe gene expression
14448 1507 1201 8931 Describe gene expression
14449 1507 1201 6317 Explain the process of speciation
14450 1507 1201 90714 Research a contemporary biological issue
25884 1507 1301 6317 Explain the process of speciation
25885 1507 1301 8931 Describe gene expression
25886 1507 1301 90714 Research a contemporary biological issue


The first 2 digits of reg_period are the year they sat the paper. As can be seen, I would want to keep the rows where id is 1507 and reg_period is 1101. So far, an example of my code to get the values I want to trim is:

unique.rows <- unique(df[c("id", "period")])
dups <- unique.rows[duplicated(unique.rows$id), ]


However, there are a couple of problems I then run into. This only works because the data is ordered by id and reg_period, and that isn't guaranteed in future. Plus, I don't know how to take this list of duplicate entries and select the rows that are not in it, because %in% doesn't seem to work with it, and a loop with rbind runs out of memory.
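(For what it's worth, %in% compares vectors element-wise, not data-frame rows, which is why it doesn't work directly on a data frame of duplicates. One sketch of a loop-free workaround, using a toy data frame for illustration, is to paste the key columns into a single string per row and compare those:)

```r
# Toy data: three attempts for one student, plus the id/period pairs to drop
df <- data.frame(
  id     = c(1507, 1507, 1507),
  period = c(1101, 1201, 1301),
  desc   = c("a", "b", "c")
)
dups <- data.frame(id = c(1507, 1507), period = c(1201, 1301))

# Build one key string per row and keep rows whose key is NOT in dups
keep <- !(paste(df$id, df$period) %in% paste(dups$id, dups$period))
result <- df[keep, ]
```

This keeps only the period-1101 row in the toy data, without rbind or an explicit loop.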

What's the best way to handle this?

Answer

I would probably use dplyr. Calling your data df:

library(dplyr)

result = df %>% group_by(id) %>%
    filter(period == min(period))
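As a quick sketch of why this works, here it is run on data mirroring the question (the column names id, period, and desc are taken from your code; within each id group, filter keeps only rows whose period equals the group minimum, regardless of row order):

```r
library(dplyr)

# Sample data mirroring the question: one student, three sittings
df <- data.frame(
  id     = rep(1507, 9),
  period = rep(c(1101, 1201, 1301), each = 3),
  desc   = rep(c("Research a contemporary biological issue",
                 "Explain the process of speciation",
                 "Describe gene expression"), times = 3)
)

# Keep only each student's earliest registration period
result <- df %>%
  group_by(id) %>%
  filter(period == min(period)) %>%
  ungroup()
```

Only the three period-1101 rows survive, which is the "first attempt" you want.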

If you prefer base R, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:

id_pd = df[order(df$id, df$period), c("id", "period")]
id_pd = id_pd[!duplicated(id_pd$id), ]
result = merge(df, id_pd)
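To address your ordering worry: because this sorts by id and period before dropping duplicates, it gives the same answer even if the input rows arrive shuffled. A self-contained check on sample data shaped like the question:

```r
# Sample data shaped like the question, then shuffled to break the row order
df <- data.frame(
  id     = rep(1507, 9),
  period = rep(c(1101, 1201, 1301), each = 3),
  desc   = rep(c("Research a contemporary biological issue",
                 "Explain the process of speciation",
                 "Describe gene expression"), times = 3)
)
set.seed(1)
df <- df[sample(nrow(df)), ]

# Sort, keep the first (earliest) period per id, then inner-join back
id_pd  <- df[order(df$id, df$period), c("id", "period")]
id_pd  <- id_pd[!duplicated(id_pd$id), ]
result <- merge(df, id_pd)
```

merge() joins on the shared id and period columns, so only the period-1101 rows come back.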