Removing Duplicates From a Dataframe in R

I am trying to clean up a data set of student results for processing, and I'm having trouble completely removing duplicates. I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data, using one of the duplicated students, is:

       id    period  desc
632    1507  1101    90714 Research a contemporary biological issue
633    1507  1101    6317 Explain the process of speciation
634    1507  1101    8931 Describe gene expression
14448  1507  1201    8931 Describe gene expression
14449  1507  1201    6317 Explain the process of speciation
14450  1507  1201    90714 Research a contemporary biological issue
25884  1507  1301    6317 Explain the process of speciation
25885  1507  1301    8931 Describe gene expression
25886  1507  1301    90714 Research a contemporary biological issue

The first 2 digits of period are the year they sat the paper. As can be seen, I would want to keep only the rows where id is 1507 and period is 1101. So far, an example of my code to get the values I want to trim is:

unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])

However, there are a couple of problems I am then running into. This only works because the data is ordered by period, and this isn't guaranteed in future. Plus, I don't know how to take this list of duplicate entries and select the rows that are not in it: the filtering I tried doesn't seem to work with it, and building the result in a loop runs out of memory.

What's the best way to handle this?

Answer

I would probably use dplyr. Calling your data df:

library(dplyr)

result = df %>% group_by(id) %>%
    filter(period == min(period))

If you prefer base, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:

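# Sort by id, then period, so each id's earliest period comes first
id_pd = df[order(df$id, df$period), c("id", "period")]
# Keep only the first (earliest) id/period pair for each id
id_pd = id_pd[!duplicated(id_pd$id), ]
# Inner join on the shared columns (id and period)
result = merge(df, id_pd)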
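
If you want to stay in base R without the join, another option is ave(), which computes a per-group value for every row. This is only a sketch under assumptions: the data frame below is a small made-up sample shaped like the one in the question (the second student, id 2001, is invented for illustration), and period is assumed to be numeric so that min() picks the earliest sitting. It does not depend on the rows being sorted.

```r
# Made-up sample shaped like the question's data; id 2001 is hypothetical.
# The rows are deliberately not ordered by period.
df <- data.frame(
  id     = c(1507, 1507, 1507, 1507, 2001, 2001),
  period = c(1201, 1101, 1301, 1101, 1302, 1302),
  desc   = c("Describe gene expression",
             "Explain the process of speciation",
             "Describe gene expression",
             "Describe gene expression",
             "Explain the process of speciation",
             "Describe gene expression"),
  stringsAsFactors = FALSE
)

# For every row, ave() returns the minimum period within that row's id group.
first_period <- ave(df$period, df$id, FUN = min)

# Keep only the rows that belong to each student's earliest period.
first_attempts <- df[df$period == first_period, ]
print(first_attempts)
```

Like the dplyr version, this keeps every row from the earliest period for each id, so a student's first attempt at several papers in the same period is retained in full.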
