
Delete duplicates in large dataset based on condition

I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. The following simplified example illustrates my problem:

test <- read.table(
text = "
ID YEAR Count Value
A 1900 1 10
A 1900 2 9
A 1900 3 8
A 1900 4 7
A 1900 5 6
A 1900 6 5
A 1900 7 4
A 1900 7 10
A 1900 7 9
A 1900 8 3
A 1900 8 10
A 1900 8 9
A 2000 1 10
A 2000 2 9
A 2000 3 8
A 2000 3 10
A 2000 3 9
B 1800 1 10
B 1800 2 9
B 1800 3 8
B 1800 3 10
B 1800 3 9
B 1800 4 7
B 1900 1 10
B 1900 2 9
B 1900 2 10
B 1900 2 9
",header=TRUE)
library(data.table)
setDT(test)


In this simplified dataset, I have two individuals (A and B) observed in different, possibly overlapping years. Each row carries a Count as well as a Value.

Within each ID, YEAR and Count group, I would like to delete the observations that are duplicates and fulfill a certain condition (see below). For example, for the group:

A 1900 7 4
A 1900 7 10
A 1900 7 9


I would like to delete all observations whose Value is larger than the minimum Value within the group. In this case I would like to keep only

A 1900 7 4


as the remaining row.
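To make the rule concrete, here is a minimal check on that single group, using the test data created above:

# keep only the row(s) at the group minimum for the (A, 1900, 7) group
grp <- test[ID == "A" & YEAR == 1900 & Count == 7]
grp[Value == min(Value)]
#    ID YEAR Count Value
# 1:  A 1900     7     4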

Note that my real dataset is very large and has many more columns. Therefore, if possible, I am looking for a memory-efficient solution.

I hope that was clear enough. If not, feel free to ask for any information that is missing.

Answer

In R, you can answer this with the following:

test[, .(Value = min(Value)), by = .(ID, YEAR, Count)]

Here we go through the data and find the minimum Value for each combination of ID, YEAR, and Count. This uses the grouping syntax of the data.table package.
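Note that this aggregation returns only the grouping columns and the minimum Value, so any other columns are dropped. Since the real table has many more columns, a common data.table idiom is to compute one row index per group and subset the original table once; a minimal sketch, assuming the first row at the minimum should be kept:

# find, per (ID, YEAR, Count) group, the index of the row holding the
# (first) minimum Value, then subset the full table by those indices
idx <- test[, .I[which.min(Value)], by = .(ID, YEAR, Count)]$V1
result <- test[idx]

which.min() picks the first minimum, so exact ties within a group (such as B/1900/Count 2, where Value 9 appears twice) collapse to a single row; use .I[Value == min(Value)] instead to keep all tied rows. Because only an integer index vector is materialised per group, this stays memory-friendly on millions of rows.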
