user3032689 - 2 months ago
R Question

Delete duplicates in large dataset based on condition

I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. I constructed the following simplified example to illustrate my problem:

test <- read.table(
text = "
ID YEAR Count Value
A 1900 1 10
A 1900 2 9
A 1900 3 8
A 1900 4 7
A 1900 5 6
A 1900 6 5
A 1900 7 4
A 1900 7 10
A 1900 7 9
A 1900 8 3
A 1900 8 10
A 1900 8 9
A 2000 1 10
A 2000 2 9
A 2000 3 8
A 2000 3 10
A 2000 3 9
B 1800 1 10
B 1800 2 9
B 1800 3 8
B 1800 3 10
B 1800 3 9
B 1800 4 7
B 1900 1 10
B 1900 2 9
B 1900 2 10
B 1900 2 9
",header=TRUE)
library(data.table)
setDT(test)


In this simplified dataset, I have two individuals (A and B), observed for different but possibly overlapping years. For each, a Count is given, as well as a Value.

Within each ID, YEAR, and Count group, I would like to delete the observations that are duplicates and fulfill a certain condition (see below). For example, take the group:

A 1900 7 4
A 1900 7 10
A 1900 7 9


I would like to delete all observations whose Value is larger than the minimum Value within the group. In this case I would like to keep only

A 1900 7 4


as the remaining row.

Note that my real dataset is very large and has many more columns, so if possible I am looking for a memory-efficient solution.

I hope that was clear enough. If not, feel free to ask for any information that is missing.

Answer

In R, you can solve this with: test[, .(Value = min(Value)), by = .(ID, YEAR, Count)]

Here we group the data by each combination of ID, YEAR, and Count and keep the minimum Value per group, so each duplicate group collapses to a single row. This uses the syntax of the data.table package loaded above.
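
Note that this aggregation returns only the grouping columns and Value, so any other columns in your real dataset are dropped. Since you mention many more columns and memory efficiency, a common data.table idiom is to compute the row index of the per-group minimum with .I and subset the table once. A minimal sketch, assuming ties should keep only the first row at the minimum:

library(data.table)
setDT(test)

# .I holds global row numbers; which.min(Value) picks the position
# of the first minimum within each group, so idx collects the row
# number of the per-group minimum across the whole table.
idx <- test[, .I[which.min(Value)], by = .(ID, YEAR, Count)]$V1

# A single subset keeps every column without copying the grouped data.
result <- test[idx]

On the example data this returns one row per (ID, YEAR, Count) group, e.g. A 1900 7 4 for the group shown in the question. If all rows tied at the minimum should be kept instead, replace which.min(Value) with Value == min(Value), i.e. test[test[, .I[Value == min(Value)], by = .(ID, YEAR, Count)]$V1].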
