Daniel Daniel - 10 months ago 44
R Question

Conditionally delete rows: delete the quasi-identical rows but not the identical ones

I have a data frame that I need to clean up based on two values that are "quasi-identical" across rows. I only need to delete the observations that differ slightly, not the identical ones. I tried to do this using

but this function also deletes the identical observations.

Art<-c("Econometric Policy Evaluation: A Critique","Econometric Policy Evaluations A Critique","Econometric Policy Evaluation: A Critique", "Rules after discretion", "Expectations and the Nonneutrality of Lucas")
Art.1<-c("Notes on the Lucas Critique","Notes on the Lucas Critique","The Inconsistency of Optimal Plans","The Inconsistency","Notes on the Lucas")

The quasi-identical values above are in the first two observations, "Econometric Policy Evaluation: A Critique" and "Econometric Policy Evaluations A Critique", which differ by a single character.

In the above case, the final data frame should be (note that the identical values weren't deleted):

Id        Art                                          Id.1      Art.1
RoLu1976  Econometric Policy Evaluation: A Critique    FiKy1989  Notes on the Lucas Critique
RoLu1976  Econometric Policy Evaluation: A Critique    BeBe1983  The Inconsistency of Optimal Plans
AlBl1989  Rules after discretion                       JoSt1989  The Inconsistency
ThSa1996  Expectations and the Nonneutrality of Lucas  JoSt1990  Notes on the Lucas

What I did was this:

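yy = NULL
for (i in seq_along(N$Art)) {
  # titles in the column that approximately match row i's title
  temp = agrep(N[i, "Art"], N$Art, value = TRUE)
  # when an exact copy is among the matches, take the first match as the canonical title
  y = ifelse(any(N[i, "Art"] == temp), temp[1], N[i, "Art"])
  yy = c(yy, y)
}
# overwrite the titles with their canonical form, then drop repeated titles
N$Art = yy
N.2 = N[!duplicated(N$Art), ]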

But it deletes both kinds of rows: the identical and the quasi-identical ones.

How can I do it?

Answer

You could store the indices of things that are identical in the original Art column, and use that in combination with the results after de-duplication, e.g.

originallyDuplicated <- duplicated(N$Art)
# then run your snippet to generate `yy`

So you want to get rid of things that are duplicated now, but not originally.

N[!(duplicated(yy) & !originallyDuplicated),]
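Putting the two steps together, here is a minimal sketch of the idea above, assuming the data frame is built only from the Art and Art.1 vectors in the question (the Id columns are left out because their full contents are not shown). It reuses the asker's agrep loop to build yy but does not overwrite N$Art, so the kept rows retain their original titles; the name result is only illustrative.

```r
# Sample data from the question (Id columns omitted: an assumption of this sketch)
Art <- c("Econometric Policy Evaluation: A Critique",
         "Econometric Policy Evaluations A Critique",
         "Econometric Policy Evaluation: A Critique",
         "Rules after discretion",
         "Expectations and the Nonneutrality of Lucas")
Art.1 <- c("Notes on the Lucas Critique", "Notes on the Lucas Critique",
           "The Inconsistency of Optimal Plans", "The Inconsistency",
           "Notes on the Lucas")
N <- data.frame(Art, Art.1, stringsAsFactors = FALSE)

# Remember which rows were exact duplicates before any fuzzy matching
originallyDuplicated <- duplicated(N$Art)

# The asker's loop: map each title to the first title that agrep() matches
yy <- NULL
for (i in seq_along(N$Art)) {
  temp <- agrep(N[i, "Art"], N$Art, value = TRUE)
  y <- ifelse(any(N[i, "Art"] == temp), temp[1], N[i, "Art"])
  yy <- c(yy, y)
}

# Drop rows that are duplicated only after fuzzy matching, not originally
result <- N[!(duplicated(yy) & !originallyDuplicated), ]
print(result)
```

With the sample titles, this should keep rows 1, 3, 4 and 5 and drop only row 2, the near-duplicate. Note that agrep uses its default max.distance here, so how "close" two titles must be is whatever that default allows.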

Though to me it seems that, rather than basing your exclusion criteria purely on the Art column, it would make more sense to exclude a row only if every column in the row was duplicated (or almost duplicated) elsewhere in the table (e.g. compare on the Art.1, Id.1, Id, etc. columns too).
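One way that whole-row comparison might look is sketched below. It is only an illustration under stated assumptions: canonical is a hypothetical helper (not a base R function), cols is an illustrative name, and only the Art and Art.1 columns from the question are compared, since the full Id columns are not shown.

```r
# Sample data from the question (Id columns omitted: an assumption of this sketch)
Art <- c("Econometric Policy Evaluation: A Critique",
         "Econometric Policy Evaluations A Critique",
         "Econometric Policy Evaluation: A Critique",
         "Rules after discretion",
         "Expectations and the Nonneutrality of Lucas")
Art.1 <- c("Notes on the Lucas Critique", "Notes on the Lucas Critique",
           "The Inconsistency of Optimal Plans", "The Inconsistency",
           "Notes on the Lucas")
N <- data.frame(Art, Art.1, stringsAsFactors = FALSE)

# Hypothetical helper: replace each value by the first value in the same
# vector that agrep() reports as an approximate match
canonical <- function(x) {
  vapply(x, function(p) agrep(p, x, value = TRUE)[1],
         character(1), USE.NAMES = FALSE)
}

cols <- c("Art", "Art.1")  # columns to compare; extend as needed
fuzzy <- as.data.frame(lapply(N[cols], canonical), stringsAsFactors = FALSE)

exactDup <- duplicated(N[cols])  # whole row repeated exactly
fuzzyDup <- duplicated(fuzzy)    # whole row repeated after fuzzy matching

# Drop a row only when it is a near-duplicate of an earlier row in every column
result <- N[!(fuzzyDup & !exactDup), ]
print(result)
```

Be aware that agrep matches the pattern approximately anywhere inside each candidate string, so a short title such as "The Inconsistency" also matches "The Inconsistency of Optimal Plans". Requiring all compared columns to agree limits the effect of such loose matches, but a stricter distance (for example via adist) may be preferable for real data.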