Daniel Daniel - 2 months ago 6
R Question

Conditionally delete rows: delete the quasi-identical rows but not the identical ones

I have a data frame that I need to clean up based on values that are "quasi-identical" across rows. I only want to delete the observations that differ slightly, not the ones that are identical. I tried to do this using agrep, but this function also deletes the identical observations.

Id<-c("RoLu1976","Rolu1976","RoLu1976","AlBl1989","ThSa1996")
Art<-c("Econometric Policy Evaluation: A Critique","Econometric Policy Evaluations A Critique","Econometric Policy Evaluation: A Critique", "Rules after discretion", "Expectations and the Nonneutrality of Lucas")
Id.1<-c("FiKy1989","FiKy1989","BeBe1983","JoSt1989","JoSt1990")
Art.1<-c("Notes on the Lucas Critique","Notes on the Lucas Critique","The Inconsistency of Optimal Plans","The Inconsistency","Notes on the Lucas")
N<-data.frame(Id,Art,Id.1,Art.1)


The quasi-identical values in the data frame above are in the Art column, in the first two observations, which differ only by an "s" and a ":".

In the above case the final data frame should be as follows (note that the identical values weren't deleted):

Id        Art                                          Id.1      Art.1
RoLu1976  Econometric Policy Evaluation: A Critique    FiKy1989  Notes on the Lucas Critique
RoLu1976  Econometric Policy Evaluation: A Critique    BeBe1983  The Inconsistency of Optimal Plans
AlBl1989  Rules after discretion                       JoSt1989  The Inconsistency
ThSa1996  Expectations and the Nonneutrality of Lucas  JoSt1990  Notes on the Lucas


What I did was this:

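yy = NULL
for (i in 1:length(N$Art)) {
  # all titles that approximately match the title in row i
  temp = agrep(N[i, "Art"], N$Art, value = TRUE)
  # use the first approximate match as the representative title
  y = ifelse(any(N[i, "Art"] == temp), temp[1], N[i, "Art"])
  yy = c(yy, y)
}
N$Art = yy
N.2 = N[!duplicated(N$Art), ]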


But it deletes both kinds of rows: the identical ones and the quasi-identical ones.

How can I do it?

Answer

You could store the indices of things that are identical in the original Art column, and use that in combination with the results after de-duplication, e.g.

originallyDuplicated <- duplicated(N$Art)
# then run your snippet to generate `yy`

So you want to get rid of things that are duplicated now, but not originally.

N[!(duplicated(yy) & !originallyDuplicated),]
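
Putting the steps above together, a minimal end-to-end sketch on the sample data from the question might look like this. The only additions are `stringsAsFactors = FALSE` (an assumption, so that `Art` stays a character vector on R versions before 4.0) and the fact that `N$Art` is left untouched rather than overwritten with `yy`.

```r
# Sample data from the question.
# stringsAsFactors = FALSE is an addition (assumption) for R < 4.0.
Id    <- c("RoLu1976", "Rolu1976", "RoLu1976", "AlBl1989", "ThSa1996")
Art   <- c("Econometric Policy Evaluation: A Critique",
           "Econometric Policy Evaluations A Critique",
           "Econometric Policy Evaluation: A Critique",
           "Rules after discretion",
           "Expectations and the Nonneutrality of Lucas")
Id.1  <- c("FiKy1989", "FiKy1989", "BeBe1983", "JoSt1989", "JoSt1990")
Art.1 <- c("Notes on the Lucas Critique",
           "Notes on the Lucas Critique",
           "The Inconsistency of Optimal Plans",
           "The Inconsistency",
           "Notes on the Lucas")
N <- data.frame(Id, Art, Id.1, Art.1, stringsAsFactors = FALSE)

# Step 1: remember which rows were exact duplicates BEFORE any fuzzy matching
originallyDuplicated <- duplicated(N$Art)

# Step 2: the loop from the question, mapping each title to its
# first approximate match in the column
yy <- character(0)
for (i in seq_along(N$Art)) {
  temp <- agrep(N[i, "Art"], N$Art, value = TRUE)
  y <- ifelse(any(N[i, "Art"] == temp), temp[1], N[i, "Art"])
  yy <- c(yy, y)
}

# Step 3: drop rows that are duplicated now but were not exact duplicates originally
N.2 <- N[!(duplicated(yy) & !originallyDuplicated), ]
print(N.2)
```

On the sample data this should keep rows 1, 3, 4 and 5, which is the result the question asks for.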

Though to me it seems that, rather than basing your exclusion criteria purely on the Art column, it would make more sense to exclude a row only if every column in that row is duplicated (or almost duplicated) elsewhere in the table (e.g. compare on the Art.1, Id.1 and Id columns too).
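
One way to sketch that whole-row comparison is shown below, reusing the same "first approximate match" idea for every column. This is an illustration under assumptions rather than a definitive method: `canonical`, `cols`, `exactKey`, `fuzzyKey` and `keep` are hypothetical names, the `"||"` separator is assumed not to occur in the data, and `agrep` is used with its default distance settings.

```r
# Sample data from the question (stringsAsFactors = FALSE is an assumption for R < 4.0)
N <- data.frame(
  Id    = c("RoLu1976", "Rolu1976", "RoLu1976", "AlBl1989", "ThSa1996"),
  Art   = c("Econometric Policy Evaluation: A Critique",
            "Econometric Policy Evaluations A Critique",
            "Econometric Policy Evaluation: A Critique",
            "Rules after discretion",
            "Expectations and the Nonneutrality of Lucas"),
  Id.1  = c("FiKy1989", "FiKy1989", "BeBe1983", "JoSt1989", "JoSt1990"),
  Art.1 = c("Notes on the Lucas Critique",
            "Notes on the Lucas Critique",
            "The Inconsistency of Optimal Plans",
            "The Inconsistency",
            "Notes on the Lucas"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each value by its first approximate match in the column
canonical <- function(x) {
  x <- as.character(x)
  vapply(x, function(v) agrep(v, x, value = TRUE)[1],
         character(1), USE.NAMES = FALSE)
}

cols <- c("Id", "Art", "Id.1", "Art.1")  # columns to compare on

# One key per row from the exact values, one from the fuzzy-matched values
exactKey <- do.call(paste, c(unname(as.list(N[cols])), sep = "||"))
fuzzyKey <- do.call(paste, c(unname(lapply(N[cols], canonical)), sep = "||"))

# Drop a row only if the whole row is a near-duplicate but not an exact duplicate
keep <- !(duplicated(fuzzyKey) & !duplicated(exactKey))
print(N[keep, ])
```

Note that `agrep` looks for approximate matches within each string, so a short title such as "The Inconsistency" also matches "The Inconsistency of Optimal Plans". Comparing on all columns at once is what keeps such rows apart in this sketch.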