Daniel - 2 months ago 8

# R - How to delete two quasi-identical rows of a data frame?

I have a data frame that I need to clean up according to two variables, but the values of those variables are "quasi-identical" across rows. That is, a value may contain a `-`, a `'`, an `s`, a `:` or a space in one row but not in another. I tried `unique()`, but that function only works with identical values. Suppose we have this `data.frame`:

``````
Id <- c("RoLu1976", "Rolu1976", "AlBl1989", "ThSa1996")
Art <- c("Econometric Policy Evaluation: A Critique", "Econometric Policy Evaluations A Critique", "Rules after discretion", "Expectations and the Nonneutrality of Lucas")
Id.1 <- c("FiKy1989", "EdPr1986", "BeBe1983", "JoSt1989")
Art.1 <- c("Notes on the Lucas Critique", "Notes on the Lucas Critique", "The Inconsistency of Optimal Plans", "The Inconsistency of Optimal Plans")
N <- data.frame(Id, Art, Id.1, Art.1)
``````

The quasi-identical values are in the variable `Art`, in the first two observations, which differ only by an `s` and a `:`. How can I filter out and delete this kind of value?

Based on your data, I used `agrep` to match similar strings:

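``````
yy <- character(0)
for (i in seq_along(N$Art)) {
  # Titles may be stored as factors (older R versions), so coerce to character
  title <- as.character(N[i, "Art"])
  # All titles that approximately match the current one
  temp <- agrep(title, N$Art, value = TRUE)
  # Use the first approximate match as the common spelling
  y <- if (any(title == temp)) temp[1] else title
  yy <- c(yy, y)
}
``````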

Then I replaced `N$Art` with `yy`, which allows you to use `duplicated()` or `unique()`:

``````
N$Art <- yy
N.2 <- N[!duplicated(N$Art), ]
``````
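
The same idea can be sketched without overwriting the original titles, by building a separate matching key and filtering on it. This is only a minimal sketch: `canonicalise` and `key` are hypothetical names introduced here, the data frame is cut down to the two columns that matter, and the `max.distance` value of 0.1 is simply the `agrep` default, which may need tuning for other data.

```r
# Rebuild the relevant part of the example data (titles kept as character).
N <- data.frame(
  Id  = c("RoLu1976", "Rolu1976", "AlBl1989", "ThSa1996"),
  Art = c("Econometric Policy Evaluation: A Critique",
          "Econometric Policy Evaluations A Critique",
          "Rules after discretion",
          "Expectations and the Nonneutrality of Lucas"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map every string to the first element of x
# that agrep() reports as an approximate match for it.
canonicalise <- function(x, max.distance = 0.1) {
  vapply(x, function(s) {
    hits <- agrep(s, x, max.distance = max.distance, value = TRUE)
    # Fall back to the string itself if nothing matches
    if (length(hits) > 0) hits[1] else s
  }, character(1), USE.NAMES = FALSE)
}

# Deduplicate on the key, leaving the Art column untouched
key <- canonicalise(N$Art)
N.2 <- N[!duplicated(key), ]
```

Note that `agrep` looks for approximate matches of the pattern within each string, so a short title could also match a longer title that contains it; it is worth checking `key` against `N$Art` before dropping rows.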