DataTx DataTx - 3 months ago 8
R Question

Subset rows by partially matching two columns for words greater than n characters

I would be open to doing this in python pandas, but in R I have the following df:

result<-structure(list(traffic_Count_Street = c("San Angelo", "W Commerce St",
"W Commerce St", "S Gevers St", "Austin Hwy", "W Evergreen St"
), unit_Street = c("San Pedro Ave", "W Commerce", "W Commerce",
"S New Braunfels", "Austin Highway", "W Cypress")), .Names = c("traffic_Count_Street",
"unit_Street"), row.names = c(1L, 17L, 18L, 34L, 260L, 273L), class = "data.frame")

1 San Angelo San Pedro Ave
17 W Commerce St W Commerce
18 W Commerce St W Commerce
34 S Gevers St S New Braunfels
260 Austin Hwy Austin Highway
273 W Evergreen St W Cypress


For each row I want to partially match column 1 to 2 if one of the words, greater than 3 characters, matches another.

I would remove:

1 San Angelo San Pedro Ave
34 S Gevers St S New Braunfels
273 W Evergreen St W Cypress


and keep:

17 W Commerce St W Commerce
18 W Commerce St W Commerce
260 Austin Hwy Austin Highway


I tried using
stringR
in the following way but it did not work:

result$unit_Street[str_detect(result$traffic_Count_Street, "\\w{3}")]

Answer

Create a distance filter that has a threshold adjustment. Then you can adjust until you are getting the results you want. A Levenshtein distance of 5 worked well in this case:

distanceFilter <- function(df, thresh=5) {
  ind <- apply(df, 1, function(x) adist(x[1], x[2]) < thresh )
  df[ind,]
}

distanceFilter(result, 5)
#     traffic_Count_Street    unit_Street
# 17         W Commerce St     W Commerce
# 18         W Commerce St     W Commerce
# 260           Austin Hwy Austin Highway

To learn more, see the wiki page and the R doc help page

Comments