djturbine - 3 months ago
R Question

Cleaning Street Addresses in Text Mining

I'm looking for a way to remove street addresses from the text I currently have. Is there a regular expression that can match the text between two numbers? My thinking is that an address usually starts with a street number and ends with a ZIP code:

1234 Parks St., Los Angeles, CA 90001

My main issue is that I want to remove the street name from my dataset while I do my other cleaning and look for other words within my set.

I am using RStudio to do the cleaning.

42-
Answer

With test holding a character vector of address strings like the one in the question, this returns a character vector. Read the regex as three capture groups, marked by the parentheses: the first is any run of consecutive digits (possibly empty), the second is any run of non-digits, and the third is exactly 5 digits. The replacement returns only the first and third groups with a space in between if there is a match, and leaves the string unchanged if there is none:

> gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test)
[1] "1234 90001" "9876 94501"
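
To see what each capture group picks up, the groups can also be pulled out one at a time with regexec() and regmatches(). This is only a minimal sketch: the addr vector is an assumption (its first entry is the address from the question, the second is invented for illustration), and the names pattern, parts, street_number and zip are hypothetical.

```r
# Hypothetical sample data: the first address is from the question,
# the second is invented for illustration.
addr <- c("1234 Parks St., Los Angeles, CA 90001",
          "9876 Example Ave., Alameda, CA 94501")

pattern <- "([0-9]*)(\\D*)(\\d{5})"

# regexec() records the position of the full match and of each capture group;
# regmatches() extracts the corresponding substrings.
parts <- regmatches(addr, regexec(pattern, addr))

# Each list element holds: full match, group 1, group 2, group 3.
street_number <- vapply(parts, function(p) p[2], character(1), USE.NAMES = FALSE)
zip           <- vapply(parts, function(p) p[4], character(1), USE.NAMES = FALSE)

print(street_number)
print(zip)
```

Here group 2 (the street name, city and state) is the part that the "\\1 \\3" replacement drops.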

It would need further parsing to return the two fields as separate vectors:

> scan( text=gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test), what=list("", "") )
Read 2 records
[[1]]
[1] "1234" "9876"

[[2]]
[1] "90001" "94501"

It is probably better to read in the ZIPs as character, but the street numbers could be converted to numeric:

> scan( text=gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test), what=list( numeric(), "") )
Read 2 records
[[1]]
[1] 1234 9876

[[2]]
[1] "90001" "94501"
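
One reason to keep ZIPs as character is that some start with a zero, which a numeric conversion silently drops. A small sketch of the effect, using made-up ZIP values (the zips, as_number and restored names are hypothetical):

```r
# Hypothetical ZIP codes; "02134" starts with a zero.
zips <- c("90001", "02134")

# Converting to numeric loses the leading zero.
as_number <- as.numeric(zips)
print(as_number)

# Padding back to five digits recovers the original text form.
restored <- sprintf("%05d", as.integer(as_number))
print(restored)
```

Keeping the column as character from the start avoids the need for the padding step.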

To make this more useful, wrap the result in a data frame with named columns:

> setNames( data.frame( scan( text=gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test), 
                              what=list( numeric(), "") ) , 
                       stringsAsFactors=FALSE), 
            c( "StrtNumber", "ZIP") )
Read 2 records
  StrtNumber   ZIP
1       1234 90001
2       9876 94501
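
If the goal is to strip the whole address out of surrounding text, as the question asks, a similar pattern with an empty replacement might work. This is a sketch under a strong assumption: every address starts with a street number, ends with a 5-digit ZIP, and has no other digits in between (so apartment numbers or ZIP+4 codes would need a different pattern). The txt vector and the cleaned name are hypothetical.

```r
# Hypothetical free text; the address itself is the one from the question.
txt <- c("Ship to 1234 Parks St., Los Angeles, CA 90001 by Friday",
         "no address here")

# Assumption: street number, then any run of non-digits, then a 5-digit ZIP.
cleaned <- gsub("[0-9]+\\D*\\d{5}", "", txt)

# Collapse the whitespace left behind where the address was removed.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

Strings without a match, like the second element, pass through unchanged.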