Rajarshi Bhadra - 2 months ago
R Question

Regex to extract from string in R

I have a string

string <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"


I am using the code

library(stringr)
str_extract(string,"[A-Z]\\w+")


This gives me the following result:

> str_extract(string,"[A-Z]\\w+")
[1] "Amar"


However, I want "Amar Akbar Anthony" as my output. How should I change my regex to get it?

Answer

Note that your regex does not allow spaces, so the match stops after the first word. Add whitespace to the character class, as in [\\w\\s]:

"[A-Z][\\w\\s]+"
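
As a quick check, the adjusted pattern can be run against the string from the question. This is a minimal sketch that swaps str_extract for base R's regmatches and regexpr with perl = TRUE, so it runs without extra packages; the variable names pattern and title are illustrative and not from the original post.

```r
# The string from the question, written as a valid R string literal
string <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"

# An uppercase letter followed by one or more word or whitespace characters
pattern <- "[A-Z][\\w\\s]+"

# perl = TRUE so that \w and \s are understood inside the character class;
# with stringr loaded, str_extract(string, pattern) is the equivalent call
title <- regmatches(string, regexpr(pattern, string, perl = TRUE))
print(title)
# [1] "Amar Akbar Anthony"
```

The match ends at the first character that is neither a word character nor whitespace (here the < of </a>), so a title containing punctuation such as an apostrophe or a colon would be cut short by this pattern.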

Also, if your string is always in the format above, you do not even need the stringr library; a base R gsub is enough:

s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"
trimws(gsub("<[^>]+>","",s))
[1] "Amar Akbar Anthony"

The gsub("<[^>]+>","",s) call removes all tags (opening, closing, and so on), and trimws then strips the leftover leading and trailing whitespace.
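
Stripping every tag keeps the text of all elements, which works here only because the a element is the only one with visible text. If the cell could hold other text, one alternative is a capture group limited to the a element. The following is a minimal sketch using base R's regexec and regmatches; it assumes a single, un-nested a tag, and the names m and link_text are illustrative.

```r
s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"

# Capture the text between the opening <a ...> tag and the closing </a>
m <- regexec("<a[^>]*>([^<]*)</a>", s)

# regmatches returns a list with one character vector per input string:
# element 1 is the full match, element 2 is the first capture group
link_text <- regmatches(s, m)[[1]][2]
print(link_text)
# [1] "Amar Akbar Anthony"
```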

Or use the XML package to parse the HTML and grab the text of the a tags:

> library("XML")
> s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"
> parsed_doc = htmlParse(s, useInternalNodes = TRUE)
> res <- getNodeSet(doc = parsed_doc, path = "//a/text()")
> plain_text <- sapply(res, xmlValue)
> plain_text
[1] "Amar Akbar Anthony"