theK_S - 1 month ago
R Question

String replacement using sub function

I am attempting to extract the names of NBA players from a column in a database. However, the names in the names column have the following format:

"LeBron James\\jamesle01"

I used the following regular expression inside the sub function to try to keep only the name portion:

sub("([A-Z]\\w+\\s*-*'*[a-z]*\\s*\\.*|[A-Z]\\.\\s*)\\*\\*[a-z]*\\d*\\d*", replacement = "\\1", x = nba_salaries$Names)


The expression is meant to account for unusual names that contain more than just alphanumeric characters (e.g. Michael Kidd-Gilchrist, De'Andre Jordan, Luc Mbah a Moute).

However, when I run the following,

head(nba_salaries$Names)


The names end up being in the same format.

I have used regexr.com to check that the regular expression matches the strings properly.

Answer

How about this: you can split the text on the "\\" string and then take only the first element:

text <- c( "LeBron James\\jamesle01", "Michael Jordan\\jamesle01" )

sapply( strsplit( text, "\\\\" ), "[", 1 )

Which gives

[1] "LeBron James"   "Michael Jordan"
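
Applied to the column from the question, the result also has to be assigned back, because R functions such as sub and sapply return a new vector rather than changing their input. The following is only a sketch: the small nba_salaries data frame is a stand-in built from the sample strings above, raw_names and sub_names are names made up for illustration, and the sub pattern is one possible alternative rather than a fix of the original expression.

```r
# Stand-in for the nba_salaries data frame from the question
# (the values reuse the sample strings from this answer)
nba_salaries <- data.frame(
  Names = c("LeBron James\\jamesle01", "Michael Jordan\\jamesle01"),
  stringsAsFactors = FALSE
)
raw_names <- nba_salaries$Names

# Neither sub() nor sapply() changes its input; store the result back in the column
nba_salaries$Names <- sapply(strsplit(raw_names, "\\\\"), "[", 1)
head(nba_salaries$Names)

# A sub() variant: drop everything from the first backslash to the end of the string
sub_names <- sub("\\\\.*$", "", raw_names)
identical(sub_names, nba_salaries$Names)
```

Both approaches keep whatever comes before the backslash, so hyphens, apostrophes and multi-word names do not need special handling in the pattern.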

To explain: "[" is a function, and here it is called by sapply. We pass the result of strsplit as the X argument of sapply and apply the [ function to each element, with the argument 1, to take the first element. Here's another way to put it:

text <- strsplit( text, "\\\\" )

This returns a list with one character vector per input string. In each vector, the first element is the text before the "\\" string and the second element is the text after it. Then we use the "[" function, passing the argument 1, to take the first element of each of those vectors:

text <- sapply( X = text, FUN = "[", 1 )
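
To make the point about "[" being a function concrete, here is a small sketch. The variable names parts, players and first_parts are made up for illustration, and the vapply call is an optional stricter variant of sapply, not part of the approach above.

```r
# Split one sample string into the parts before and after the backslash
parts <- strsplit("LeBron James\\jamesle01", "\\\\")[[1]]

# Ordinary indexing and calling `[` as a function give the same result
identical(parts[1], `[`(parts, 1))

# vapply() works like sapply() but also checks that every result is a single string
players <- c("LeBron James\\jamesle01", "Michael Jordan\\jamesle01")
first_parts <- vapply(strsplit(players, "\\\\"), `[`, character(1), 1)
first_parts
```

With vapply, an element that does not yield exactly one string raises an error instead of silently changing the shape of the result.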