beddotcom beddotcom - 10 months ago 41
R Question

Obtaining a count of phrases enclosed in parentheses and containing a specific character

There must be a simple answer to this, but I'm new to regex and couldn't find one.

I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:

regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]

The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) always have this format: an opening parenthesis, a word beginning with a capital letter, possibly spaces and other words, 'v.', another word beginning with a capital letter, possibly more spaces and words, and a closing parenthesis.

I'd then like to create a new column containing counts of x v. y phrases in each row.

Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).

Thanks for helping a newbie!

Answer

I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.

df <- data.frame(text = c("(Roe v. Wade) is not about boats", 
                          "(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time", 
                          "I am trying to confuse you (this is not a court case)", 
                          "this one is also confusing (But with Capital Letters)", 
                          "this is confusing (With Capitols and v. d)"), 
                 stringsAsFactors = FALSE)

The regular expression I think you want is:

# Lookbehind for "(", then Capital ... " v. " Capital ..., then lookahead for ")"
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))", 
                    df$text, perl = TRUE))
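One caveat with the pattern above: `.` also matches parentheses, so an unrelated parenthetical that comes before a case (or before a stray "v.") can be swallowed into a single match. A possible stricter variant, sketched here, replaces `.*?` with `[^()]*?` so that a match can never cross a parenthesis. The names `test_text`, `strict_pattern`, `strict_cases` and `strict_counts` are made up for illustration, and the test strings are invented:

```r
# Hypothetical test strings (not from the question's data).
# The second one contains no parenthesised court case at all.
test_text <- c("(See Above) and (Roe v. Wade)",
               "(See Above) he argued v. Badly (again)")

# [^()] matches any character except a parenthesis,
# so each match has to stay inside a single pair of parentheses.
strict_pattern <- "(?<=\\()[[:upper:]][^()]*? v\\. [[:upper:]][^()]*?(?=\\))"

strict_cases <- regmatches(test_text,
                           gregexpr(strict_pattern, test_text, perl = TRUE))

# Number of matches per string
strict_counts <- lengths(strict_cases)
```

With the original pattern, the first string would yield the single match "See Above) and (Roe v. Wade" and the second would be counted as one case; the stricter pattern is meant to return "Roe v. Wade" for the first and nothing for the second.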

You can then get the number of cases and add it to your data frame with:

df$numCases <- vapply(cases, length, numeric(1))
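As a quick sanity check, here is the whole thing run end to end on a copy of the example data. This sketch uses `lengths()`, a base R shortcut (R 3.2.0 or later) that should give the same counts as the `vapply` call; the names `check_df` and `check_cases` are only for illustration:

```r
# Copy of the example data frame from above, under a separate name
check_df <- data.frame(
  text = c("(Roe v. Wade) is not about boats",
           "(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
           "I am trying to confuse you (this is not a court case)",
           "this one is also confusing (But with Capital Letters)",
           "this is confusing (With Capitols and v. d)"),
  stringsAsFactors = FALSE)

# Same regular expression as in the answer
check_cases <- regmatches(
  check_df$text,
  gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
           check_df$text, perl = TRUE))

# lengths() returns the length of each list element as an integer vector
check_df$numCases <- lengths(check_cases)

print(check_df$numCases)
# 1 2 0 0 0
```

Only the first two rows contain phrases of the form "Capital ... v. Capital ..." inside parentheses, so the other three rows get a count of 0.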

As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
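If the italics do survive as markup in the text, the same approach should carry over. The sketch below assumes, purely for illustration, that italic phrases are wrapped in HTML `<i>...</i>` tags; your data may use something else entirely (or nothing), and the names `italic_text`, `italic_pattern`, `italic_cases` and `italic_counts` are hypothetical:

```r
# Hypothetical input: ASSUMES italics are kept as HTML <i> tags.
italic_text <- c("See <i>Roe v. Wade</i> and <i>Plessy v. Ferguson</i>.",
                 "An <i>emphasised</i> word is not a case.")

# Same idea as the parenthesis pattern, but anchored on <i> and </i>.
# [^<]*? stops a match from running past the closing tag.
italic_pattern <- "(?<=<i>)[[:upper:]][^<]*? v\\. [[:upper:]][^<]*?(?=</i>)"

italic_cases <- regmatches(italic_text,
                           gregexpr(italic_pattern, italic_text, perl = TRUE))

# Number of italicised "x v. y" phrases per string
italic_counts <- lengths(italic_cases)
```

For other markup (for example `*...*` or `\emph{...}`), only the lookbehind, the lookahead and the excluded character class would need to change.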