jun.yoon77 jun.yoon77 - 1 month ago 16
R Question

Regex negative lookbehind in R

I'm trying to write a regex with a negative lookbehind in R.

Basically, I have text data that looks something like this:

See item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data.


I want to select everything from the "Item 7" that comes right after the first "BlahBlahBlah." sentence through "Item 8 Financial Statements and Supplementary Data."

So I want

Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data.


which is everything except for the sentence that contains "see item 7 Management's Discussion and Analysis"

Right now, I'm working with this code:

(?<!see)Item 7(.*?)Item 8


But it's not returning what I want.

My logic is to not look at sentences that contain the word "see" followed by "item 7 Management's Discussion and Analysis" but it doesn't seem to be working.

https://regex101.com/r/yF7aQ1/3

Is there a way I can implement this negative lookbehind?

Answer

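I'm not sure how you are applying it in R, but .*(?<!See) (item 7 .*) works with sub. Be careful with the space after "See": the lookbehind only inspects the characters immediately before the current position, so the space has to be accounted for, here by matching it right after the lookbehind. Letter case can be ignored with the ignore.case argument, and perl = TRUE is required because R's default regex engine does not support lookbehind.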

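s <- "See item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data."
sub(".*(?<!See) (item 7 .*)", "\\1", s, ignore.case = TRUE, perl = TRUE)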

# [1] "Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data."
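To illustrate the point about the space, here is a minimal sketch comparing a lookbehind without and with the trailing space. It uses regexpr only to report where the first match starts; the variable names are illustrative and not part of the original answer.

```r
s <- "See item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data."

# Without the space: the three characters before the first "item 7" are "ee ",
# not "See", so the lookbehind does not reject the first occurrence.
pos_no_space <- regexpr("(?<!see)item 7", s, ignore.case = TRUE, perl = TRUE)

# With the space: "See " sits directly before the first "item 7", so that
# occurrence is rejected and the match starts at the second one.
pos_with_space <- regexpr("(?<!see )item 7", s, ignore.case = TRUE, perl = TRUE)

# Show the text starting at each match position.
substr(s, pos_no_space, pos_no_space + 20)
substr(s, pos_with_space, pos_with_space + 20)
```

The first pattern matches at the very first "item 7" (character 5), while the second skips ahead to the "Item 7" that follows "BlahBlahBlah.".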

Another alternative:

sub(".*(?=(?<!See) ?item 7)", "", s, ignore.case = TRUE, perl = TRUE)
# [1] "Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data."
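
If the goal is to pull out the "Item 7" through "Item 8" span directly rather than strip the leading text, a possible sketch uses regexpr with regmatches. The pattern below is an assumption built on the sample text: it puts the space inside the lookbehind and treats the section as ending at the first period after "Item 8".

```r
s <- "See item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data."

# (?<!see )   reject an "item 7" that directly follows "see " (any case)
# item 7.*?   lazily consume text up to the next "item 8"
# [^.]*\\.    then continue to the end of that sentence
pattern <- "(?<!see )item 7.*?item 8[^.]*\\."

m <- regexpr(pattern, s, ignore.case = TRUE, perl = TRUE)
section <- regmatches(s, m)
section
```

On the sample string this yields the same text as the two sub calls above. Real filings may need a different end condition than "first period after Item 8".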