Belphegor Belphegor - 5 months ago 43
Python Question

Python regex: XOR operator

Suppose I have strings like these:


  1. "DT NN IN NN"

  2. "DT RB JJ NN"

  3. "DT JJ JJ NN"

  4. "DT RB RB NN NN"

  5. "DT RB RB"



So, I have a list of strings:

strings = ["DT NN IN NN", "DT RB JJ NN", "DT JJ JJ NN", "DT RB RB NN NN", "DT RB RB"]


I have the following code:

import re

pattern = r"(?:DT\s+)+([?:RB\s+|?:JJ\s+])+(?:NN\s+)*NN$"
# Print every string that the pattern matches from the start
for item in strings:
    m = re.match(pattern, item)
    if m:
        print(item)


What I want from pattern is to match the strings that start with DT (appearing one or more times), then have either RB or JJ (appearing one or more times, but not both), and end with NN (again, appearing one or more times).

So, in the final result I should get 3 and 4 printed on the screen. However, with my regex I also get 2, which I don't want. How do I change pattern so that this works? How do I replace the pipe (OR) with an XOR?

Answer

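The problem is in how you define the presence of RB and JJ. Square brackets create a character class, so [?:RB\s+|?:JJ\s+] matches any single character from that set rather than the group RB\s+ or JJ\s+, and the outer + lets RB and JJ mix freely. Your pattern never states that only one of them may be present. To do that, separate them with a | (pipe) inside a group and let each alternative repeat one or more times (+). Try changing your pattern to this: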

pattern = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"
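As a quick check, the pattern above can be run against the five strings from the question. This is only a minimal sketch: the names samples and matched are made up for the illustration, and the strings are numbered 1 to 5 in the order the question lists them.

```python
import re

# Sample strings from the question, numbered 1-5 in order.
samples = [
    "DT NN IN NN",
    "DT RB JJ NN",
    "DT JJ JJ NN",
    "DT RB RB NN NN",
    "DT RB RB",
]

# Either a run of RB or a run of JJ, never a mix of the two.
pattern = re.compile(r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$")

# Collect the 1-based numbers of the strings that match.
matched = [i for i, s in enumerate(samples, start=1) if pattern.match(s)]
print(matched)  # [3, 4]
```

String 2 is rejected because once the RB run has been consumed, the remaining "JJ NN" cannot be matched by the NN part, and the JJ alternative cannot start at "RB".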

Besides, (?:<something>) is called a non-capturing group. You use it to say "I want <something> to be matched, but not included when I select groups later." By the looks of it, you are not using any groups: you simply print the whole item (unless you've trimmed the code for brevity). If you really do not need groups, here is a simpler version that would work for you:

pattern = r"(DT\s+)+((RB\s+)+|(JJ\s+)+)(NN\s*)*NN$"

I have also let the whitespace after each repeated NN occur zero or more times, instead of one or more times as in your original pattern. Feel free to change it.
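The two points above (capturing versus non-capturing groups, and \s* versus \s+) can be seen in a short sketch. The non_capturing pattern here is a hypothetical variant written with (?:...) everywhere, not one of the patterns given above, and the input "DT RB NNNN" is invented only to show the effect of \s*.

```python
import re

# Hypothetical fully non-capturing variant (every group written as (?:...)).
non_capturing = re.compile(r"(?:DT\s+)+(?:(?:RB\s+)+|(?:JJ\s+)+)(?:NN\s+)*NN$")
# The simpler pattern from the answer, with plain capturing groups and \s*.
capturing = re.compile(r"(DT\s+)+((RB\s+)+|(JJ\s+)+)(NN\s*)*NN$")

text = "DT RB RB NN NN"
nc_groups = non_capturing.match(text).groups()
c_groups = capturing.match(text).groups()
# Number of groups each pattern reports for the same match.
print(len(nc_groups), len(c_groups))  # 0 5

# Made-up input with no whitespace between the NN tags.
glued = "DT RB NNNN"
strict_match = non_capturing.match(glued) is not None
loose_match = capturing.match(glued) is not None
print(strict_match, loose_match)  # False True
```

Both patterns accept the same five strings from the question. They differ only in what groups() returns and in whether NN tags written without separating whitespace are allowed, so which one to use depends on whether either of those matters for your data.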
