Belphegor - 4 months ago
Python Question

Python regex: XOR operator

Suppose I have strings like these:

  1. "DT NN IN NN"

  2. "DT RB JJ NN"

  3. "DT JJ JJ NN"

  4. "DT RB RB NN NN"

  5. "DT RB RB"

So, I have a list of strings:

list = ["DT NN IN NN", "DT RB JJ NN", "DT JJ JJ NN", "DT RB RB NN NN", "DT RB RB"]

I have the following code:

import re

pattern = r"(?:DT\s+)+([?:RB\s+|?:JJ\s+])+(?:NN\s+)*NN$"
for item in list:
    m = re.match(pattern, item)
    if m:
        print(item)

What I want from pattern is to match the strings that start with DT (appearing one or more times), then have either RB or JJ (appearing once or more), but not both, and then end with NN (again, appearing once or more).

So, in the final result I should get 3 and 4 printed on the screen. However, with my regex I also get 2, which I don't want. How do I change pattern so this works? How do I replace the pipe (OR) with an XOR?


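The problem is in how you define the presence of RB and JJ. Square brackets create a character class, so [?:RB\s+|?:JJ\s+] matches any single character from that set (R, B, J, whitespace and so on) rather than the tokens RB or JJ, and nothing in the pattern says that only one of the two may be present. You can express that by putting each token in its own group, letting each group repeat one or more times (+), and separating the two groups with a | (pipe). Try changing your pattern to this: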

pattern = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"
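
As a quick check, here is a minimal sketch that runs both the pattern from the question and the suggested one over the five sample strings. The names samples, original, fixed and matches are my own and do not come from the question.

```python
import re

samples = [
    "DT NN IN NN",
    "DT RB JJ NN",
    "DT JJ JJ NN",
    "DT RB RB NN NN",
    "DT RB RB",
]

# Pattern from the question: the square brackets form a character class,
# so any mix of the characters R, B, J, ?, :, +, | and whitespace is accepted.
original = r"(?:DT\s+)+([?:RB\s+|?:JJ\s+])+(?:NN\s+)*NN$"

# Suggested pattern: a run of RB tokens or a run of JJ tokens, never both.
fixed = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"


def matches(pattern, strings):
    """Return the 1-based positions of the strings that the pattern matches."""
    return [i for i, s in enumerate(strings, start=1) if re.match(pattern, s)]


print(matches(original, samples))  # [2, 3, 4]
print(matches(fixed, samples))     # [3, 4]
```

String 2 drops out because, once the RB run has been consumed, the next token has to be NN, and JJ is not.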

Besides, (?:<something>) is called a non-capturing group. You use it to say "I want <something> to be matched, but not included when I select groups later". By the looks of it, you are not using any groups: you are simply printing the whole item (unless you've trimmed the code for brevity). If you do not care which groups get captured, you can drop the ?: markers. Here is a simpler version that would work for you:

pattern = r"(DT\s+)+((RB\s+)+|(JJ\s+)+)(NN\s*)*NN$"

I have also let the whitespace after each NN occur zero or more times, instead of one or more times as in your original pattern. Note that this also allows NN tokens to run together (for example NNNN). Feel free to change it.
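
To make the two differences concrete, here is a small sketch comparing the two patterns. The variable names and the string "DT JJ NNNN" are my own, chosen only for illustration.

```python
import re

non_capturing = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"
capturing = r"(DT\s+)+((RB\s+)+|(JJ\s+)+)(NN\s*)*NN$"

s = "DT JJ JJ NN NN"

# Only the RB and JJ groups are numbered in the first pattern.
groups_nc = re.match(non_capturing, s).groups()
# Every parenthesised part is numbered in the second pattern.
groups_c = re.match(capturing, s).groups()
print(len(groups_nc), groups_nc)
print(len(groups_c), groups_c)

# With \s* the NN tokens no longer need whitespace between them.
glued = "DT JJ NNNN"
print(bool(re.match(non_capturing, glued)))  # False
print(bool(re.match(capturing, glued)))      # True
```

Both patterns accept and reject the same five sample strings. They differ only in how many groups the match object reports and in whether strings such as "DT JJ NNNN" are allowed.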