Suppose I have strings like these:
"DT NN IN NN"
"DT RB JJ NN"
"DT JJ JJ NN"
"DT RB RB NN NN"
"DT RB RB"
import re

list = ["DT NN IN NN", "DT RB JJ NN", "DT JJ JJ NN", "DT RB RB NN NN", "DT RB RB"]
pattern = r"(?:DT\s+)+([?:RB\s+|?:JJ\s+])+(?:NN\s+)*NN$"
for item in list:
    m = re.match(pattern, item)
    if m:
        print(item)
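The problem is in how you define the presence of RB and JJ. Square brackets create a character class, so [?:RB\s+|?:JJ\s+] matches any single one of the listed characters rather than one of two groups, and nothing in the pattern says that only one of RB or JJ may be present. This can be achieved by separating them with a | (pipe) and letting either of them repeat one or more times (+). Try changing your pattern to this: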
pattern = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"
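As a quick check, here is a minimal sketch that runs the pattern from the question and the corrected one over the five sample strings. The names items, original, corrected and the helper matching are mine, chosen for illustration only.

```python
import re

items = ["DT NN IN NN", "DT RB JJ NN", "DT JJ JJ NN", "DT RB RB NN NN", "DT RB RB"]

# Pattern from the question: the square brackets form a character class,
# so the characters of RB and JJ (and whitespace) can be freely mixed.
original = r"(?:DT\s+)+([?:RB\s+|?:JJ\s+])+(?:NN\s+)*NN$"
# Corrected pattern: either a run of RB or a run of JJ, never a mix.
corrected = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"


def matching(pattern, strings):
    """Return the strings that the pattern matches from the start."""
    return [s for s in strings if re.match(pattern, s)]


print(matching(original, items))
# ['DT RB JJ NN', 'DT JJ JJ NN', 'DT RB RB NN NN']
print(matching(corrected, items))
# ['DT JJ JJ NN', 'DT RB RB NN NN']
```

The mixed string "DT RB JJ NN" is accepted by the original pattern and rejected by the corrected one, which is the only difference on this sample.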
(?:<something>) is called a non-capturing group. You use it to say "I want <something> to be matched, but not included when I select groups later." By the looks of it, you are not using any groups; you are simply printing the whole item (unless you've trimmed the code for brevity). If you actually do not need groups, here is a simpler version that would work for you:
pattern = r"(DT\s+)+((RB\s+)+|(JJ\s+)+)(NN\s*)*NN$"
I have also let the whitespace after each repeated NN occur zero or more times, instead of one or more times as in your original pattern. Feel free to change it back.
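To make the two differences concrete, here is a small sketch comparing the non-capturing version with the simpler one. The names strict, simple and glued, and the input "DT JJ NNNN", are made up for illustration and do not come from the question.

```python
import re

strict = r"(?:DT\s+)+(?:(RB\s+)+|(JJ\s+)+)(?:NN\s+)*NN$"
simple = r"(DT\s+)+((RB\s+)+|(JJ\s+)+)(NN\s*)*NN$"

# (?:...) groups are not numbered, plain (...) groups are.
print(re.compile(strict).groups)  # 2
print(re.compile(simple).groups)  # 5

m_strict = re.match(strict, "DT RB RB NN NN")
m_simple = re.match(simple, "DT RB RB NN NN")

# A repeated group keeps only its last repetition; the unused branch is None.
print(m_strict.groups())  # ('RB ', None)
# Group 2 of the simpler pattern wraps the whole RB-or-JJ run.
print(repr(m_simple.group(2)))  # 'RB RB '

# Hypothetical input with no whitespace between the NN tags.
glued = "DT JJ NNNN"
print(bool(re.match(strict, glued)))  # False
print(bool(re.match(simple, glued)))  # True
```

If tags glued together like "NNNN" should not count as a match, keeping \s+ in the repeated NN part is the safer choice.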