datooa datooa - 6 months ago 12
Python Question

Regex in Python 3: match everything after a number or optional period but before an optional comma

I'm trying to return ingredients from recipes without any measurements or directions. Each recipe's ingredients are a list of strings like the following:

['1 medium tomato, cut into 8 wedges',
'4 c. torn mixed salad greens',
'1/2 small red onion, sliced and separated into rings',
'1/4 small cucumber, sliced',
'1/4 c. sliced pitted ripe olives',
'2 Tbsp. reduced-calorie Italian salad dressing',
'2 Tbsp. lemon juice',
'1 Tbsp. water',
'1/2 tsp. dried mint, crushed',
'1/4 c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese']


I want to return the following list:

['medium tomato',
'torn mixed salad greens',
'small red onion',
'small cucumber',
'sliced pitted ripe olives',
'reduced-calorie Italian salad dressing',
'lemon juice',
'water',
'dried mint',
'crumbled Blue cheese']


The closest pattern I've found is:

pattern = r'[\s\d\.]* ([^\,]+).*'


but in testing with:

import re

for ing in ingredients:
    print(re.findall(pattern, ing))


the measurement abbreviations and their periods are returned as well, e.g.:

['c. torn mixed salad greens']


while

pattern = r'(?<=\. )[^.]*$'


fails to match entries that contain no period, and keeps the text after the comma when both appear, i.e.:

[]
['torn mixed salad greens']
[]
[]
['sliced pitted ripe olives']
['reduced-calorie Italian salad dressing']
['lemon juice']
['water']
['dried mint, crushed']
['crumbled Blue cheese']


Thank you in advance!

Answer

You can use this pattern:

import re

for ing in ingredients:
    print(re.search(r'(?i)[a-z][^.,]*(?![^,])', ing).group())

pattern details:

(?i)        # make the pattern case insensitive (an inline global flag must come
            # first; Python 3.11+ raises an error if it appears anywhere else)
[a-z][^.,]* # a substring that starts with a letter and that doesn't contain a period
            # or a comma
(?![^,])    # not followed by a character that is not a comma
            # (in other words, followed by a comma or the end of the string)
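
For reference, here is a minimal sketch of how the pattern could be applied to the whole list from the question. The names INGREDIENT and extract_ingredient are made up for illustration. Case insensitivity is passed as the re.IGNORECASE flag rather than an inline (?i), and re.VERBOSE lets the comments live inside the regex. The helper returns None instead of raising when a line has no match, which is an assumption about how you'd want to handle such lines.

```python
import re

# Same pattern as in the answer, compiled once. re.VERBOSE ignores the
# whitespace and comments inside the pattern string.
INGREDIENT = re.compile(
    r"""
    [a-z][^.,]*   # starts with a letter, contains no period or comma
    (?![^,])      # followed by a comma or the end of the string
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_ingredient(line):
    """Hypothetical helper: return the ingredient name, or None if no match."""
    match = INGREDIENT.search(line)
    return match.group().strip() if match else None


ingredients = [
    '1 medium tomato, cut into 8 wedges',
    '4 c. torn mixed salad greens',
    '1/2 small red onion, sliced and separated into rings',
    '1/4 small cucumber, sliced',
    '1/4 c. sliced pitted ripe olives',
    '2 Tbsp. reduced-calorie Italian salad dressing',
    '2 Tbsp. lemon juice',
    '1 Tbsp. water',
    '1/2 tsp. dried mint, crushed',
    '1/4 c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese',
]

result = [extract_ingredient(ing) for ing in ingredients]
print(result)
```

On the sample data this should produce the list asked for in the question. Note that when a line offers alternatives ("... or ..."), only the last one is kept, which matches the expected output above.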