str = "xxxxxxxxxxxxxxxxxx"
match = re.match(r"^.*'(\d\s*.*)'*$",str)
The following regex matches each ingredient line and stores the wanted information in groups:
^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*)?)$
It defines three groups, separated from one another by whitespace:
^ marks the start of the string
(\d+) is the first group and matches at least one digit
\s+ is the first separator between groups and matches at least one whitespace character
([A-Za-z ]+) is the second group and matches at least one alphabetical character or space
\s+ is the second separator between groups and matches at least one whitespace character
(\d+(?:\.\d*)?) is the third group and matches at least one digit, optionally followed by a decimal point and further digits
$ marks the end of the string
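To see the three groups in action, here is a minimal sketch that applies the pattern to a single line. The sample strings other than the ONION RINGS line from your data are made up for illustration, and the pattern is written with the decimal part optional, as described in the list above.

```python
import re

# Pattern as described above; the trailing "?" makes the decimal part optional.
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*)?)$')

# One line taken from the test data: groups() returns the three captures as strings.
match = INGREDIENT_RE.match('1 ONION RINGS 6.00')
quantity, name, price = match.groups()
print(quantity, name, price, sep=' | ')  # 1 | ONION RINGS | 6.00

# Hypothetical line with a price that has no decimals: it still matches.
print(INGREDIENT_RE.match('2 ONION RINGS 12').groups())  # ('2', 'ONION RINGS', '12')

# A line that does not start with a quantity is rejected.
print(INGREDIENT_RE.match('TOTAL 59.00'))  # None
```

Note that the second group is greedy and allows spaces, so the regex engine backtracks to leave at least one whitespace character for the separator in front of the price.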
The regex that extracts the total is simple enough that I don't think it needs an explanation.
Here is some test code using your test data. It should be a good starting point:
import re

TEST_DATA = [
    'Table: Waiter: kenny',
    '======================================',
    '1 SAUSAGE WRAPPED WITH B 10.00',
    '1 ESCARGOT WITH GARLIC H 12.00',
    '1 PAN SEARED FOIE GRAS 15.00',
    '1 SAUTE FIELD MUSHROOM W 9.00',
    '1 CRISPY CHICKEN WINGS 7.00',
    '1 ONION RINGS 6.00',
    '----------------------------------',
    'TOTAL 59.00',
    'CASH 59.00',
    'CHANGE 0.00',
    'Signature:__________________________',
    'Thank you & see you again soon!',
]

# quantity, name, price (the decimal part of the price is optional)
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*)?)$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')

ingredients = []
total = None

for string in TEST_DATA:
    match = INGREDIENT_RE.match(string)
    if match:
        ingredients.append(match.groups())
        continue
    match = TOTAL_RE.match(string)
    if match:
        # group(1) returns the captured text itself, not a one-element tuple
        total = match.group(1)
        break  # nothing of interest after the TOTAL line

print(ingredients)
print(total)
[('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'), ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'), ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
59.00
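All captured values are strings. If you want to go one step further, a possible sketch is to convert the prices to numbers and check them against the total. This assumes the last column is already the amount for the whole line, and using Decimal for the money values is my own choice, not something from your code.

```python
from decimal import Decimal

# Parsed values copied from the output above (everything is still a string).
ingredients = [
    ('1', 'SAUSAGE WRAPPED WITH B', '10.00'),
    ('1', 'ESCARGOT WITH GARLIC H', '12.00'),
    ('1', 'PAN SEARED FOIE GRAS', '15.00'),
    ('1', 'SAUTE FIELD MUSHROOM W', '9.00'),
    ('1', 'CRISPY CHICKEN WINGS', '7.00'),
    ('1', 'ONION RINGS', '6.00'),
]
total = '59.00'

# Assumption: the last column is the line amount, so the amounts are simply summed.
# Decimal avoids the rounding surprises of float when dealing with money.
computed = sum(Decimal(price) for _qty, _name, price in ingredients)

print(computed)                    # 59.00
print(computed == Decimal(total))  # True
```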
Edit on Python raw strings:
An r prefix before a Python string literal indicates that it is a raw string, which means that backslash escape sequences (like \n, \t, etc.) are not interpreted.
To be clear, here is an example: in a standard string, \t is a single tab character. In a raw string it is two characters, a backslash followed by t:
r'\t' is equivalent to '\\t'
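You can check this quickly in the interpreter. The variable names below are just for illustration:

```python
plain = '\t'   # escape sequence is interpreted: a single tab character
raw = r'\t'    # raw string: a backslash followed by the letter t

print(len(plain))    # 1
print(len(raw))      # 2
print(raw == '\\t')  # True
```

This is why regex patterns are usually written as raw strings: the backslashes reach the re module untouched instead of being interpreted by Python first.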