Evan Benn - 9 months ago
Python Question

Extracting key value pairs from string with quotes

I am having trouble coding an 'elegant' parser for this requirement (one that does not look like a piece of C breakfast). The input is a string of key-value pairs separated by ',' and joined by '='.


The part tricking me is that values can be quoted ("), and inside the quotes ',' does not end the value.


This last part has made it tricky for me to use split or re.split, so I ended up resorting to for i in range loops :(.

Can anyone demonstrate a clean way to do this?

It is OK to assume that quotes happen only in values, and that there are no whitespace or non-alphanumeric characters.

Answer Source

I would advise against using regular expressions for this task, because the language you want to parse is not regular.

You have a character string of multiple key value pairs. The best way to parse this is not to match patterns on it, but to properly tokenize it.
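To illustrate why plain splitting falls short, here is a minimal sketch. The input string is a made-up example, not one taken from the question:

```python
# Hypothetical input: the second value is quoted and contains a comma.
text = 'key1=value1,key2="value2,still_value2"'

# Splitting on ',' ignores the quotes, so the quoted value is cut in two.
naive_items = text.split(",")
print(naive_items)
# ['key1=value1', 'key2="value2', 'still_value2"']
```

The third fragment has no '=' at all, so it can no longer be turned into a key-value pair.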

There is a module in the Python standard library, called shlex, that mimics the parsing done by POSIX shells, and that provides a lexer implementation that can easily be customized to your needs.

from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Parse key-value pairs from a shell-like text."""
    # initialize a lexer, in POSIX mode (to properly handle escaping)
    lexer = shlex(text, posix=True)
    # set ',' as whitespace for the lexer
    # (the lexer will use this character to separate words)
    lexer.whitespace = item_sep
    # include '=' as a word character 
    # (this is done so that the lexer returns a list of key-value pairs)
    # (if your option key or value contains any unquoted special character, you will need to add it here)
    lexer.wordchars += value_sep
    # then we separate option keys and values to build the resulting dictionary
    # (maxsplit is required to make sure that '=' in value will not be a problem)
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

Example run :
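A minimal sketch of such a run. The input string is my assumption, chosen to match the output shown below, and the function is repeated (without its comments) so the snippet runs on its own:

```python
from shlex import shlex


def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Same function as above, repeated so this snippet runs on its own."""
    lexer = shlex(text, posix=True)
    lexer.whitespace = item_sep
    lexer.wordchars += value_sep
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)


# Assumed input: the single-quoted value contains commas, '=' and double quotes.
text = "key1=value1,key2='value2,still_value2,not_key1=\"not_value1\"'"
result = parse_kv_pairs(text)
print(result)
```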


Output :

{'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}

EDIT: I forgot to add that the reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you need to allow more possible inputs later on. I never found a way to properly parse such key-value pairs with regular expressions; there will always be inputs (e.g. A="B=\"1,2,3\"") that trick the engine.
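As a rough illustration of that input, the sketch below compares a simple regular expression with the lexer. The pattern is a hypothetical naive attempt that I made up for the comparison, not a claim about the best regex one could write:

```python
import re
from shlex import shlex

# The tricky input mentioned above: escaped double quotes inside a quoted value.
text = r'A="B=\"1,2,3\""'

# Hypothetical naive pattern: a key, '=', then a quoted or an unquoted value.
naive = re.findall(r'(\w+)=("[^"]*"|[^,]*)', text)
print(naive)  # the quoted alternative stops at the first escaped quote

# Same lexer setup as in parse_kv_pairs: ',' separates items, '=' is a word character.
lexer = shlex(text, posix=True)
lexer.whitespace = ","
lexer.wordchars += "="
pairs = dict(word.split("=", maxsplit=1) for word in lexer)
print(pairs)  # {'A': 'B="1,2,3"'}
```

In POSIX mode the lexer unescapes the inner quotes, which is why the value comes back whole.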

If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), regular expressions are perfectly fine.

EDIT2: split has a maxsplit argument, which is much cleaner to use than splitting/slicing/joining. Thanks to @cdlane for his sound input!
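A small sketch of the difference, using a made-up token whose value itself contains '=':

```python
word = "key2=a=b"  # hypothetical token, for illustration only

# With maxsplit=1, only the first '=' separates the key from the value.
key, value = word.split("=", maxsplit=1)
print(key, value)  # key2 a=b

# The clumsier alternative: split on every '=', then re-join the tail.
parts = word.split("=")
same_value = "=".join(parts[1:])
print(same_value)  # a=b
```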