zhespelt zhespelt - 5 months ago 22
Python Question

How do I parse (potentially unknown) key/value pairs from a list of strings with default values?

I have a string column in a Pandas DataFrame that contains database settings and I want to turn the key/value pairs into new columns in the DataFrame, filling in any missing values with

True
.

Here's what one of the strings looks like:

Status=ONLINE, Updateability=READ_WRITE, UserAccess=MULTI_USER, Recovery=SIMPLE, Version=706, Collation=SQL_Latin1_General_CP1_CI_AS, SQLSortOrder=52, IsAutoCreateStatistics, IsAutoUpdateStatistics, IsFullTextEnabled


And I'd like the output to look something like:

| Status | Updateability | UserAccess | Recovery | Version | Collation | SQLSortOrder | IsAutoCreateStatistics | IsAutoUpdateStatistics | IsFullTextEnabled |
| ONLINE | READ_WRITE | MULTI_USER | SIMPLE | 706 | SQL_Latin1_General_CP1_CI_AS | 52 | True | True | True |


I've tried iterating through the rows, parsing the string with the
re.sub
and
split()
, which got me to the key/value pairs for settings which follow the
key=value
format. I hard-coded the known boolean keys (as shown in the code below), but I'd like to have this work on unknown booleans.

The point I'm getting stuck on now is rolling the results for a particular key/value pair back up to the correct column/row it's supposed to go in.

import re

_status = ['Status=ONLINE, Updateability=READ_WRITE, UserAccess=MULTI_USER, Recovery=SIMPLE, Version=706, Collation=SQL_Latin1_General_CP1_CI_AS, SQLSortOrder=52, IsAutoCreateStatistics, IsAutoUpdateStatistics, IsFullTextEnabled', 'Status=OFFLINE, Updateability=READ, UserAccess=SINGLE_USER, Recovery=SIMPLE, Version=706, Collation=SQL_Latin1_General_CP1_CI_AS, SQLSortOrder=53, IsAutoCreateStatistics, IsAutoUpdateStatistics, IsTornPageDetectionEnabled']

_status = [re.sub('IsAutoCreateStatistics', 'IsAutoCreateStatistics=True', _s) for _s in _status]
_status = [re.sub('IsAutoUpdateStatistics', 'IsAutoUpdateStatistics=True', _s) for _s in _status]
_status = [re.sub('IsFullTextEnabled', 'IsFullTextEnabled=True', _s) for _s in _status]
_status = [re.sub('IsTornPageDetectionEnabled', 'IsTornPageDetectionEnabled=True', _s) for _s in _status]

_rows = []

for _row in _status:
_kvs = []
for _kv in _row.split(','):
_key, _val = _kv.split('=')
_d = {_key.strip(): _val.strip()}
_kvs.append(_d)
_rows.append(_kvs)

_rows


Which nets me:

[[{'Status': 'ONLINE'},
{'Updateability': 'READ_WRITE'},
{'UserAccess': 'MULTI_USER'},
{'Recovery': 'SIMPLE'},
{'Version': '706'},
{'Collation': 'SQL_Latin1_General_CP1_CI_AS'},
{'SQLSortOrder': '52'},
{'IsAutoCreateStatistics': 'True'},
{'IsAutoUpdateStatistics': 'True'},
{'IsFullTextEnabled': 'True'}],
[{'Status': 'OFFLINE'},
{'Updateability': 'READ'},
{'UserAccess': 'SINGLE_USER'},
{'Recovery': 'SIMPLE'},
{'Version': '706'},
{'Collation': 'SQL_Latin1_General_CP1_CI_AS'},
{'SQLSortOrder': '53'},
{'IsAutoCreateStatistics': 'True'},
{'IsAutoUpdateStatistics': 'True'},
{'IsTornPageDetectionEnabled': 'True'}]]


This is the best I've come up with so far, and it looks like I'm close, but I'm sure there's a much more Pythonic way to do this.

Answer

You could simplify using str.partition,no need for any regex, just split the string :

s = "Status=ONLINE, Updateability=READ_WRITE, UserAccess=MULTI_USER, Recovery=SIMPLE, Version=706, Collation=SQL_Latin1_General_CP1_CI_AS, SQLSortOrder=52, IsAutoCreateStatistics, IsAutoUpdateStatistics, IsFullTextEnabled"

def repl(s):
    for sub in s.split(", "):
        k, _, v = sub.partition("=")
        yield (k, v) if v else (k, "True")

Any substring after splitting when partitioned on = that has an empty string for v means we don't have a pairing so we need no nor care about any prior knowledge of potential keys:

In [9]: s = "Status=ONLINE, Updateability=READ_WRITE, UserAccess=MULTI_USER, Recovery=SIMPLE, Version=706, Collation=SQL_Latin1_General_CP1_CI_AS, SQLSortOrder=52, IsAutoCreateStatistics, IsAutoUpdateStatistics, IsFullTextEnabled"

In [10]: dict(repl(s))
Out[10]: 
{'Collation': 'SQL_Latin1_General_CP1_CI_AS',
 'IsAutoCreateStatistics': 'True',
 'IsAutoUpdateStatistics': 'True',
 'IsFullTextEnabled': 'True',
 'Recovery': 'SIMPLE',
 'SQLSortOrder': '52',
 'Status': 'ONLINE',
 'Updateability': 'READ_WRITE',
 'UserAccess': 'MULTI_USER',
 'Version': '706'}

Also if the spaces are possibly not there you could just split on "," and strip the whitespace or use the csv lib settin skipinitialspace=True which will handle "foo, bar" and "foo,bar":

from csv import reader
from itertools import chain

def repl(s):
    for sub in chain(*reader([s], skipinitialspace=True)):
        k, _, v = sub.partition("=")
        yield (k, v) if v else (k, "True")


print(dict(repl(s)))

if you just wanted to create a new df and all keys are the same:

from csv import reader
from itertools import chain
import pandas as pd

def repl(s):
    d = {}
    for sub in chain(*reader([s], skipinitialspace=True)):
        k, _, v = sub.partition("=")
        d[k] = v if v else "True"
    return d




df =  pd.DataFrame.from_records(map(repl ,status))

print(df)

But as per the discussion, if you see a key later that you habe not seen in the first dict, you will get nan for the row value not True