Verbal_Kint - 2 months ago
JSON Question

JSON escape double quotes

I know questions with this title are common here, but a quick browse shows that they usually involve a single, isolated piece of JSON.

There are situations where " is used to signify inches, or wraps a phrase to mark a nickname of some sort. Either way, it appears inside the value string of a JS object, which is already wrapped in double quotes.

Here is an example of the JS object string I am having trouble with (I have working regex to double quote the keys and remove extra whitespace, but this is the scraped string in all of its glory):

'{\n\t\t\n\t\t\t\t\t\n\t\n\n\t\n\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"16241885":{title: "Nosefrida Fridababy Windi Gas & Colic Relief", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"8650356":{title: "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 100 Count", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"16249889":{title: "Nosefrida Nasal Aspirator Replacement Filters", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"8650355":{title: "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 40 Count", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"15490928":{title: "BabyGanics Newborn Ultra Absorbent Jumbo Size Diapers - 36 Count", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"14712536":{title: "Marvel Superhero Bandages", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"16263505":{title: "Nosefrida "The Snotsucker" Nasal Aspirator", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t\t\t, \n\t\t\n\t\n\n\t\t\n\n\t\t \n\n\n\n
"14848093":{title: "Zarbee\'s Children\'s Cough Syrup - Grape", isIneligible: false, isDiscontinued: false, isLowInventory: false, isAllowed: true}
\n\n\t\n\n\t\t\n\t \n\n\t\t\n\t}'


I have tried json.dumps on the string first, but that just double-escapes it and needs a double json.loads, which brings me back to square one. I have tried regex like this:

double_quotes_in_json = re.compile(r'(?<=:)(\s*"[^"]*)(")([^"]*)(")?(?=[^"]*",|"\s*\})')


def escape_double_quotes(jsn_string, pattern=double_quotes_in_json):
    for match in pattern.finditer(jsn_string):
        # current pattern only matches 1 instance of either one double quote in JSON value string
        # (presumably signifying inches) or 1 instance of phrase wrapped in double quotes
        # for something like nicknames
        # matches will have either 3 or 4 groups, representing one of the 2 match types described above
        groups_matched = len(match.groups())
        entire_match = match.group()
        if groups_matched == 3:
            # we only matched one double quote
            subbed_match = pattern.sub('$1\\$2$3', entire_match)
            jsn_string = re.sub(entire_match, subbed_match, jsn_string)
        elif groups_matched == 4:
            # we matched a phrase wrapped in double quotes
            subbed_match = pattern.sub('$1\\$2$3\\$4', entire_match)
            jsn_string = re.sub(entire_match, subbed_match, jsn_string)
    return jsn_string


And while this seems the most promising, it seems to re-insert the double quotes without the escape characters I have in the sub, while also not subbing the first group back in. (I have tried with and without a raw string, r, in the sub function.) So for the above problem section (below is a substring):

"16263505":{title: "Nosefrida "The Snotsucker" Nasal Aspirator"


The pattern doesn't sub group 1 back in and for some reason subs in a single quote (below is a substring of the failed regex processing):

"16263505":{title: "The Snotsucker"' Nasal Aspirator"


Either way, json.loads complains about the unescaped ".
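To make the failure concrete, here is a minimal sketch using a shortened, hypothetical one-entry version of the string (not the full scraped data). It shows json.dumps followed by json.loads handing back the same invalid text, and json.loads then rejecting that text because of the inner quotes.

```python
import json

# Hypothetical, shortened sample with unescaped inner double quotes.
raw = '{"title": "Nosefrida "The Snotsucker" Nasal Aspirator"}'

# json.dumps wraps the whole text in one JSON string literal...
encoded = json.dumps(raw)
# ...so a single json.loads only undoes that wrapping.
decoded = json.loads(encoded)

# Parsing the result as an object still fails at the first inner quote.
try:
    json.loads(decoded)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
```

After the round trip, decoded is identical to raw, so the inner quotes still have to be escaped before the text can be parsed.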

Edit 1:
My regex can pull out the unescaped quotes, but subbing them back in isn't behaving as expected. I am probably doing something stupid here and could use a fresh set of eyes.

Example output of my function with print statements:

low_inventory = response.xpath(
    '//script[contains(., "islistEligibility") or contains(., "ishlistEligibility")]/text()'
).re_first(r'(?s)(?<=registryWislistEligibilityMap)(?:\s*=\s*)(\{.+\})')

In [453]: for m in double_quotes_in_json.finditer(low_inventory):
     ...:     groups_matched = len(m.groups())
     ...:     print('groups: ', m.groups())
     ...:     entire_match = m.group()
     ...:     print('entire match: ', m.group())
     ...:     if groups_matched == 3:
     ...:         # we only matched a single double quote
     ...:         subbed_match = double_quotes_in_json.sub(r'$1\\$2$3', entire_match)
     ...:         print('subbed3: ', subbed_match)
     ...:         jsn_string = re.sub(entire_match, subbed_match, jsn_string)
     ...:     elif groups_matched == 4:
     ...:         subbed_match = double_quotes_in_json.sub(r'$1\\$2$3\\\$4', entire_match)
     ...:         print('subbed4: ', subbed_match)
     ...:         jsn_string = re.sub(entire_match, subbed_match, jsn_string)
     ...:     print(jsn_string)
     ...:
groups: (' "Nosefrida ', '"', 'The Snotsucker', '"')
entire match: "Nosefrida "The Snotsucker"
subbed4: "Nosefrida "The Snotsucker"
{ "16241885":{"title": "Nosefrida Fridababy Windi Gas & Colic Relief", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "8650356":{"title": "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 100 Count", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "16249889":{"title": "Nosefrida Nasal Aspirator Replacement Filters", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "8650355":{"title": "Babyganics Face- Hand & Baby Wipes- Fragrance Free- 40 Count", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "15490928":{"title": "BabyGanics Newborn Ultra Absorbent Jumbo Size Diapers - 36 Count", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "14712536":{"title": "Marvel Superhero Bandages", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "16263505":{"title": "The Snotsucker"' Nasal Aspirator", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true}, "14848093":{"title": "Zarbee's Children's Cough Syrup - Grape", "isIneligible": false, "isDiscontinued": false, "isLowInventory": false, "isAllowed": true} }

Answer

For some reason, using Python's built-in str.replace achieved the desired result, whereas re.sub did not properly escape the double quotes. (This was with group references in either a raw string with a single escape or a regular string with double escapes.) Either way, here is the working function. If someone has some insight into why replace works where re.sub does not, I would be very interested to hear it.

(Old code is commented out.)

import re

double_quotes_in_json = re.compile(r'(?<=:)(\s*")([^"]*)(")([^"]*)(")?(?=[^"]*",|"\s*\})')


def escape_double_quotes(jsn_string, pattern=double_quotes_in_json):
    for match in pattern.finditer(jsn_string):
        # current pattern only matches 1 instance of either one double quote in JSON value string
        # (presumably signifying inches) or 1 instance of phrase wrapped in double quotes
        # for something like nicknames
        # match.groups() always has 5 entries (one per group in the pattern); the optional
        # group 5 is None when only a single stray double quote was matched
        groups = match.groups()
        entire_match = match.group()
        print('groups: ', groups)
        print('entire: ', entire_match)
        if groups[4] is None:
            # we only matched one double quote
            # subbed_match = pattern.sub('$1$2\\$3$4', entire_match)
            # jsn_string = re.sub(entire_match, subbed_match, jsn_string)
            target = ''.join(groups[1:4])
            replaced = target.replace('"', '\\"')
            print(replaced)
            jsn_string = jsn_string.replace(target, replaced)
        else:
            # we matched a phrase wrapped in double quotes
            # subbed_match = pattern.sub('$1$2\\$3$4\\$5', entire_match)
            # jsn_string = re.sub(entire_match, subbed_match, jsn_string)
            target = ''.join(groups[1:])
            replaced = target.replace('"', '\\"')
            print(replaced)
            jsn_string = jsn_string.replace(target, replaced)
    return jsn_string
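
On the question of why str.replace worked where re.sub did not, a few things about Python's re module seem relevant. Python writes group references in a replacement template as \1 or \g<1>, so $1 is inserted as literal text. len(match.groups()) is the number of groups in the pattern, not the number that took part in the match. And running the same pattern over the matched substring alone can find nothing, because the colon the lookbehind needs is no longer in front of it. The sketch below illustrates each point; the sample text and the simplified patterns are illustrative assumptions, not the ones from the post.

```python
import re

# Hypothetical sample and simplified pattern, for illustration only.
text = 'title: "Nosefrida "The Snotsucker" Nasal Aspirator",'
pattern = re.compile(r'(title: ")(.+?)(?=",)')

# "$1" is not a group reference in Python; it is inserted literally.
literal = pattern.sub(r'$1', text)

# Python spells group references \1 or \g<1>.
unchanged = pattern.sub(r'\g<1>\g<2>', text)


# A function replacement avoids backslash handling in the template entirely.
def escape_inner_quotes(match):
    return match.group(1) + match.group(2).replace('"', '\\"')


escaped = pattern.sub(escape_inner_quotes, text)

# groups() has one entry per group in the pattern, matched or not.
optional = re.compile(r'(a)(b)?')
count = len(optional.match('a').groups())

# A lookbehind that matched in context fails on the extracted match text,
# because the preceding colon is not part of that text.
lookbehind = re.compile(r'(?<=:)\s*"\w+')
found = lookbehind.search('key: "value')
refound = lookbehind.search(found.group())
```

Note also that re.sub(entire_match, ...) treats its first argument as a regular expression, so matched text containing regex metacharacters would need re.escape before being used that way.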

Edit #1 (AKA the after-some-sleep approach):

double_quotes_in_title_attr = re.compile(
    r'(?<="title":)(?:\s*")(?P<value>.+?)(?=",\s*"\w+":|"\s*\})'
)


def escape_double_quotes_in_title(jsn_string, pattern=double_quotes_in_title_attr):
    for match in pattern.finditer(jsn_string):
        target = match.group('value')
        replaced = target.replace('"', '\\"')
        jsn_string = jsn_string.replace(target, replaced)
    return jsn_string

# use this first to properly quote keys so the above pattern will match
unquoted_key_pattern = re.compile(r'(?!")(\'?(?P<key>\w+)\'?)(?=:\s*(?:"|false|true|\d|\[|\{))')

def fix_json_keys(jsn, pattern=unquoted_key_pattern):
    return pattern.sub(r'"\g<key>"', jsn)
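
As a rough end-to-end check, the two steps above can be chained and the result passed to json.loads. The patterns and functions below are reproduced from the code above so that the snippet runs on its own; the one-entry scraped string is a shortened, hypothetical sample in the same shape as the data in the question.

```python
import json
import re

# Patterns reproduced from the answer above.
unquoted_key_pattern = re.compile(
    r'(?!")(\'?(?P<key>\w+)\'?)(?=:\s*(?:"|false|true|\d|\[|\{))'
)
double_quotes_in_title_attr = re.compile(
    r'(?<="title":)(?:\s*")(?P<value>.+?)(?=",\s*"\w+":|"\s*\})'
)


def fix_json_keys(jsn, pattern=unquoted_key_pattern):
    return pattern.sub(r'"\g<key>"', jsn)


def escape_double_quotes_in_title(jsn_string, pattern=double_quotes_in_title_attr):
    for match in pattern.finditer(jsn_string):
        target = match.group('value')
        replaced = target.replace('"', '\\"')
        jsn_string = jsn_string.replace(target, replaced)
    return jsn_string


# Shortened, hypothetical sample (assumption, not the full scraped string).
scraped = (
    '{"16263505":{title: "Nosefrida "The Snotsucker" Nasal Aspirator", '
    'isIneligible: false, isAllowed: true}}'
)

quoted = fix_json_keys(scraped)                   # step 1: quote the bare keys
escaped = escape_double_quotes_in_title(quoted)   # step 2: escape quotes in titles
data = json.loads(escaped)                        # now parses as regular JSON
```

The order matters: the title pattern looks for "title": with the key already quoted, so fix_json_keys has to run first. Keep in mind that str.replace substitutes every occurrence of the title text in the string, not only the one the pattern matched.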

Thanks for the help @deceze.
