Zoba Zoba - 2 months ago 9
JSON Question

filter invalid values in json string

I'm receiving a string in an HTML body that I'm trying to process into valid JSON. The string I receive isn't valid JSON and looks like this:

äÄ
"key1": " 10",
"key2": "beigef}gtem Zahlschein",
"key3": " G E L \ S C H T",
"key4": "M}nchen",
"key5": "M{rz",
"key6": "[huus"
Ü
ä


I've written a function that replaces all the faulty characters to create a valid JSON string, but how do I do the reverse without destroying the characters that the JSON itself needs?

This is how I replaced the characters:

private static string FixChars(string input)
{
    if (!string.IsNullOrEmpty(input))
    {
        if (input.Contains("["))
        {
            input = input.Replace("[", "Ä");
        }
        if (input.Contains(@"\"))
        {
            input = input.Replace(@"\", "Ö");
        }
        if (input.Contains("]"))
        {
            input = input.Replace("]", "Ü");
        }
        if (input.Contains("{"))
        {
            input = input.Replace("{", "ä");
        }
        if (input.Contains("|"))
        {
            input = input.Replace("|", "ö");
        }
        if (input.Contains("}"))
        {
            input = input.Replace("}", "ü");
        }
        if (input.Contains("~"))
        {
            input = input.Replace("~", "ß");
        }
        // DS_Stern caused problems when creating the XML
        //if (input.Contains("*"))
        //{
        //    input = input.Replace("*", "Stern");
        //}
    }
    return input;
}


Then I tried to deserialize the JSON array into an array of dictionaries like this:

deserializedRequest = JsonConvert.DeserializeObject<Dictionary<string, string>[]>(json);


How do I access the individual dictionaries, apply my FixChars method to the values, and reserialize a valid JSON string from the result?

dbc dbc
Answer

It looks as though a stream or byte array that had been encoded in UTF-8 (or, possibly, ISO-8859-1) was decoded using Encoding.GetEncoding("x-IA5-German"). To verify this, I created the following test app:

        var original = "{}[]";
        foreach (var encoding in Encoding.GetEncodings())
        {
            var s = encoding.GetEncoding().GetString(Encoding.UTF8.GetBytes(original));
            if (s == "äüÄÜ")
            {
                Console.WriteLine(string.Format("Match found for encoding display name {0} (code page {1})", encoding.Name, encoding.CodePage));
            }
        }

The only match reported was:

Match found for encoding display name x-IA5-German (code page 20106)

And similarly for ISO-8859-1:

Console.WriteLine(Encoding.GetEncoding("x-IA5-German").GetString(Encoding.GetEncoding("ISO-8859-1").GetBytes("{}[]")));

Prints äüÄÜ.

You should decode the stream with the appropriate encoding to avoid the problem. Possibly the web page has the wrong character set in its header?
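
A minimal sketch of that advice, assuming the response body is available as raw bytes and that the sender really used UTF-8 (both are assumptions, and `BodyDecoding` / `DecodeBody` are hypothetical names). On .NET Core and .NET 5+, legacy code pages such as x-IA5-German are only available after registering `CodePagesEncodingProvider`:

```csharp
using System;
using System.Text;

public static class BodyDecoding
{
    // Hypothetical helper: decode raw response bytes with an explicitly chosen encoding.
    public static string DecodeBody(byte[] rawBody, Encoding encoding)
    {
        return encoding.GetString(rawBody);
    }

    // Decodes the same bytes once with the sender's encoding and once with a wrong one.
    public static void Demo()
    {
        // Assumption: needed on .NET Core / .NET 5+ to make x-IA5-German available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Stand-in for the bytes received over HTTP (assumed to be UTF-8).
        byte[] rawBody = Encoding.UTF8.GetBytes("{\"key1\": [1, 2]}");

        // Decoded with the encoding the sender used: the JSON punctuation survives.
        Console.WriteLine(DecodeBody(rawBody, Encoding.UTF8));

        // Decoded with x-IA5-German: braces and brackets turn into umlauts.
        Console.WriteLine(DecodeBody(rawBody, Encoding.GetEncoding("x-IA5-German")));
    }
}
```

Calling `BodyDecoding.Demo()` prints both decodings side by side, which makes it easy to compare against what your own code currently produces.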

Update

If we make the further assumption that the HTML was not encoded using UTF-8 or ISO-8859-1 or anything else common, and was then incorrectly decoded as well by your code, we need to search for pairs of encodings that produce the wrong result, like so:

var original = "{}[]";
var target = "äüÄÜ";

foreach (var toEncoding in Encoding.GetEncodings())
    foreach (var fromEncoding in Encoding.GetEncodings())
    {
        var s = toEncoding.GetEncoding().GetString(fromEncoding.GetEncoding().GetBytes(original));
        if (s == target)
        {
            Console.WriteLine(string.Format("Match Found: Encoding via {0} and decoding via {1}", fromEncoding.Name, toEncoding.Name));
        }
    }

This produces 147 matches, including many involving x-IA5-German such as:

Match Found: Encoding via iso-8859-1 and decoding via x-IA5-German
Match Found: Encoding via utf-8 and decoding via x-IA5-German

But it also throws up a bunch of weird matches using IBM encodings, like:

Match Found: Encoding via IBM01141 and decoding via IBM037
Match Found: Encoding via IBM273 and decoding via IBM037

For the full list see this fiddle.

So if I try to fix your JSON by reverse-encoding it as follows:

    private static void TestJsonFix()
    {
        TestJsonFix("IBM01141", "IBM037");
    }

    private static void TestJsonFix(string toEncodingName, string fromEncodingName)
    {
        var json = @"äÄ
    ""key1"": ""  10"",
    ""key2"": ""beigef}gtem Zahlschein"",
    ""key3"": ""     G E L \ S C H T"",
    ""key4"": ""M}nchen"",
    ""key5"": ""M{rz"",
    ""key6"": ""[huus""
Ü
ä";
        Console.WriteLine(string.Format("Testing re-encoding from \"{0}\" to \"{1}\"", fromEncodingName, toEncodingName));
        var fixedJson = Encoding.GetEncoding(toEncodingName).GetString(Encoding.GetEncoding(fromEncodingName).GetBytes(json));
        Console.WriteLine(fixedJson);
    }   

I get the following result, which looks plausible:

{[
    "key1": "  10",
    "key2": "beigefügtem Zahlschein",
    "key3": "     G E L Ö S C H T",
    "key4": "München",
    "key5": "März",
    "key6": "¬huus"
]
{

So, could you be using the wrong IBM encoding when decoding your HTML from your Unisys A-Series type of machine (cobol74)?
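
To judge a candidate pair more closely, it can help to look at single characters rather than the whole document. The following is a rough sketch (the class and method names are made up for illustration) that records, for each character, what it turns into when encoded with one code page and decoded with another:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharMapDiagnostic
{
    // For each character in `chars`, encode it with fromEncodingName and
    // decode the resulting bytes with toEncodingName.
    public static Dictionary<char, string> Map(string chars, string toEncodingName, string fromEncodingName)
    {
        // Assumption: needed on .NET Core / .NET 5+ to make the IBM code pages available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding toEncoding = Encoding.GetEncoding(toEncodingName);
        Encoding fromEncoding = Encoding.GetEncoding(fromEncodingName);

        var result = new Dictionary<char, string>();
        foreach (char c in chars)
        {
            byte[] bytes = fromEncoding.GetBytes(new[] { c });
            result[c] = toEncoding.GetString(bytes);
        }
        return result;
    }
}
```

Going by the output above, `CharMapDiagnostic.Map("äÄÜ{}[\\", "IBM01141", "IBM037")` should map `}` to `ü` but `[` to `¬`, which suggests that this particular pair is close but not exactly right.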

Update 2

If I run the search on the full actual and desired JSON, I get a much smaller set of possible encodings, all IBM related:

Match Found: Encoding via IBM01141 and decoding via IBM500
Match Found: Encoding via IBM273 and decoding via IBM500
Match Found: Encoding via IBM01141 and decoding via IBM870
Match Found: Encoding via IBM273 and decoding via IBM870
Match Found: Encoding via IBM500 and decoding via IBM01141
Match Found: Encoding via IBM870 and decoding via IBM01141
Match Found: Encoding via IBM01145 and decoding via IBM01141
Match Found: Encoding via IBM01148 and decoding via IBM01141
Match Found: Encoding via IBM284 and decoding via IBM01141
Match Found: Encoding via IBM01141 and decoding via IBM01145
Match Found: Encoding via IBM273 and decoding via IBM01145
Match Found: Encoding via IBM01141 and decoding via IBM01148
Match Found: Encoding via IBM273 and decoding via IBM01148
Match Found: Encoding via IBM500 and decoding via IBM273
Match Found: Encoding via IBM870 and decoding via IBM273
Match Found: Encoding via IBM01145 and decoding via IBM273
Match Found: Encoding via IBM01148 and decoding via IBM273
Match Found: Encoding via IBM284 and decoding via IBM273
Match Found: Encoding via IBM01141 and decoding via IBM284
Match Found: Encoding via IBM273 and decoding via IBM284
Found 20 matches
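
A search like this can be wrapped in a small helper that takes the two full strings. This is only a sketch: `EncodingPairSearch.Find` is a hypothetical name, and it assumes a runtime where `Encoding.GetEncodings()` lists the legacy code pages (on .NET Framework it does by default; on .NET 5+ only after the provider registration shown in the code):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingPairSearch
{
    // Returns every (encode, decode) name pair for which encoding `original`
    // with the first and decoding with the second yields exactly `target`.
    public static List<Tuple<string, string>> Find(string original, string target)
    {
        // Assumption: needed on .NET Core / .NET 5+ so legacy code pages are listed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var matches = new List<Tuple<string, string>>();
        EncodingInfo[] infos = Encoding.GetEncodings();

        foreach (EncodingInfo encodeInfo in infos)
        {
            // Encode once per source encoding, then try every decoder on the same bytes.
            byte[] bytes = encodeInfo.GetEncoding().GetBytes(original);
            foreach (EncodingInfo decodeInfo in infos)
            {
                if (decodeInfo.GetEncoding().GetString(bytes) == target)
                {
                    matches.Add(Tuple.Create(encodeInfo.Name, decodeInfo.Name));
                }
            }
        }
        return matches;
    }
}
```

Passing the complete desired and received JSON instead of just "{}[]" is what cuts the list down, because every additional character has to survive the same pair of code pages.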

Then if I add the following extension method:

public static class TextExtensions
{
    public static string Reencode(this string s, Encoding toEncoding, Encoding fromEncoding)
    {
        return toEncoding.GetString(fromEncoding.GetBytes(s));
    }
}

I can fix your JSON as follows:

var fixedJson = json.Reencode(Encoding.GetEncoding("IBM500"), Encoding.GetEncoding("IBM273"));

If you have a JSON sample on your Unisys computer containing a character, you could include that in the JSON to further narrow down the possibilities.
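
One cheap sanity check for any remaining candidate pair is to confirm that the fix can be undone without loss. The sketch below is an assumption about how such a check could look, not part of the original fix; it repeats the re-encoding logic as a plain static method so that it is self-contained, and `ReencodeCheck` and `RoundTrips` are hypothetical names:

```csharp
using System;
using System.Text;

public static class ReencodeCheck
{
    // Same logic as the Reencode extension method above.
    public static string Reencode(string s, Encoding toEncoding, Encoding fromEncoding)
    {
        return toEncoding.GetString(fromEncoding.GetBytes(s));
    }

    // True when applying the fix and then reversing it returns the input unchanged,
    // i.e. no character was replaced by a fallback such as '?' along the way.
    public static bool RoundTrips(string s, Encoding toEncoding, Encoding fromEncoding)
    {
        string fixedText = Reencode(s, toEncoding, fromEncoding);
        return Reencode(fixedText, fromEncoding, toEncoding) == s;
    }
}
```

On .NET Core and .NET 5+, call `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` once before `Encoding.GetEncoding("IBM500")` or `Encoding.GetEncoding("IBM273")`, since those code pages are not built in there. A pair that fails the round trip on your real data can be ruled out.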