t3chb0t - 1 month ago
C# Question

How to parse an escape sequence?

I'm writing a parser for my own markup and I need to handle a few escape sequences but I'm not sure which strategy I should choose.

In particular, I have two in mind.

Here's an example

foo \\\<bar baz

with two of them: \\ and \<.

When I scan the string char by char,

  1. should I detect the backslash \ and then check whether the next character is an escapable one, or

  2. should I check for the character and then look back to see whether it's preceded by a backslash \?



Are there any major (dis)advantages in either one?

Answer

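Don't do either. Option #2 is really bad because when you look back at the previous character and it's a backslash, how do you know whether that backslash was itself escaped or whether it's really escaping the current character? In your own example, \\\<, the third backslash is preceded by a backslash, yet it isn't escaped: it's the one escaping the <.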
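To make the look-back problem concrete, here is a minimal sketch. The class and method names (LookBehindDemo, IsEscapedNaive, IsEscapedByCounting) are made up for illustration and are not part of the question's parser. The naive check looks at only one previous character; the counting variant shows what a correct look-back would have to do, which is rescan the whole run of backslashes.

```csharp
using System;

// Hypothetical demo class; names are illustrative assumptions.
public static class LookBehindDemo
{
    // Naive option #2: "is the previous character a backslash?"
    public static bool IsEscapedNaive(string s, int i)
    {
        return i > 0 && s[i - 1] == '\\';
    }

    // A correct look-back has to count the whole run of backslashes
    // immediately before position i: an odd count means s[i] is escaped.
    public static bool IsEscapedByCounting(string s, int i)
    {
        int count = 0;
        for (int j = i - 1; j >= 0 && s[j] == '\\'; j--)
        {
            count++;
        }
        return count % 2 == 1;
    }
}
```

For the input foo \\\<bar baz, the third backslash is at index 6 and the < is at index 7. IsEscapedNaive reports the character at index 6 as escaped, which is wrong; IsEscapedByCounting reports it as not escaped, and reports the < at index 7 as escaped. Getting it right means rescanning backwards for every candidate character, which is exactly the bookkeeping a forward-scanning state variable gives you for free.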

You need to know where you are. The way to do that is a state machine. If you're only handling \r, \t, \n, \", and \\, you can get by with a very simple one, like this:

using System;

public static class StringExtensions
{
    private enum UnescapeState
    {
        Unescaped,
        Escaped
    }

    public static String Unescape(this String s)
    {
        var sb = new System.Text.StringBuilder();
        UnescapeState state = UnescapeState.Unescaped;

        foreach (var ch in s)
        {
            switch (state)
            {
                case UnescapeState.Escaped:
                    switch (ch)
                    {
                        case 't':
                            sb.Append('\t');
                            break;
                        case 'n':
                            sb.Append('\n');
                            break;
                        case 'r':
                            sb.Append('\r');
                            break;

                        case '\\':
                        case '\"':
                            sb.Append(ch);
                            break;

                        default:
                            throw new Exception("Unrecognized escape sequence '\\" + ch + "'");

                        //  Finally, what about stuff like '\x0a'? That's a much more 
                        //  complicated state machine. When you see 'x' in Escaped state,
                        //  you transition to UnescapeState.HexDigit0, then either 
                        //  UnescapeState.HexDigit1 or throw an exception, etc. 
                        //  Wicked fun to write. 
                    }
                    state = UnescapeState.Unescaped;
                    break;

                case UnescapeState.Unescaped:
                    if (ch == '\\')
                    {
                        state = UnescapeState.Escaped;
                    }
                    else
                    {
                        sb.Append(ch);
                    }
                    break;
            }
        }

        if (state == UnescapeState.Escaped)
        {
            throw new Exception("Unterminated escape sequence");
        }

        return sb.ToString();
    }
}

The one big thing here is that you can have various "accumulator" variables, like sb or one that accumulates hex digits if you're doing hex char escapes like \x0a, but you have no flags. The state variable should be the only way you keep track of the state you're in. Just keep adding more states to the enum. In a multi-digit hex escape, the third hex digit and the fourth hex digit are literally different state values in the enum.

Follow that rule mindlessly and you can write amazingly complicated state machines with an amazingly low IQ and the attention span of a gnat (I'M THE PROOF) and not mess it up.
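As a rough sketch of that rule applied to the hex case, here is one way the extra states might look. Several things here are assumptions rather than anything from the code above: the class name HexUnescaper, the HexValue helper, the choice of exactly two hex digits per \x escape, and the set of simple escapes (\\ and the \< from the question). Note that hex is an accumulator, not a flag; which digit comes next is recorded only in the state.

```csharp
using System;
using System.Text;

// Hypothetical sketch: assumes \x is always followed by exactly two hex digits.
public static class HexUnescaper
{
    private enum State { Unescaped, Escaped, HexDigit0, HexDigit1 }

    // Returns 0..15 for a hex digit, or -1 for anything else.
    private static int HexValue(char ch)
    {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    public static string Unescape(string s)
    {
        var sb = new StringBuilder();
        int hex = 0;                    // accumulator for the hex value
        State state = State.Unescaped;  // the only record of where we are

        foreach (char ch in s)
        {
            switch (state)
            {
                case State.Unescaped:
                    if (ch == '\\') state = State.Escaped;
                    else sb.Append(ch);
                    break;

                case State.Escaped:
                    if (ch == 'x')
                    {
                        hex = 0;
                        state = State.HexDigit0;
                    }
                    else if (ch == '\\' || ch == '<')
                    {
                        sb.Append(ch);
                        state = State.Unescaped;
                    }
                    else
                    {
                        throw new FormatException("Unrecognized escape sequence '\\" + ch + "'");
                    }
                    break;

                case State.HexDigit0:
                case State.HexDigit1:
                    int digit = HexValue(ch);
                    if (digit < 0)
                    {
                        throw new FormatException("Invalid hex digit '" + ch + "'");
                    }
                    hex = hex * 16 + digit;
                    if (state == State.HexDigit0)
                    {
                        state = State.HexDigit1;   // one more digit expected
                    }
                    else
                    {
                        sb.Append((char)hex);      // both digits read
                        state = State.Unescaped;
                    }
                    break;
            }
        }

        if (state != State.Unescaped)
        {
            throw new FormatException("Unterminated escape sequence");
        }
        return sb.ToString();
    }
}
```

With this sketch, HexUnescaper.Unescape(@"a\x41b") yields aAb, and the question's foo \\\<bar baz yields foo \<bar baz. An input that ends mid-escape, such as \x4, throws because the loop finishes in a state other than Unescaped.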