Giorgio Gambino - 4 months ago
PowerShell Question

Matching free() and malloc() calls with regular expressions

I'm creating a PowerShell script that parses a file containing C code and detects whether it contains calls to the free(), malloc() or realloc() functions.

file_one.c



int MethodOne()
{
return 1;
}
int MethodTwo()
{
free();
return 1;
}





file_two.c



int MethodOne()
{
//free();
return 1;
}
int MethodTwo()
{
free();
return 1;
}





check.ps1



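# One alternative per function: no "/" may appear between the start of the text and the call.
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"
$file_one = "Z:\PATH\file_one.txt"
$file_two = "Z:\PATH\file_two.txt"

# -Raw reads the whole file as a single string.
$contentOne = Get-Content $file_one -Raw
$contentOne -match $regex

$contentTwo = Get-Content $file_two -Raw
$contentTwo -match $regex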





Processing the whole file at once works for contentOne:
I get True (because of the free() in MethodTwo).
For contentTwo it returns False instead of True, even though MethodTwo
calls free() there too; only the call in MethodOne is commented out.


Can someone help me to write a better regex that works in both cases?
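
The difference can be reproduced without the files. This is a minimal sketch, assuming here-strings stand in for the two files' contents. With -Raw the content is a single string, so ^ matches only at its very start, and [^/]* cannot step over the slashes of the //free() comment that precedes the real call.

```powershell
# The regex from check.ps1, unchanged.
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"

# Hypothetical inline copies of the two files (normally read with Get-Content -Raw).
$contentOne = @'
int MethodOne()
{
return 1;
}
int MethodTwo()
{
free();
return 1;
}
'@

$contentTwo = @'
int MethodOne()
{
//free();
return 1;
}
int MethodTwo()
{
free();
return 1;
}
'@

$resultOne = $contentOne -match $regex   # True: no "/" anywhere before free(
$resultTwo = $contentTwo -match $regex   # False: [^/]* stops at the "//" comment

# How far ^[^/]* can reach in file two: it ends right before the first "/".
$reach = [regex]::Match($contentTwo, '^[^/]*').Value
$reachHasCall = $reach.Contains('free(')  # False: the real call lies beyond the comment
```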

sln
Answer

Sure, this is it

Raw:

^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())

Stringed (escaped for a C-style double-quoted string literal):

"^(?>(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n))|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\b(?:free|malloc|realloc)\\()[\\S\\s](?:(?!\\b(?:free|malloc|realloc)\\()[^/\"'\\\\])*))*(?:(\\bfree\\()|(\\bmalloc\\()|(\\brealloc\\())"

Verbatim (C# verbatim string literal):

@"^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:""[^""\\]*(?:\\[\S\s][^""\\]*)*""|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/""'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())"
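
The Stringed and Verbatim forms above are escaped for C-style and C#-style string literals, and neither is PowerShell syntax. For the script in the question, one option is a single-quoted here-string, which takes the Raw pattern unchanged. The sketch below assumes inline copies of the two sample files in place of Get-Content -Raw.

```powershell
# The "Raw" pattern, unchanged. A single-quoted here-string is literal,
# so neither the backslashes nor the quote characters need escaping.
$regex = @'
^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
'@

# Hypothetical inline copies of file_one and file_two.
$contentOne = @'
int MethodOne()
{
return 1;
}
int MethodTwo()
{
free();
return 1;
}
'@

$contentTwo = @'
int MethodOne()
{
//free();
return 1;
}
int MethodTwo()
{
free();
return 1;
}
'@

# Control case: the only call is commented out.
$contentNone = @'
int MethodOne()
{
//free();
return 1;
}
'@

$resultOne  = $contentOne  -match $regex   # True
$resultTwo  = $contentTwo  -match $regex   # True
$resultNone = $contentNone -match $regex   # False
```

Note that -match is case-insensitive; -cmatch would be the case-sensitive choice if Free( must not count.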

Explained

 ^ 
 (?>
      (?:                              # Comments 
           /\*                              # Start /* .. */ comment
           [^*]* \*+
           (?: [^/*] [^*]* \*+ )*
           /                                # End /* .. */ comment
        |  
           //                               # Start // comment
           (?:                              # Possible line-continuation
                [^\\] 
             |  \\ 
                (?: \r? \n )?
           )*?
           (?: \r? \n )                     # End // comment
      )
   |                                 # OR,

      (?:                              # Non - comments 
           "
           [^"\\]*                          # Double quoted text
           (?: \\ [\S\s] [^"\\]* )*
           "
        |  '
           [^'\\]*                          # Single quoted text
           (?: \\ [\S\s] [^'\\]* )*
           ' 
        |                                 # OR,

           (?!                              # ASSERT: here, the text cannot be free( / malloc( / realloc(
                \b 
                (?: free | malloc | realloc )
                \(
           )
           [\S\s]                           # Any char which could start a comment, string, etc..
                                            # (Technically, we're going past a C++ source code error)

           (?:                              # -------------------------
                (?!                              # ASSERT: here, the text cannot be free( / malloc( / realloc(
                     \b 
                     (?: free | malloc | realloc )
                     \(
                )

                [^/"'\\]                         # Char which doesn't start a comment, string, escape,
                                                 # or line continuation (escape + newline)
           )*                               # -------------------------
      )                                # Done Non - comments 
 )*

 (?:
      ( \b free\( )                    # (1), free(
   |  
      ( \b malloc\( )                  # (2), malloc(
   |  
      ( \b realloc\( )                 # (3), realloc(
 )

Some notes:

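Because of the ^ anchor, this only finds the first call, searching from the beginning of the string.
To find them all, replace the ^ with \G so that each match starts where the previous one ended.
Just removing the ^ also finds them, but the engine can then restart in the middle of a comment
or string and report a call that appears there.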

This works because it matches everything up to what you're looking for,
consuming comments and quoted strings as whole units along the way.
The call it found is in capture group 1 (free), 2 (malloc), or 3 (realloc).
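
A sketch of collecting every call and reading the capture groups in PowerShell. It makes one assumption beyond the notes above: the leading ^ is swapped for \G rather than dropped, so each match must begin where the previous one ended. With the anchor simply removed, the engine can restart in the middle of a comment and report a commented-out call that follows the last real one. The sample C text is hypothetical.

```powershell
# The answer's pattern with the leading ^ swapped for \G (an assumption of this sketch).
$patternAll = @'
\G(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
'@

# Hypothetical sample: three real calls plus calls hidden in comments and a string.
$code = @'
int MethodTwo()
{
char *p = malloc(10);   /* free( in a block comment */
p = realloc(p, 20);     // malloc( in a line comment
puts("free(p)");
free(p);
// free(p);
return 1;
}
'@

$names = 'free', 'malloc', 'realloc'   # names for capture groups 1, 2, 3

# Each match consumes the text up to one call; the group that succeeded tells which call.
$found = foreach ($m in [regex]::Matches($code, $patternAll)) {
    foreach ($i in 1..3) {
        if ($m.Groups[$i].Success) {
            [pscustomobject]@{ Function = $names[$i - 1]; Index = $m.Groups[$i].Index }
        }
    }
}

$found.Function -join ', '   # malloc, realloc, free
```

The Index property is the character offset of the call within $code, which can be turned into a line number by counting the newlines before it.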

Good Luck !!


What the regex contains:

----------------------------------
 * Format Metrics
----------------------------------
Atomic Groups       =   1

Cluster Groups      =   10

Capture Groups      =   3

Assertions          =   2
       ( ? !        =   2

Free Comments       =   25
Character Classes   =   12

Edit
Per request, here is an explanation of the part of the regex that handles
/* */ comments: /\*[^*]*\*+(?:[^/*][^*]*\*+)*/

This is a modified unrolled-loop regex that assumes an opening delimiter
of /* and a closing one of */.
Notice that the opening and closing delimiters share the character /
in their sequences.
To handle this without lookaround assertions, the trailing delimiter's
asterisk is shifted inside the loop.
With this factoring, all that's needed is to check for a closing /
to complete the delimited sequence.

 /\*              # Opening delimiter /*

 [^*]*            # Optionally, consume all non-asterisks

 \*+              # One or more asterisks are required here, or the match fails.
                  # They are matched before the loop so that, after each run of
                  # asterisks, only a closing / is left to check for.

 (?:              # The optional loop part
      [^/*]            # Exactly one character that is neither / nor an asterisk.
                       # A / at this point would be the closing delimiter, so it must be excluded.

      [^*]*            # Optional non-asterisks.
                       # This accepts a / because it is supposed to consume ALL
                       # opening delimiters as it goes,
                       # so the very next */ is treated as the close.

      \*+              # One or more asterisks are required here, or the match fails.
 )*               # Repeat 0 to many times.

 /                # Closing delimiter /
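
The breakdown above can be tried on its own. This is a small sketch that assumes a hypothetical one-line C fragment and uses only the comment sub-pattern, not the full regex.

```powershell
# The /* .. */ sub-pattern from the explanation, as a literal single-quoted string.
$blockComment = '/\*[^*]*\*+(?:[^/*][^*]*\*+)*/'

# Hypothetical input: extra asterisks, a lone / inside a comment, and a second
# opener inside a comment, which C does not treat as nesting.
$code = @'
int a; /* simple */ int b; /** doc **/ int c; /* a * b / c */ int d; /* nested /* opener */ int e;
'@

$comments = [regex]::Matches($code, $blockComment) | ForEach-Object { $_.Value }

$comments
# /* simple */
# /** doc **/
# /* a * b / c */
# /* nested /* opener */

# An opener that never sees a closing */ is not a complete comment.
$unclosed = '/*/ not closed' -match $blockComment   # False
```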