Fabio Fabio - 3 months ago 17
Vb.net Question

Remove tag from string variable using Regex

The application has a string variable that contains XML data.

I am trying to remove all <product_desc></product_desc> elements using Regex.

Here is the value of the string variable:

<orderlines>
<orderline>
<id>1000001</id>
<product_id>2004</product_id>
<product_desc>ITEM2004
Color: red
Size: 150x10x10
Material: iron
</product_desc>
<qnt>2</qnt>
</orderline>
<orderline>
<id>1000002</id>
<product_id>2012</product_id>
<product_desc>ITEM2012</product_desc>
<qnt>4</qnt>
</orderline>
<orderline>
<id>1000003</id>
<product_id>3000</product_id>
<product_desc>DELIVERY</product_desc>
<qnt>1</qnt>
</orderline>
</orderlines>


When I use the following pattern:

Dim pattern As String = "(<product_desc>[\s\S]*</product_desc>)"
Dim newvalue As String = Regex.Replace(originvalue, pattern, "")


I get a result like this:

<orderlines>
<orderline>
<id>1000001</id>
<product_id>2004</product_id>

<qnt>1</qnt>
</orderline>
</orderlines>


So the problem is that Regex matches everything between the first <product_desc> and the last </product_desc> and replaces it with an empty string. This removes all the <orderline> tags between them (check the value of the <qnt> tag).

Can anybody give a tip on how to limit the removal to only the specific tag? The content of the tag can contain any characters, newlines, and even HTML code.

Answer

The problem: [\s\S]* is greedy

It matches every single char to the end of the string, then the engine backtracks to allow </product_desc> to match. Therefore, there is one single match from the first opening tag to the last closing tag.
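The greedy behaviour described above can be observed on a small input. This is a minimal sketch: the shortened string and the module name GreedyDemo are made up for illustration and are not the asker's data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Hypothetical, shortened input with two product_desc elements.
        Dim xml As String = "<product_desc>ONE</product_desc><qnt>2</qnt>" &
                            "<product_desc>TWO</product_desc><qnt>4</qnt>"

        ' Same greedy pattern as in the question (without the capture group).
        Dim matches As MatchCollection =
            Regex.Matches(xml, "<product_desc>[\s\S]*</product_desc>")

        ' One match only, and it swallows the <qnt> element sitting between the two tags.
        Console.WriteLine(matches.Count)                             ' 1
        Console.WriteLine(matches(0).Value.Contains("<qnt>2</qnt>")) ' True
    End Sub
End Module
```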

The solution (if we're doing regex): a lazy quantifier

With all the warnings and disclaimers about using regex to parse xml... You can do this:

  • Adding a ? to a quantifier makes it "lazy", so that it matches only as many chars as necessary.
  • You can use .*? in DOTALL mode (RegexOptions.Singleline in .NET, enabled inline with (?s) as in the sample code below) or [\s\S]*? (equivalent, but there is no point once DOTALL is on).
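To check the two bullets above, the following sketch counts matches for the lazy variants on an input whose first element spans two lines. The input string and the module name LazyVariants are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LazyVariants
    Sub Main()
        ' Hypothetical input: the first product_desc contains a newline.
        Dim xml As String = "<product_desc>ITEM2004" & vbLf & "Color: red</product_desc>" &
                            "<qnt>2</qnt><product_desc>ITEM2012</product_desc>"

        ' Lazy dot with the inline (?s) flag, so . also matches newlines.
        Dim dotAll As Integer = Regex.Matches(xml, "(?s)<product_desc>.*?</product_desc>").Count
        ' Lazy character class; needs no flag because [\s\S] already covers newlines.
        Dim charClass As Integer = Regex.Matches(xml, "<product_desc>[\s\S]*?</product_desc>").Count
        ' Lazy dot without the flag: . cannot cross the newline in the first element.
        Dim noDotAll As Integer = Regex.Matches(xml, "<product_desc>.*?</product_desc>").Count

        Console.WriteLine(dotAll)     ' 2
        Console.WriteLine(charClass)  ' 2
        Console.WriteLine(noDotAll)   ' 1
    End Sub
End Module
```

Without (?s) the multi-line element is silently skipped rather than reported as an error, so forgetting the flag is easy to overlook.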

Sample code

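' Requires: Imports System.Text.RegularExpressions
Dim ResultString As String
Try
    ' (?s) lets . match newlines; the lazy .*? stops at the first closing tag
    ResultString = Regex.Replace(SubjectString, "(?s)<product_desc>.*?</product_desc>", "")
Catch ex As ArgumentException
    'Syntax error in the regular expression
End Try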
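Given the warnings about parsing XML with regex, a parser-based alternative may be worth considering. The sketch below uses LINQ to XML and is not part of the original answer. It assumes the string is well-formed XML, which may not hold if the descriptions contain unescaped HTML: XDocument.Parse throws an XmlException in that case. The shortened input and the module name are made up; originvalue and newvalue follow the names in the question.

```vbnet
Imports System
Imports System.Xml.Linq

Module RemoveWithXmlParser
    Sub Main()
        ' Hypothetical, shortened version of the XML from the question.
        Dim originvalue As String = "<orderlines><orderline><id>1000001</id>" &
            "<product_desc>ITEM2004</product_desc><qnt>2</qnt></orderline></orderlines>"

        ' Parse the string; this throws XmlException if the input is not well-formed.
        Dim doc As XDocument = XDocument.Parse(originvalue)

        ' Remove every product_desc element, at any depth, from the tree.
        doc.Descendants("product_desc").Remove()

        ' Serialize back to a string without adding indentation.
        Dim newvalue As String = doc.ToString(SaveOptions.DisableFormatting)
        Console.WriteLine(newvalue)
        ' <orderlines><orderline><id>1000001</id><qnt>2</qnt></orderline></orderlines>
    End Sub
End Module
```

If the input cannot be guaranteed to be well-formed, the lazy regex remains the more forgiving option.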
