blackbishop blackbishop - 5 months ago 14
SQL Question

Extract data from xml stored as nvarchar

I have to extract email

Subject
from an
XML
stored as
nvarchar
.

I'm using this query :

SELECT rtrim((SELECT CAST(
SUBSTRING(
[XML]
,patindex('%<SUBJECT>%', [XML])
,patindex('%</SUBJECT>%', [XML])-patindex('%<SUBJECT>%', [XML])+len('</SUBJECT>')
) as XML).value('(/SUBJECT/OPTION)[1]','nvarchar(2000)')
FROM dbo.Mails
)) as SUBJECT


What this query do is extracting at first the string which contains email subject (
<SUBJECT>....</SUBJECT>
) which I convert to XML and then get the Subject value using
value
function of XML.

This works fine, but in some cases the XML is not well formed and the parsing fails. Example :

DECLARE @XMLData XML = '<SUBJECT>
<OPTION CONSTRAINT="MASTER.IN_TITLE = '' OR MASTER.IN_LASTNAME = ''"><![CDATA[Découvrez nos offres de location]]>
</OPTION>
<OPTION CONSTRAINT="IN_TITLE <> '' AND MASTER.IN_LASTNAME <> ''"><![CDATA[~IN_TITLE~ ~IN_LASTNAME~, découvrez nos offres de location]]>
</OPTION>
</SUBJECT>'

select rtrim((@XMLData).value('(/SUBJECT/OPTION)[1]','nvarchar(2000)') )


Here in the
Option
attribute
Constraint
I have a special character
<
, If I try to escape this character it escapes all other character and I lost the XML structure. So how to escape it?

Another example is this :

DECLARE @XMLData XML = '<SUBJECT>
<OPTION NAME="DEFAULT"><![CDATA[~(IF((IN_TITLE<>'' AND IN_LASTNAME<>''),IN_TITLE&' '&IN_LASTNAME&',',''))~ nos plus belles réalisations de 2015]]>
</OPTION>
</SUBJECT>'

select rtrim((@XMLData).value('(/SUBJECT/OPTION)[1]','nvarchar(2000)') )


Here I got error near
&IN_LASTNAME&
but I thought we don't need to escape special character in
CDATA
!

Does anyone have a solution to avoid these erros in parsing?

Answer

How are these XMLs generated? Is this under your control? The three characters of evil "<, > and &" must be treated specially, either CDATA or escaped. If the XML is generated properly it should not be possible to get them in forbidden places...

Here are two working examples. The second is identical with Rhys Jones... In the first example I replace the "<>" with &lt;&gt;.

Btw: As you are dealing with other special characters obviously, you should mark your strings with N'string' to read it as unicode.

DECLARE @XMLData XML =  REPLACE(N'<SUBJECT>
      <OPTION CONSTRAINT="MASTER.IN_TITLE = ''  OR  MASTER.IN_LASTNAME = ''"><![CDATA[Découvrez nos offres de location]]>
      </OPTION>
      <OPTION CONSTRAINT="IN_TITLE <> '' AND MASTER.IN_LASTNAME <> ''"><![CDATA[~IN_TITLE~ ~IN_LASTNAME~, découvrez nos offres de location]]>
      </OPTION>
   </SUBJECT>','<>','&lt;&gt;');

select rtrim((@XMLData).value('(/SUBJECT/OPTION)[1]','nvarchar(2000)') );
GO

DECLARE @XMLData XML =  N'<SUBJECT>
      <OPTION NAME="DEFAULT"><![CDATA[~(IF((IN_TITLE<>'''' AND IN_LASTNAME<>''''),IN_TITLE&'' ''&IN_LASTNAME&'','',''''))~ nos plus belles réalisations de 2015]]>
      </OPTION>
   </SUBJECT>'

select rtrim((@XMLData).value('(/SUBJECT/OPTION)[1]','nvarchar(2000)') )
Comments