mistadobalina mistadobalina - 2 months ago 7
SQL Question

Using SQL to retrieve XML attribute values

Let's say I've got a document that has a corresponding XML metadata file as shown below. This XML file contains index fields pertaining to the document:

<Document>
<Indices>
<IndexField>
<indexName>DOCID</indexName>
<indexValue>49626502</indexValue>
</IndexField>
<IndexField>
<indexName>EMPLOYEEID</indexName>
<indexValue>248572405</indexValue>
</IndexField>
<IndexField>
<indexName>LASTNAME</indexName>
<indexValue>BROWN</indexValue>
</IndexField>
<IndexField>
<indexName>FIRSTNAME</indexName>
<indexValue>RALPH</indexValue>
</IndexField>
<IndexField>
<indexName>CITY</indexName>
<indexValue>PORTLAND</indexValue>
</IndexField>
<IndexField>
<indexName>STATE</indexName>
<indexValue>OR</indexValue>
</IndexField>
</Indices>
</Document>


I've loaded the XML file into a SQL table where I will then extract the attribute values and load them into another table. I've got thousands of these files. The way the metadata files were generated, if the source system did not have a field populated, let's say CITY or STATE, the XML tag was not created in the file. The challenge I am running into is the metadata files lack uniformity or consistency because one may have more index values than another (based on what was populated vs. blank in the source system).

Here is how I am extracting the attributes to load into a different table:

SELECT

DOCID = CASE WHEN XMLDATA.exist('/Document/Indices/IndexField[indexName="DOCID"]') = 1 then XMLData.value('(//*[local-name()="indexValue"])[1]','varchar(max)') else NULL end,

EMPLOYEEID = CASE WHEN XMLDATA.exist('/Document/Indices/IndexField[indexName="EMPLOYEEID"]') = 1 then XMLData.value('(//*[local-name()="indexValue"])[2]','varchar(max)') else NULL end,

LASTNAME = CASE WHEN XMLDATA.exist('/Document/Indices/IndexField[indexName="LASTNAME"]') = 1 then XMLData.value('(//*[local-name()="indexValue"])[3]','varchar(max)') else NULL end,

FIRSTNAME = CASE WHEN XMLDATA.exist('/Document/Indices/IndexField[indexName="FIRSTNAME"]') = 1 then XMLData.value('(//*[local-name()="indexValue"])[4]','varchar(max)') else NULL end


For each field, I am first running this to make sure it exists in the XML file:

XMLDATA.exist('/Document/Indices/IndexField[indexName="DOCID"]') = 1

Then, I am pulling a positional value for indexValue:

XMLData.value('(//*[local-name()="indexValue"])[1]','varchar(max)')

The issue I am running into is if there are XML tags missing from the file, it throws off the positional indexValue for subsequent fields.

My question is- based on the XML format provided, how can I extrapolate the indexValue for a given indexName?

Answer

My suggestion: Use old-fashionded-pivot with GROUP BY and MAX(). Missing values will just appear as NULL.

The CTE DervivedTable will first create a normal table with row-wise data. The rest is kind of pivot:

DECLARE @tbl TABLE(ID INT,XmlData XML);
INSERT INTO @tbl VALUES
(1,'<Document>
    <Indices>
        <IndexField>
            <indexName>DOCID</indexName>
            <indexValue>49626502</indexValue>
        </IndexField>
        <IndexField>
            <indexName>EMPLOYEEID</indexName>
            <indexValue>248572405</indexValue>
        </IndexField>
        <IndexField>
            <indexName>LASTNAME</indexName>
            <indexValue>BROWN</indexValue>
        </IndexField>
        <IndexField>
            <indexName>FIRSTNAME</indexName>
            <indexValue>RALPH</indexValue>
        </IndexField>
        <IndexField>
            <indexName>CITY</indexName>
            <indexValue>PORTLAND</indexValue>
        </IndexField>
        <IndexField>
            <indexName>STATE</indexName>
            <indexValue>OR</indexValue>
        </IndexField>
    </Indices>
</Document>')
,(2,'<Document>
    <Indices>
        <IndexField>
            <indexName>DOCID</indexName>
            <indexValue>2222 id</indexValue>
        </IndexField>
        <IndexField>
            <indexName>EMPLOYEEID</indexName>
            <indexValue>2222 emp</indexValue>
        </IndexField>
        <IndexField>
            <indexName>LASTNAME</indexName>
            <indexValue>222 last</indexValue>
        </IndexField>
        <IndexField>
            <indexName>FIRSTNAME</indexName>
            <indexValue>222 first</indexValue>
        </IndexField>
        <IndexField>
            <indexName>CITY</indexName>
            <indexValue>222 city</indexValue>
        </IndexField>
        <IndexField>
            <indexName>STATE</indexName>
            <indexValue>222 state</indexValue>
        </IndexField>
    </Indices>
</Document>');

--The query

WITH DerivedTable AS
(
    SELECT ID
          ,f.value('indexName[1]','nvarchar(max)') AS indexName
          ,f.value('indexValue[1]','nvarchar(max)') AS indexValue
    FROM @tbl AS tbl
    CROSS APPLY tbl.XmlData.nodes('/Document/Indices/IndexField') AS A(f)
)
SELECT ID 
      ,MAX(CASE WHEN indexName='DOCID' THEN indexValue END) AS DOCID
      ,MAX(CASE WHEN indexName='EMPLOYEEID' THEN indexValue END) AS EMPLOYEEID
      ,MAX(CASE WHEN indexName='LASTNAME' THEN indexValue END) AS LASTNAME
      ,MAX(CASE WHEN indexName='FIRSTNAME' THEN indexValue END) AS FIRSTNAME
      ,MAX(CASE WHEN indexName='CITY' THEN indexValue END) AS CITY
      ,MAX(CASE WHEN indexName='STATE' THEN indexValue END) AS [STATE]
FROM DerivedTable
GROUP BY ID

The result

+----+----------+------------+----------+-----------+----------+-----------+
| ID | DOCID    | EMPLOYEEID | LASTNAME | FIRSTNAME | CITY     | STATE     |
+----+----------+------------+----------+-----------+----------+-----------+
| 1  | 49626502 | 248572405  | BROWN    | RALPH     | PORTLAND | OR        |
+----+----------+------------+----------+-----------+----------+-----------+
| 2  | 2222 id  | 2222 emp   | 222 last | 222 first | 222 city | 222 state |
+----+----------+------------+----------+-----------+----------+-----------+

UPDATE

You would reach the same with normal PIVOT:

WITH DerivedTable AS
(
    SELECT ID
          ,f.value('indexName[1]','nvarchar(max)') AS indexName
          ,f.value('indexValue[1]','nvarchar(max)') AS indexValue
    FROM @tbl AS tbl
    CROSS APPLY tbl.XmlData.nodes('/Document/Indices/IndexField') AS A(f)
)
SELECT p.*
FROM
(
    SELECT * FROM DerivedTable
) AS tbl
PIVOT
(
    MAX(indexValue) FOR indexName IN(DOCID,EMPLOYEEID,LASTNAME,FIRSTNAME,CITY,STATE)
) AS p