StoutPanda - 5 months ago
Ruby Question

Nokogiri / XML - Pulling Data from tags based on other tags?

I have the following example document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>Doe</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>123456790</irs:SSN>
<MailingAddressGrp>
<USAddressGrp>
<AddressLine1Txt>Test</AddressLine1Txt>
<irs:CityNm>test</irs:CityNm>
<USStateCd>TX</USStateCd>
<irs:USZIPCd>test</irs:USZIPCd>
</USAddressGrp>
</MailingAddressGrp>
</EmployeeInfoGrp>
<ALEContactPhoneNum>test</ALEContactPhoneNum>
<EmployeeOfferAndCoverageGrp>
<AnnualOfferOfCoverageCd>1A</AnnualOfferOfCoverageCd>
<MonthlyOfferCoverageGrp/>
<AnnlShrLowestCostMthlyPremAmt/>
<MonthlyShareOfLowestCostMonthlyPremGrp/>
<AnnualSafeHarborCd>2C</AnnualSafeHarborCd>
<MonthlySafeHarborGrp/>
</EmployeeOfferAndCoverageGrp>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>DOE</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>222222222</irs:SSN>
<MailingAddressGrp>
<USAddressGrp>
<AddressLine1Txt>Test</AddressLine1Txt>
<irs:CityNm>test</irs:CityNm>
<USStateCd>TX</USStateCd>
<irs:USZIPCd>test</irs:USZIPCd>
</USAddressGrp>
</MailingAddressGrp>
</EmployeeInfoGrp>
<ALEContactPhoneNum>test</ALEContactPhoneNum>
<EmployeeOfferAndCoverageGrp>
<AnnualOfferOfCoverageCd>1A</AnnualOfferOfCoverageCd>
<MonthlyOfferCoverageGrp/>
<AnnlShrLowestCostMthlyPremAmt/>
<MonthlyShareOfLowestCostMonthlyPremGrp/>
<AnnualSafeHarborCd>2C</AnnualSafeHarborCd>
<MonthlySafeHarborGrp/>
</EmployeeOfferAndCoverageGrp>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>




Using Nokogiri, I would like to pull out the values of the PersonFirstNm, PersonLastNm and irs:SSN tags for each Form1095CUpstreamDetail, based on its RecordId.

I've tried removing namespaces as well. I have posted a small snippet as an example of what I started with, but I have tried many variations of walking the XML without managing to get the fields I need based on the record id. (This is my first time using XML, so I am likely missing something easy, but I have read through the Nokogiri docs several times without improving my understanding.)

Whenever I set my XPath like this:

require 'nokogiri'
submission_doc = Nokogiri::XML(File.open('1094C_Request.xml'))
# remove_namespaces! strips the namespaces from the document in place
submission_doc.remove_namespaces!
nodes = submission_doc.xpath('//Form1095CUpstreamDetail')


I do not seem to have any association between the RecordId and the tags mentioned above, and I am stuck on where to go next.

The fields are not children of the RecordId, so I can't think of how to approach obtaining their values. I am including the full document as an example to make sure I am not leaving anything out.

(Ultimately, I have an array of values, and I would like to pull the three tags mentioned above if the RecordId is contained within the array of numbers.)

Answer

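First of all, the XML validator reports this error:

The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:us:gov:treasury:irs:ext:aca:air:7.0'.

So in an XPath query, every element that lives in the document's default namespace has to be addressed through a prefix. A bare name such as //Form1095CUpstreamDetail only matches elements that have no namespace at all.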

You can use this code.

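require 'nokogiri'

doc = Nokogiri::XML(File.read('1094C_Request.xml'))

# Nokogiri registers the namespaces declared on the root element for you;
# the default one is reachable in XPath under the prefix "xmlns".
details = doc.xpath('//xmlns:Form1095CUpstreamDetail')

# Tags to extract; irs:SSN already carries its own prefix.
elem_a = ["PersonFirstNm", "PersonLastNm", "irs:SSN"]

# Build { record_id => { tag => text } } for every detail record.
output = details.each_with_object({}) do |element, exp|
  exp[element.xpath('./xmlns:RecordId').text] = elem_a.each_with_object({}) do |elem_n, exp_h|
    name = elem_n.include?(':') ? elem_n : "xmlns:#{elem_n}"
    exp_h[elem_n] = element.xpath(".//#{name}").text
  end
end

p output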
# {
#   "1" => {"PersonFirstNm" => "JOHN", "PersonLastNm" => "Doe", "irs:SSN" => "123456790"},
#   "2" => {"PersonFirstNm" => "JANE", "PersonLastNm" => "DOE", "irs:SSN" => "222222222"}
# }

I hope this helps
