richiepop2 richiepop2 - 4 years ago 165
R Question

XML to data frame - R

I'm having difficulty getting a XML file into R as a data frame. The data I working with is free to download from The Securities and Exchange Commission and can be found HERE.

The first downloadable option

SEC Investment Advisor Report (37 MB)
is the one I'll use as an example. After I download the file, I unzip it in my terminal (mac) with the following code:
gunzip -d IA_FIRM_SEC_Feed_02_05_2017.xml.gz
. This unzips into XML format. And from here, I've been struggling to get this into an R data frame. I've tried different variations of code using the XML package, but can only get one variable or two variables into the data frame. Here is an example of what the data looks like when I open the XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDFirmSECReport GenOn="2017-02-05">
<Firms>
<Firm>
<Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>
<MainAddr Strt1="280 S 400 W" Strt2="SUITE 220" City="SALT LAKE CITY" State="UT" Cntry="United States" PostlCd="84101" PhNb="801.746.7171" FaxNb="801.363.5886"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2004-04-07"/>
<NoticeFiled>
<States RgltrCd="AZ" St="FILED" Dt="2009-12-10"/>
<States RgltrCd="UT" St="FILED" Dt="2009-12-10"/>
</NoticeFiled>
<Filing Dt="2016-11-09" FormVrsn="10/2012"/>
<FormInfo>
<Part1A>
<Item1 Q1I="Y" Q1M="N" Q1N="N" Q1O="N">
<WebAddrs>
<WebAddr>HTTP://WWW.JACOBSENCAPITAL.COM</WebAddr>
</WebAddrs>
</Item1>
<Item2A Q2A1="Y" Q2A2="N" Q2A3="N" Q2A4="N" Q2A5="N" Q2A6="N" Q2A7="N" Q2A8="N" Q2A9="N" Q2A10="N" Q2A11="N" Q2A12="N" Q2A13="N"/>
<Item2B/>
<Item3A OrgFormNm="Limited Liability Company"/>
<Item3B Q3B="NOVEMBER"/>
<Item3C StateCD="UT" CntryNm="United States"/>
<Item5A TtlEmp="4"/>
<Item5B Q5B1="2" Q5B2="0" Q5B3="4" Q5B4="0" Q5B5="0" Q5B6="0"/>
<Item5C Q5C1="26-100" Q5C2="0"/>
<Item5D Q5D1A="Up to 10 percent" Q5D1B="76-99 percent" Q5D1C="0 percent" Q5D1D="0 percent" Q5D1E="0 percent" Q5D1F="0 percent" Q5D1G="Up to 10 percent" Q5D1H="Up to 10 percent" Q5D1I="0 percent" Q5D1J="0 percent" Q5D1K="0 percent" Q5D1L="0 percent" Q5D1M="0 percent" Q5D2A="Up to 25 percent" Q5D2B="More than 75 percent" Q5D2C="0 percent" Q5D2D="0 percent" Q5D2E="0 percent" Q5D2F="0 percent" Q5D2G="Up to 25 percent" Q5D2H="Up to 25 percent" Q5D2I="0 percent" Q5D2J="0 percent" Q5D2K="0 percent" Q5D2L="0 percent" Q5D2M="0 percent"/>
<Item5E Q5E1="Y" Q5E2="Y" Q5E3="N" Q5E4="Y" Q5E5="N" Q5E6="N" Q5E7="N"/>
<Item5F Q5F1="Y" Q5F2A="190000000" Q5F2B="0" Q5F2C="190000000" Q5F2D="342" Q5F2E="0" Q5F2F="342"/>
<Item5G Q5G1="Y" Q5G2="Y" Q5G3="N" Q5G4="N" Q5G5="N" Q5G6="Y" Q5G7="Y" Q5G8="N" Q5G9="N" Q5G10="N" Q5G11="N" Q5G12="N"/>
<Item5H Q5H="26-50"/>
<Item5I Q5I1="N" Q5I2="N"/>
<Item5J Q5J="N"/>
<Item6A/>
<Item6B Q6B1="N" Q6B3="N"/>
<Item7A Q7A1="N" Q7A2="N" Q7A3="N" Q7A4="N" Q7A5="N" Q7A6="N" Q7A7="N" Q7A8="N" Q7A9="N" Q7A10="Y" Q7A11="N" Q7A12="N" Q7A13="N" Q7A14="N" Q7A15="N" Q7A16="N"/>
<Item7B Q7B="N"/>
<Item8A Q8A1="N" Q8A2="Y" Q8A3="N"/>
<Item8B Q8B1="N" Q8B2="N" Q8B3="N"/>
<Item8C Q8C1="Y" Q8C2="Y" Q8C3="N" Q8C4="N"/>
<Item8D/>
<Item8E Q8E="N"/>
<Item8F/>
<Item8G Q8G1="N"/>
<Item8H Q8H="N"/>
<Item8I Q8I="N"/>
<Item9A Q9A1A="N" Q9A1B="N"/>
<Item9B Q9B1A="N" Q9B1B="N"/>
<Item9C Q9C1="N" Q9C2="N" Q9C3="N" Q9C4="N"/>
<Item9D Q9D1="N" Q9D2="N"/>
<Item9E/>
<Item9F Q9F="1"/>
<Item10A Q10A="N"/>
<Item11 Q11="N"/>
<Item11A Q11A1="N" Q11A2="N"/>
<Item11B Q11B1="N" Q11B2="N"/>
<Item11C Q11C1="N" Q11C2="N" Q11C3="N" Q11C4="N" Q11C5="N"/>
<Item11D Q11D1="N" Q11D2="N" Q11D3="N" Q11D4="N" Q11D5="N"/>
<Item11E Q11E1="N" Q11E2="N" Q11E3="N" Q11E4="N"/>
<Item11F Q11F="N"/>
<Item11G Q11G="N"/>
<Item11H Q11H1A="N" Q11H1B="N" Q11H1C="N" Q11H2="N"/>
</Part1A>
</FormInfo>
</Firm>
<Firm>


It looks like the variables are listed directly before the equals sign and the content(row item) is listed directly after the equals sign. For example in the line:
<Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>


I would want the data frame to look as follows:

SECRgnCD FirmCrdNB SECNb BusNm
DRO 116085 801-63010 JACOBSEN CAPITAL MANAGEMENT


Thanks for your help.

Answer Source

The following code should work;

 require(XML)
 data <- xmlParse('./IA_FIRM_SEC_Feed_02_05_2017.xml')
 result <- as.data.frame(t(xmlSApply(data["//Firms/Firm/Info"],xmlAttrs)),
                    stringsAsFactors=FALSE)
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download