revolution9540 - 2 months ago
Perl Question

Scripted downloading using Perl from SEC website producing encrypted documents unexpectedly

I have a script written in Perl that retrieves "Complete submission text files" of company filings such as 8-K and 10-K statements.

Even though these files are .txt files, they are actually HTML files. Here is an example of a header of one of these particular files:

<SEC-DOCUMENT>0001104659-12-008133.txt : 20120209
<SEC-HEADER>0001104659-12-008133.hdr.sgml : 20120209
<ACCEPTANCE-DATETIME>20120209153739


This particular file is located here: https://www.sec.gov/Archives/edgar/data/1001838/000110465912008133/0001104659-12-008133.txt

As you can see, when you click on the file you can read the contents. However, the Perl script that downloads this file produces a file with contents that look something like this:


±šÜϦ¨^Ÿ¨T¡š¯À½ä¾ ‹™R¶¦
[T)S,”à맖f+™•µ˜¨ÀáÊ’jv–Ié·T¡­Š¥é-“RZujKk)­š¦4¨”Þªkä‹·nðü³²¥XjSmYÎõ™?÷Úf™S[Ù}hV¢LšDÆ.zjyßýµ“Èo¿øÒ¾²›šúOª+&ò'äòDì?Sÿyúà¹ÛùDŽ\üŸr¡ZÙÉ”òðU†þ”Š+ä
¯RbZŸÎS{™òc‚ü"ð‹2@˜N”Òä'Õr¥”…_§ö€ÁV6Õín.SOsÙ|6¿MÁµì\™-•ŸqýÏ
ȈeâsSè!t/›'-)”zãöŸt¾ÞÊ–SpÓL¢Deòiò9Ç túNžq˜Ø¯fÊ1òSèg—
6àÞ™•Hü¹¨ R‰šÕýŒÛSƒ !NdeÙ~Ü~î¹¥/EäH˜…§%«ålJ%ÒéüµÆ
“©Pù)á8
tÚ¡ GÕ•Ü?L?ÿ†‡ß”«@Š—X– w¶bpq§Ég{#A>JœÂ»Ól‘¼ŽH,OÆþéùðû<ù•Ìò!1*†$¾ŸúO.‘Ýÿ™¶MÙoɦœýô§^²ï‰{@¼êk¦»D
?}";Êl˜—¿þ˜Â~š&Uødz$`¬<µ¶Ò'æ¦Öž$äÔ¡×:!ƒj$Ú:7µFèDî´®€ eŽ¿â"¿˜{« ×fÊ©R¶H8í\WÉæ¦ÖàëÜñ^Uêðž¢ÖVœ—k7zý‘j(­»øL&ª–g೺ö›2­Ç†Ÿ¹Õ[VèVijÇXEkª&•W¨’ÞTZ«¯~z£uÕÝè–¥7c¦ÞÐêÔƒÖªë–
rcضµÚýQC½µb-½¥ö>0´»û—ŸXzûï›öMÚJ½®µîbö-)VkQO¯ìFµ_6ɹmS1î´VŒüšüŸ!¦ï^εk¤¿”©ý\ÌÍP·J
^¼ê÷Ì‹±!¿u—/´nÌöêÚ ùf}m¥
mY²›¤4´»V|¦Y5žîñN«HgCö1ç‚Wmã?Ñ6öýÆ¢U7/&}¢a»aöªÿÞ¨Ã=«y°æ1P^»1Ö)˜!ÕR¶’


I am wondering if this is due to encryption happening on their end. Nothing on their website suggests that files are encrypted in transit, and I do not think this is the case anyway, since I can download the file fine manually in my browser.

Here is a segment of the Perl code that is responsible for downloading the file and saving it to the hard drive:

@arraydata = split(/\,/, $datagn[$j]);
if($arraydata[2] =~ m/8K/ || $arraydata[2] =~ /8\-K/){

    # Starts crawler, not checking for errors
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    # Grabs address
    @arraydatad = split(/\//, $arraydata[4]);

    # Formats output file name
    $filenamea = "Reports\\" . $dirname . "\\" . $cik . "\\" . $arraydatad[3];
    chomp($filenamea);

    # This is the file from the EDGAR archives
    $filecrawl = "https://www.sec.gov/Archives/" . $arraydata[4];

    # This crawls the file and saves it to the hard drive
    $mech->get($filecrawl, ':content_file' => $filenamea);
}


The full code is here: http://pastebin.com/QXb1zcdv

Does anybody have an idea of why I am getting a file with nonsense in it when I am downloading with a Perl script from SEC.gov?

Answer

:content_file dumps the raw content to the specified file. If you print the HTTP response headers with print $ua->res->headers->as_string, you will see a Content-Encoding: gzip header. This means the content is compressed with gzip, so what you saved is not encrypted, just compressed. A quick gzip -dc on your saved file will give you the decompressed document.
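You can verify this yourself without touching the Perl script. The snippet below simulates the situation (the file name and sample header line are stand-ins, not real SEC data): a gzip-compressed file on disk looks like gibberish, but gzip -dc recovers the original text.

```shell
# Stand-in for a filing saved by :content_file: raw gzip bytes on disk
printf '<SEC-DOCUMENT>sample.txt : 20120209\n' | gzip -c > saved.txt

# saved.txt is unreadable as-is, but gzip -dc recovers the original text
gzip -dc saved.txt > decoded.txt
head -1 decoded.txt   # prints the readable SEC header line
```

Running gzip -dc on the file your script saved should likewise turn the binary noise back into the HTML you see in the browser.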

As ikegami mentioned in the comments, you should use decoded_content instead to access the data. As the name implies, this will not give you the raw content but will apply the necessary decoding, in this case the decompression:

my $ua = WWW::Mechanize->new();
$ua->get('https://www.sec.gov/Archives/edgar/data/1001838/000110465912008133/0001104659-12-008133.txt');
print $ua->res->decoded_content;
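To see why content and decoded_content differ without hitting the network, here is a minimal self-contained sketch: it hand-builds an HTTP::Response that mimics EDGAR's gzip-encoded reply (the body text is a made-up sample, not real SEC data) and compares the two accessors.

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip);

# Fake a gzip-encoded response like the one EDGAR returns
my $body = "<SEC-DOCUMENT>sample.txt : 20120209\n";
gzip \$body => \my $gzipped or die "gzip failed";

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain', 'Content-Encoding' => 'gzip' ],
    $gzipped,
);

# content() returns the raw gzip bytes; decoded_content() gunzips them
print length($res->content), " raw bytes\n";
print $res->decoded_content;   # readable text again
```

In the original script, one fix along these lines is to drop ':content_file' and write $mech->res->decoded_content to the output file yourself after the get call.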