mseancole - 3 months ago
PHP Question

Cross-reference streams are not supported yet

I'm new to the Zend Framework so my apologies if I'm missing something simple. However, I would have thought that code taken directly from the documentation would work. Instead I'm getting an uncaught exception.

Fatal error: Uncaught exception 'Zend_Pdf_Exception' with message 'Cross-reference streams are not supported yet.' in C:\xampp\php\zend\library\Zend\Pdf\Parser.php:318
Stack trace:
#0 C:\xampp\php\zend\library\Zend\Pdf\Parser.php(460): Zend_Pdf_Parser->_loadXRefTable('116')
#1 C:\xampp\php\zend\library\Zend\Pdf.php(318): Zend_Pdf_Parser->__construct('PDF/Current...', Object(Zend_Pdf_ElementFactory_Proxy), true)
#2 C:\xampp\php\zend\library\Zend\Pdf.php(267): Zend_Pdf->__construct('PDF/Current...', NULL, true)
#3 C:\xampp\htdocs\test\test.php(7): Zend_Pdf::load('PDF/Current...')
#4 {main}
thrown in C:\xampp\php\zend\library\Zend\Pdf\Parser.php on line 318


I've been reading around looking for a possible solution to this, but have had little luck. The most similar question I found does not solve my problem. From what I've read there, and from other sources, PDF versions 1.4 and older should work fine, but that is not the case here, and that post is years old. My PDFs are all version 1.4, so I'm not even sure how accurate it is anyway. The code works for the PDF included in the demo, but not for any of the existing ones I'm trying to use. I would upload the PDFs, but they are all confidential.

I'm only trying to get the metadata, but I am not even able to load the document. I started using a framework so I wouldn't have to create my own parser. If there is a simpler way to do this, or if someone can shed some light on this, I would be much obliged.

Edit: For clarification, I've tried both methods from the linked documentation page. Neither works.

Answer

I ended up having to create my own parser for this. If anyone finds this and has any further suggestions or questions about how I did it, just add a comment.

Solution

I'm not going to upload the whole code, as it's really long, very messy, and inefficient. I've grown a bit as a developer since the initial post and have been meaning to go back and take another swing at it. So I'll use this post to explain what I have, point out some of the problems and solutions I have found, and make some comments on how to make it more efficient. Hopefully this will make it easier for you, and hopefully it will inspire me to make some changes. Disclaimer: It has been months since I last looked at this code, so don't expect me to remember everything. However, I was pretty good about documenting my code and findings (for once), so what I'm not remembering is mostly minor.

The most important thing I can tell you is to look at the raw XML, take notes, and compare a few of your files. Adobe apparently couldn't make up their mind when creating the metadata syntax, so you will end up having to add multiple checks for all the different revisions (I'll give an example later). Actually finding the metadata in the document is pretty easy. Adobe gives you a nice set of begin/end tags, so you just iterate over the document until you find them. Here's a cleaned-up and generalized sample from one of the PDFs I'm parsing.

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04        ">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
            <dc:format>application/pdf</dc:format>
            <dc:title>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">Title of Document</rdf:li>
                </rdf:Alt>
            </dc:title>
            <dc:creator>
                <rdf:Seq>
                    <rdf:li>Creator of Document (Not author)</rdf:li>
                </rdf:Seq>
            </dc:creator>
            <dc:description>
                <rdf:Alt>
                    <rdf:li xml:lang="x-default">Short description</rdf:li>
                </rdf:Alt>
            </dc:description>
        </rdf:Description>
        <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
            <xmp:CreateDate>2004-01-27T16:36:09Z</xmp:CreateDate>
            <xmp:CreatorTool>FrameMaker 7.0</xmp:CreatorTool>
            <xmp:ModifyDate>2012-02-20T15:55:19Z</xmp:ModifyDate>
        </rdf:Description>
        <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
            <pdf:Producer>Acrobat Distiller 9.4.5 (Windows)</pdf:Producer>
        </rdf:Description>
        <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
            <xmpMM:DocumentID>uuid:4eae0fcf-f493-4773-9473-f81c7491e8aa</xmpMM:DocumentID>
            <xmpMM:InstanceID>uuid:98209926-ba98-4ac7-a5f7-050050048f5d</xmpMM:InstanceID>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

The best way to view the raw XML data is to download Notepad++ (though any Notepad-like editor will do) and open the PDFs in that. The first thing you will see is the PDF version, "%PDF-1.4" in this case, followed by a lot of confusing-looking characters. Ignore those, but note the PDF version. Notice the "xpacket" tags in the sample above; they are what you need to look for every time you want to find the metadata. Just Ctrl+F for "xmpmeta"; the first occurrence should be your metadata. Word of caution: don't attempt to use password-protected documents. Everything in them is obfuscated, including the metadata, which means PHP can't read it either. I believe there is an option to allow the metadata to be read in password-protected PDFs, but I can't remember for sure, nor do I know whether it actually works for PHP.
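Since the version is on the first line, it can also be read from PHP. This is only a minimal sketch: readPdfVersion() is a made-up helper name (it is not part of my parser or of Zend), and it assumes the file starts directly with the "%PDF-x.y" header.

```php
<?php
/**
 * Return the version from a PDF header such as "%PDF-1.4",
 * or null if the file cannot be opened or has no such header.
 * readPdfVersion() is a hypothetical helper name.
 */
function readPdfVersion($path)
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    // The header is short, so cap the read instead of trusting line endings
    $firstLine = fgets($handle, 1024);
    fclose($handle);

    if ($firstLine !== false && preg_match('/^%PDF-(\d+\.\d+)/', $firstLine, $matches)) {
        return $matches[1];
    }
    return null;
}
```

Calling readPdfVersion('some.pdf') on one of the files described here should give back the string "1.4", which you could then use to decide how to handle the file.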

Just as you can Ctrl+F to find the metadata in Notepad++, you can do the same thing in PHP with fgets() and a while loop. Something I didn't do, but which would probably be a good idea, is to determine which end of the document to start from. The location isn't the same across PDF versions, but files of the same version seem to place it similarly. For instance, in PDF 1.4 the metadata appears to always be closer to the bottom of the document, while in PDF 1.6 it is closer to the top. Again, you can check the PDF version from the first line. Reading the document with PHP should be pretty simple to set up, so I'm going to skip this bit of code. I will point out, though, that it is a good idea to quit the loop once you have found the entire metadata block: this is a very processing-intensive operation, so you'll want to save time where you can. I would also suggest only running this on groups of 10-20 files at a time, fewer if the documents are large. Setting up a caching system helped me quite a bit with timeout errors.
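For anyone who wants a starting point, a minimal sketch of that loop might look like the following. It is not my original code. extractXmpPacket() is a hypothetical name, and the sketch assumes the metadata is stored uncompressed and unencrypted, with the packet delimited by the "xpacket" begin/end markers shown in the sample. It always reads from the top of the file and returns the first packet it finds.

```php
<?php
/**
 * Scan a file line by line and return the text from the
 * "<?xpacket begin" marker through the "<?xpacket end" marker,
 * or null if no complete packet is found.
 * extractXmpPacket() is a hypothetical helper name.
 */
function extractXmpPacket($path)
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    $packet = '';
    $recording = false;
    $found = false;

    while (($line = fgets($handle)) !== false) {
        if (!$recording) {
            $start = strpos($line, '<?xpacket begin');
            if ($start === false) {
                continue; // still in the binary part of the file
            }
            $recording = true;
            $line = substr($line, $start); // drop anything before the marker
        }

        $end = strpos($line, '<?xpacket end');
        if ($end !== false) {
            // Keep the line only up to the close of the end marker
            $close = strpos($line, '?>', $end);
            $packet .= ($close === false) ? $line : substr($line, 0, $close + 2);
            $found = true;
            break; // stop reading, the rest of the file is not needed
        }

        $packet .= $line;
    }

    fclose($handle);
    return $found ? $packet : null;
}
```

Because the markers are located with strpos() rather than by comparing whole lines, this also copes with a marker that shares its line with other content.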

Once you've got the metadata in a string, you'll want to clean it up a bit. The first thing to do is make sure your metadata is wrapped in a single root node so that the XML parser can read it. There were a couple of instances where it wasn't. The best/easiest way to fix this is to add a common wrapper, and I would suggest using the most common one available to you. For me, that was the "xmpmeta" tag with an inner "rdf" wrapper. Ensuring that each metadata block starts the same way is important for navigating the document. There might be a better way of doing this, but this works and isn't too inefficient (at least now, after I removed the two loops).

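// Wrap the metadata in a common root if the usual wrappers are missing
if(strpos($xmlstr, 'xmpmeta') === FALSE) {
    // stripos() is case-insensitive, so this also matches the usual "rdf:RDF" spelling
    if(stripos($xmlstr, 'rdf:rdf') === FALSE) { $xmlstr = "<rdf>$xmlstr</rdf>"; }
    $xmlstr = "<xmpmeta>$xmlstr</xmpmeta>";
}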

Afterwards you are going to want to remove the namespaces. I tried using them, but it's kind of hard to do so when the URLs keep changing in each implementation and you don't know for sure which ones you have. Besides, it was already starting to run slowly, and adding all that extra XML parsing would only have made it worse. It was just much simpler to remove them.

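// Strip the namespace prefixes ("rdf:", "dc:", ...) from tag and attribute names
$nodesToRemove = array('rdf', 'pdf', 'xap', 'xapMM', 'xmp', 'xmpMM', 'dc', 'x');
foreach($nodesToRemove as $remove) { $xmlstr = str_replace("$remove:", '', $xmlstr); }
// Strip xmlns declarations: double-quoted values first, then single-quoted
$xmlstr = preg_replace('/xmlns[^=]*="[^"]*"/i', '', $xmlstr);
$xmlstr = preg_replace("/xmlns[^=]*='[^']*'/i", '', $xmlstr);

// Load the cleaned string and collect any namespaces that are still declared
$dom = new DOMDocument();
$dom->loadXML($xmlstr);
$sxe = simplexml_import_dom($dom);
$root = $dom->documentElement;
$namespaces = $sxe->getDocNamespaces(TRUE);

// Remove whatever namespace declarations are left on the root element
foreach($namespaces as $prefix => $uri) {
    $root->removeAttributeNS($uri, $prefix);
    $root->removeAttribute("xmlns:$prefix");
}

// Do the same for each child element; _removeNS() is a method of my parser class (not shown)
if($root->hasChildNodes()) {
    foreach($root->childNodes as $element) {
        if ($element->nodeType != XML_TEXT_NODE) {
            $this->_removeNS($element, $namespaces);
        }
    }
}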

The $nodesToRemove list might be a little different for you. Those are just all the namespaces I ran across. Note: I was having issues where the order in which the prefixes were removed mattered. I'm not sure why, but it would remove the "xmp" from "xmpMM" and I would be stuck with an "MM" namespace. The code above doesn't appear to have that issue (it matches the prefix together with its trailing colon), so I'm not sure whether it is still a problem, but just in case, be wary. Either way, it isn't too hard to fix: just have PHP sort the list and then reverse it.

The regex removes the default namespace declarations. I tried a number of different ways to go about this, but this was the only one I could find that worked consistently. There's probably a way to combine those two regex calls, but I'm completely lost when it comes to regex, and my attempts just left it broken. As written, the patterns also appear to match prefixed declarations such as xmlns:rdf="...", so the DOM code that follows likely has little or nothing left to remove.

I'm not sure why I'm then removing the namespaces again with XML. This appears to be one of my more recent attempts at cleaning this up a bit; however, this is from a working solution, so it doesn't hurt (at least not functionally). The first bit, apart from the regex, can probably be removed and replaced with the XML solution, though I've not verified this. It's still necessary to remove the default namespaces before loading the string into XML, because the XML parsers do not consider the "xmlns" attribute to be an actual attribute. The only reason the namespaced version "xmlns:$prefix" works is that those are not considered "xmlns" attributes but "xmlns:$prefix" attributes. Subtleties.
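For what it's worth, the two regex calls and the prefix loop can probably be folded together. The following is only a sketch that I have tried against a small hand-written sample, not against real PDFs. stripXmpNamespaces() is a hypothetical name, the prefix list is simply the one from above, and the sketch assumes a prefix only needs to be removed where it starts a tag or attribute name.

```php
<?php
/**
 * Strip xmlns declarations and namespace prefixes from an XMP string.
 * stripXmpNamespaces() is a hypothetical helper name; the default
 * prefix list is the one from the post and may need extending.
 */
function stripXmpNamespaces($xmlstr, $prefixes = null)
{
    if ($prefixes === null) {
        $prefixes = array('rdf', 'pdf', 'xap', 'xapMM', 'xmp', 'xmpMM', 'dc', 'x');
    }

    // One pattern for every xmlns declaration: default or prefixed,
    // double-quoted or single-quoted
    $xmlstr = preg_replace(
        '/\s+xmlns(?::[\w.-]+)?\s*=\s*("[^"]*"|\'[^\']*\')/i',
        '',
        $xmlstr
    );

    // Longest prefixes first, so "xmpMM" is tried before "xmp"
    usort($prefixes, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Remove a prefix only right after "<", "</" or whitespace,
    // i.e. where a tag or attribute name begins
    $alternatives = implode('|', array_map('preg_quote', $prefixes));
    return preg_replace(
        '/(<\/?|\s)(?:' . $alternatives . '):(?=[A-Za-z_])/',
        '$1',
        $xmlstr
    );
}
```

Anchoring the prefix to the start of a name means values such as "uuid:4eae0fcf-..." are left alone, which a plain str_replace() cannot guarantee. Text content that happens to contain something like " dc:" after a space would still be altered, so treat this as a starting point only.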

Don't be like me. Don't try to implement every version of PDF ever created. It CAN'T be done. Well... it probably can, but it's more hassle than it's worth. Luckily for me, these were all in-house documents, so when I reached my limit and was tired of tweaking it just to break something else, or lose compatibility that I previously had, I just had those last few documents converted. Find the most common versions and handle those, then the next most common and set up conditions for those, and so on. Once you get to the point where only a few are left, have them updated, or just announce that you don't support that version, especially if the files are older. There's no sense in adding functionality that will only ever be used for a few documents.

One of the big problems I can remember was that the "xpacket" was not always on its own line. Sometimes it shared a line with a few metadata tags. This caused "missing" data, because I did not start recording the metadata until after the "xpacket" was found. It seemed like a simple fix, but it uncovered a whole lot of issues, so I ended up scrapping that revision altogether and having the files updated. Luckily those were the last 3-4 files.

Once you have cleaned the metadata, then you are ready to parse it as XML. For example, here's how I get the description.

function getDescription($xml) {
    $return = 'Error: Metadata could not be retrieved';//Return value if metadata can not be parsed

    $sxe = new SimpleXMLElement($xml);

    $xpath = array(
        '//description/Alt/li',
        '//Description/Alt/li',
        '//xmpmeta/RDF/*[last()]',
        //'//Description/description',
    );
    foreach($xpath as $pattern) {
        $temp = $sxe->xpath($pattern);

        if( ! empty($temp)) {
            $return = isset($temp[0]->description) ? $temp[0]->description : $temp[0];
            break;
        }
    }

    //Return value if description was not found in metadata
    return empty($return) ? 'Error: Metadata "description" could not be retrieved' : strval($return);
}

There are a few things to note about this. The first is the array of XPaths. These are the multiple conditions I was talking about earlier. You may also notice the commented-out XPath. That's one I am either still working on compatibility for, or have given up on. I don't remember; it's been a while since I've had to look at this, and no one has complained about errors, so I'm assuming it's not an issue. Another thing to notice is the number of deviations for just this ONE field. The metadata changed quite a bit, and sometimes reverted. So you have to check for each case, make sure there were no other deviations, and then add any other conditions that may have occurred. Something to look into would be keeping separate parsers per version and loading the proper one, which may cut down on inefficiency. Looking back on this now, perhaps the easier way would have been to look up the standardization docs for each revision, but instead I ended up doing this mostly through trial and error. So, while this works for me, there may be some things I missed because they weren't an issue in any of my documents. The other thing to note is how similar the tags are between the revisions. I wasn't, and still am not, all that great with advanced XPath, so maybe there is some better way to do this. I don't know.
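Since every field ends up needing its own list of XPaths, the "try each pattern until one matches" part could be pulled out into a small helper. This is a sketch only. getFirstMatch() is a hypothetical name, and it assumes the XML has already had its namespaces removed as described above.

```php
<?php
/**
 * Return the first non-empty string matched by a list of XPath
 * patterns, or null if the XML cannot be parsed or nothing matches.
 * getFirstMatch() is a hypothetical helper name.
 */
function getFirstMatch($xml, $patterns)
{
    $sxe = @simplexml_load_string($xml);
    if ($sxe === false) {
        return null; // not well-formed XML
    }

    foreach ($patterns as $pattern) {
        $nodes = $sxe->xpath($pattern);
        if (empty($nodes)) {
            continue; // this revision's layout did not match, try the next
        }

        $value = trim((string) $nodes[0]);
        if ($value !== '') {
            return $value;
        }
    }

    return null;
}
```

With this, the description lookup becomes getFirstMatch($xml, array('//description/Alt/li', '//Description/Alt/li')), and other fields only need their own pattern list. Returning null instead of an error string leaves it to the caller to decide what to show when nothing is found.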

I hope this helps somewhat. I know it's given me a few ideas. If you have any other specific questions, let me know.
