vanneto - 5 months ago
Python Question

scrapy - Get final redirected URL

I am trying to get the final redirected URL in scrapy. For example, if an anchor tag has a specific format:

<a href="http://www.example.com/index.php" class="FOO_X_Y_Z" />


Then I need the URL that this URL ultimately redirects to (if it redirects at all; if it returns 200 directly, the URL itself is fine). For example, I get the appropriate anchor tags like this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href")

    # Let's assume anchor contains the actual link (http://...)
    for anchor in anchors:
        final_url = get_final_url(anchor)  # << I would need something like this

        # Save final_url


So if I visited http://www.example.com/index.php and it sent me through 10 redirects before finally stopping at http://www.example.com/final.php, then that final URL is what I would need get_final_url() to return.

I thought of hacking together my own solution, but I am asking here first to see whether Scrapy already provides one.

Answer

Again, assuming anchor contains an actual URL, I went and accomplished it with urllib2:

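import urllib2

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the selected href attributes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        # urlopen(url, data, timeout) follows redirects; geturl() gives the final URL
        final_url = urllib2.urlopen(anchor, None, 1).geturl()

        # Save final_url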

urllib2.urlopen() returns a file-like object with two additional methods, one of which is geturl(); it returns the final URL after all redirects have been followed. It's not part of Scrapy, but it works.
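
For anyone on Python 3, where urllib2 has become urllib.request, the same idea could be sketched roughly as follows. The function name get_final_url comes from the question; the timeout default is just an assumption for illustration, and this is a minimal sketch rather than a complete solution:

```python
import urllib.request


def get_final_url(url, timeout=10.0):
    """Return the URL that `url` resolves to after HTTP redirects.

    `timeout` (seconds) is an illustrative default, not a recommendation.
    """
    # urlopen() follows HTTP redirects automatically; geturl() then
    # reports the URL of the resource that was actually retrieved.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Usage would look like final_url = get_final_url(anchor) inside the loop. Note that urlopen() raises an exception for error responses (for example a 404) and for network failures, so real code would probably wrap the call in try/except. It is also a blocking call, so calling it from inside a Scrapy callback will stall the crawl while each request completes.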