hjschmid - 1 month ago
Python Question

Scrapy SgmlLinkExtractor how to define rule with regex

I have a link like http://www.homegate.ch/kaufen/105975478

I only want to allow links that have "/kaufen/" in the URL and that end with a nine-digit number.

I managed to allow only links containing "/kaufen/" with the following allow statement:

allow=('/kaufen/', )

How can I extend the allow statement so that it only follows links that have a nine-digit number at the end?


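You can use \/kaufen\/[0-9]{9}

  • \/kaufen\/ matches /kaufen/ literally
  • [0-9]{9} matches exactly nine digits

To require that the nine digits sit at the very end of the URL, append $ to the pattern: \/kaufen\/[0-9]{9}$

The backslashes before the slashes are only needed inside a JavaScript regex literal. In Python, /kaufen/[0-9]{9}$ works as is.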
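Since the question is about Python, here is a minimal sketch of the same check using the standard re module. The names KAUFEN_PATTERN and is_allowed, and every sample URL except the first, are made up for illustration. It assumes the link extractor applies each allow pattern as a regex search against the full URL, which is my understanding of how Scrapy treats allow, so the same string should be usable there.

```python
import re

# Unescaped slashes are fine in Python; "$" anchors the nine digits
# to the end of the URL.
KAUFEN_PATTERN = r"/kaufen/[0-9]{9}$"

# Sample URLs: only the first one comes from the question,
# the others are hypothetical counterexamples.
urls = [
    "http://www.homegate.ch/kaufen/105975478",        # nine digits at the end
    "http://www.homegate.ch/kaufen/105975478/extra",  # digits not at the end
    "http://www.homegate.ch/kaufen/1059754789",       # ten digits
    "http://www.homegate.ch/mieten/105975478",        # no /kaufen/
]


def is_allowed(url):
    """Return True if the URL ends in /kaufen/ followed by exactly nine digits."""
    return re.search(KAUFEN_PATTERN, url) is not None


for url in urls:
    print(url, is_allowed(url))

# The same string would then go into the extractor's allow tuple
# (assumption: patterns are applied with a regex search).
allow = (KAUFEN_PATTERN,)
```

Only the first URL should pass. Note that SgmlLinkExtractor is deprecated in newer Scrapy releases in favour of scrapy.linkextractors.LinkExtractor, which also takes an allow argument.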


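var re = /\/kaufen\/[0-9]{9}/gi;
var str = 'http://www.homegate.ch/kaufen/105975478';
var m;
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex) {
        // Guard against an infinite loop on zero-length matches.
        re.lastIndex++;
    }
    // View your result using the m-variable, e.g. m[0] is the matched text.
    console.log(m[0]);
}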