Killerpixler - 1 year ago
Ruby Question

Anemone Crawler skip_links_like not obeyed

I am using Anemone to crawl a massive site that, to make things worse, has the same content in several different language versions.

There is
for the main language and
for the other languages so I decided to exclude these in the crawl like so:

crawler ='', opts = {skip_query_strings: true})

However, when I look at what is being crawled via a puts page.url in the on_every_page do |page| block, I can see that it is still crawling all of the language variations.

I've even tried to include this

crawler.focus_crawl{|page| page.links.reject{|i| !i.to_s.match(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/).nil? }}

to remove the language links from the list of pages to be crawled next.

Any suggestions?

Answer Source

Turns out the skip_links_like method takes URIs, not URLs, meaning you can only match on the part after the top-level domain (the path). So instead of this:


I had to use this:


or just the REGEX differences:

Wrong: .+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*

Right: ^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
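To see why the two patterns behave differently, here is a minimal sketch using only Ruby's standard uri library, not Anemone itself. It assumes, as described above, that the pattern is tested against the path portion of each link; the example.com URL and the matches? helper are hypothetical and only there for illustration.

```ruby
require 'uri'

# Hypothetical URL standing in for one of the language versions of the site.
link = URI('http://www.example.com/de/products?page=2')

LANGS = 'fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int'

# Pattern written against the full URL (the "Wrong" version above).
full_url_pattern = /.+com\/(#{LANGS}).*/
# Pattern written against the path only (the "Right" version above).
path_pattern = /^\/(#{LANGS}).*/

# Hypothetical helper: true when the pattern matches the given text.
def matches?(text, pattern)
  !(text =~ pattern).nil?
end

full_on_url  = matches?(link.to_s, full_url_pattern)  # full URL contains "com/de"
full_on_path = matches?(link.path, full_url_pattern)  # path has no "com/" in it
path_on_path = matches?(link.path, path_pattern)      # path starts with "/de"

puts "path:                     #{link.path}"
puts "full-URL regex vs URL:    #{full_on_url}"
puts "full-URL regex vs path:   #{full_on_path}"
puts "path regex vs path:       #{path_on_path}"
```

The full-URL pattern only matches when it is handed the whole URL, so if only the path is tested it never fires and nothing gets skipped. Note that the path pattern as written would also match any path that merely begins with one of the codes (for example /films); if that matters for your site, ending the group with (\/|$) instead of .* is one way to tighten it.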