Killerpixler - 21 days ago
Ruby Question

Anemone Crawler skip_links_like not obeyed

I am using Anemone to crawl a massive site that, to make things worse, has the same content in several language versions.

There is domain.com/ for the main language, and domain.com/de/, domain.com/es/, etc. for the other languages, so I decided to exclude these from the crawl like so:

crawler = Anemone::Core.new('http://domain.com', skip_query_strings: true)
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)


However, when I look at what is being crawled via a puts page.url in the on_every_page do |page| block, I can see that it is still crawling all of the language variations.

I've even tried adding this:

crawler.focus_crawl{|page| page.links.reject{|i| !i.to_s.match(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/).nil? }}


The idea was to remove the language links from the list of pages to be crawled next.

Any suggestions?

Answer

It turns out that the skip_links_like method matches its patterns against the path of each link, not the full URL, so you can only match on the part that comes after the domain. Instead of this:

crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

I had to use this:

crawler.skip_links_like(/(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

Or, showing just the regex differences:

Wrong: .+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*

Right: ^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
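To see the difference without running a crawl, here is a minimal sketch using only Ruby's standard URI library. It assumes, as described above, that the pattern is tested against the path part of each link; the helper skipped? and the constants WRONG and RIGHT are made-up names for illustration and are not part of Anemone.

```ruby
require 'uri'

# The two patterns from the question and the answer.
WRONG = /(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/
RIGHT = /(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/

# Hypothetical helper: tests a pattern against the path only,
# which is the behaviour assumed here for skip_links_like.
def skipped?(url, pattern)
  path = URI(url).path # e.g. "/de/page" for "http://domain.com/de/page"
  !(path =~ pattern).nil?
end

puts skipped?('http://domain.com/de/page', WRONG)      # the path never contains "com/"
puts skipped?('http://domain.com/de/page', RIGHT)      # the path starts with "/de"
puts skipped?('http://domain.com/about', RIGHT)        # not a language prefix
puts skipped?('http://domain.com/img/logo.png', WRONG) # the extension half still works
```

One thing to watch: because the language group is not followed by a slash, the anchored pattern also matches paths that merely start with one of those letter pairs, such as /design. If that matters for your site, appending (\/|$) after the group would be one way to tighten it.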