Ravenous - 1 month ago
Ruby Question

Scanning a webpage for URLs with Ruby and regex

I'm trying to create an array of all the links found at the URL below. Using

page.scan(URI.regexp)

or

URI.extract(page)

returns more than just URLs.

How do I get just the URLs?

require 'net/http'
require 'uri'

# Fetch the page body as a string
uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)

# Both of these match far more than plain http(s) URLs
p page.scan(URI.regexp)
p URI.extract(page)
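
(Note: URI.extract accepts an optional list of schemes as a second argument, so restricting it to http/https trims some of the noise; however, it still matches URLs anywhere in the raw source, not just inside href attributes. Continuing from the snippet above:)

p URI.extract(page, %w[http https])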

Answer

If you are just trying to extract links (<a href="..."> elements) from the text file, then it seems better to parse it as real HTML with Nokogiri and extract the links this way:

require 'nokogiri'
require 'open-uri'

# Fetch the raw text and parse it as HTML
# (on Ruby 3.0+ use URI.open; Kernel#open no longer opens URLs)
doc = Nokogiri::HTML(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))

# Extract all a-elements (HTML links)
all_links = doc.css('a')

# Pull out the href attributes, drop empty links and duplicates, then sort
links = all_links.map { |link| link['href'].to_s }
                 .reject(&:empty?)
                 .uniq
                 .sort

# Print out some of them
puts links.grep(/store/)

http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
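
If the list also contains relative paths or fragments, one rough sketch for keeping only well-formed absolute http(s) URLs (assuming links is the array built above; restricting to http/https is my own choice, not something the data requires):

require 'uri'

# Keep only hrefs that parse as absolute http(s) URIs;
# relative paths, anchors, and javascript: links are dropped
http_links = links.select do |href|
  begin
    URI.parse(href).is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  rescue URI::InvalidURIError
    false
  end
end

puts http_links.grep(/store/)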