Eric Eric - 5 months ago 11
Ruby Question

Ruby - Matching Twitter URL from any html page using Regex

I am trying to fetch the Twitter URL from this page, for instance; however, my result is nil. I am pretty sure my regex is not too bad, but my code fails. Here it is:

doc = `(curl --url "http://www.rabbitreel.com/")`
twitter_url = ("/^(?i)[http|https]+:\/\/(?i)[twitter]+\.(?i)(com)\/?\S+").match(doc)
puts twitter_url
# => nil


Maybe, I misused regex syntax. My initial idea was simple: I wanted to match a regular Twitter url structure. I even tried http://rubular.com to test my regex, and it seemed to be fine when I entered a Twitter url.

Answer

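The documentation at http://ruby-doc.org/core-2.2.0/String.html#method-i-match tells you that the object you're calling match on should be the string you're parsing, and the parameter should be the regex pattern. The pattern should also be a Regexp literal rather than a quoted string, which would treat the enclosing slashes as literal characters. So, if anything, you should call:

doc.match(/^(?i)[http|https]+:\/\/(?i)[twitter]+\.(?i)(com)\/?\S+/)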
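As a rough illustration of why the quoted pattern cannot work, the sketch below shows how Ruby treats a pattern given as a double-quoted String, and what a character class like [http|https] really matches. PAGE is a hypothetical stand-in for the downloaded page, not the real content of the site.

```ruby
# Hypothetical sample input standing in for the downloaded page.
PAGE = 'see <a href="https://twitter.com/rabbitreel">us</a>'

# A String argument is converted to a Regexp, so the surrounding
# slashes become literal characters that must appear in the text.
p PAGE.match("/twitter/")      # nil
p PAGE.match("twitter")[0]     # "twitter"
p PAGE.match(/twitter/)[0]     # "twitter"

# Inside double quotes, Ruby drops the backslash of unknown escapes,
# so \S (non-whitespace) silently turns into a plain "S".
puts "\/?\S+"                  # /?S+

# [http|https]+ is a character class: any run of h, t, p, | or s.
p(/\A[http|https]+\z/.match?("ptth|s"))   # true
```

To match the literal alternatives "http" or "https", a group such as (http|https) or the shorter https? would be the usual choice.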

I prefer

doc[/your_regex/]

syntax, because it directly delivers a String, and not a MatchData, which needs another step to get the information out of.
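A minimal sketch of that difference, assuming a small hypothetical HTML snippet (SAMPLE_HTML) in place of the real page:

```ruby
# Hypothetical sample input standing in for the downloaded page.
SAMPLE_HTML = '<a href="https://twitter.com/rabbitreel">Twitter</a>'

# String#match returns a MatchData object (or nil when nothing matches).
match_data = SAMPLE_HTML.match(/twitter\.com/)
puts match_data.class                      # MatchData
puts match_data[0]                         # twitter.com

# String#[] with a Regexp returns the matched String directly (or nil).
puts SAMPLE_HTML[/twitter\.com/]           # twitter.com
puts SAMPLE_HTML[/facebook\.com/].inspect  # nil
```

Both forms return nil when there is no match, so the result still needs a nil check before further use.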

For regexes, I always try to begin as simply as possible and build the pattern up step by step:

[3] pry(main)> doc[/twitter/]
=> "twitter"
[4] pry(main)> doc[/twitter\.com/]
=> "twitter.com"
[5] pry(main)> doc[/twitter\.com\//]
=> "twitter.com/"
[6] pry(main)> doc[/twitter\.com\/\//] #OOPS. One \/ too many
=> nil
[7] pry(main)> doc[/twitter\.com\//]
=> "twitter.com/"
[8] pry(main)> doc[/twitter\.com\/\S+/]
=> "twitter.com/rabbitreel\""
[9] pry(main)> doc[/twitter\.com\/[^"]+/]
=> "twitter.com/rabbitreel"
[10] pry(main)> doc[/http:\/\/twitter\.com\/[^"]+/]
=> nil
[11] pry(main)> doc[/https?:\/\/twitter\.com\/[^"]+/]
=> "https://twitter.com/rabbitreel"
[12] pry(main)> doc[/https?:\/\/twitter\.com\/[^" ]+/] #DONE
=> "https://twitter.com/rabbitreel"
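The final pattern from the session above could be wrapped up roughly like this. The constant TWITTER_URL, the helper first_twitter_url, and the sample markup are hypothetical names chosen for illustration; only the regex itself comes from the session.

```ruby
# Regex from the final step above: optional "s" in the scheme, then
# everything up to the next double quote or space.
TWITTER_URL = /https?:\/\/twitter\.com\/[^" ]+/

# Hypothetical helper: returns the first Twitter URL in html, or nil.
def first_twitter_url(html)
  html[TWITTER_URL]
end

# Hypothetical sample markup standing in for the downloaded page.
sample = '<a class="social" href="https://twitter.com/rabbitreel" target="_blank">'

puts first_twitter_url(sample)               # https://twitter.com/rabbitreel
puts first_twitter_url("no links").inspect   # nil

# String#scan collects every match rather than only the first one.
p sample.scan(TWITTER_URL)                   # ["https://twitter.com/rabbitreel"]
```

Note that this pattern assumes the URL is delimited by a double quote or a space; a page that uses single-quoted attributes would need those characters added to the excluded set.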