MonsterMMORPG - 3 months ago
C# Question

C# regex pattern to extract URLs from a given string - not only full HTTP URLs but bare links as well

I need a regex that will do the following:

Extract all strings that start with http://
Extract all strings that start with www.

So I need to extract both kinds.

For example, given the following text:

house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue


So from the string above I should get:

www.monstermmorpg.com
http://www.monstermmorpg.com
http://www.monstermmorpg.commerged


I am looking for a regex or any other approach. Thank you.

C# 4.0

Answer

You can handle this with a fairly simple regular expression, or take the more traditional route of string splitting plus LINQ.

Regex

// Requires: using System.Text.RegularExpressions;
// MessageBox lives in System.Windows.Forms; use Console.WriteLine in a console app.
Regex linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b",
    RegexOptions.Compiled | RegexOptions.IgnoreCase);
string rawString = "house home go www.monstermmorpg.com nice hospital " +
    "http://www.monstermmorpg.com this is incorrect " +
    "url http://www.monstermmorpg.commerged continue";

// Show each link found in the input.
foreach (Match m in linkParser.Matches(rawString))
    MessageBox.Show(m.Value);

Pattern explanation:

\b         - matches a word boundary: the position between a word character
             (letter, digit or underscore) and a non-word character, or the
             start or end of the string.
(?:        - begins a group; the ?: means the group is not captured.
https?://  - matches http:// or https:// (the '?' after the "s" makes it optional).
|          - OR
www\.      - matches the literal string www. (the \. means a literal ".").
)          - ends the group.
\S+        - matches one or more non-whitespace characters.
\b         - matches the closing word boundary, so the match ends on a word character.

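In short, the pattern looks for text that starts with http://, https:// or www. (the group (?:https?://|www\.)) and then matches non-whitespace characters up to the next whitespace. Because of the closing \b, the match ends at the last word character, so trailing punctuation such as a final period, comma or slash is left out of the match.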
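
To see how the closing \b behaves on links followed by punctuation, here is a minimal console sketch. The pattern is the one from this answer; the class name LinkExtractor, the Extract helper and the sample sentence are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as in the answer: prefix, then non-whitespace, ending on a word character.
    private static readonly Regex LinkParser = new Regex(
        @"\b(?:https?://|www\.)\S+\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: collects every match as a plain string.
    public static List<string> Extract(string text)
    {
        var result = new List<string>();
        foreach (Match m in LinkParser.Matches(text))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        // Illustrative input: both links are followed by punctuation.
        string sample = "See www.example.com, then http://example.org/page. Done";
        foreach (string link in Extract(sample))
            Console.WriteLine(link);
        // Prints:
        // www.example.com
        // http://example.org/page
    }
}
```

The comma and the period are not part of the matches, which is usually what you want for links embedded in sentences. The same rule also drops a trailing slash, so adjust the pattern if that matters for your data.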

Traditional String Options

string rawString = "house home go www.monstermmorpg.com nice hospital " +
    "http://www.monstermmorpg.com this is incorrect " + 
    "url http://www.monstermmorpg.commerged continue";

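// Requires: using System.Linq;
// Split on tabs, newlines and spaces, then keep tokens with a link prefix.
var links = rawString.Split("\t\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
  .Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://"));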

foreach (string s in links)
    MessageBox.Show(s);
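
The split-based version is case-sensitive, while the regex uses RegexOptions.IgnoreCase. If you want the two to agree on that point, one possible variation is sketched below. The class name LinkSplitter, the Extract helper and the sample sentence are assumptions for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkSplitter
{
    // Prefixes that mark a token as a link.
    private static readonly string[] Prefixes = { "http://", "https://", "www." };

    // Hypothetical helper: splits on whitespace and keeps tokens with a link prefix.
    public static List<string> Extract(string text)
    {
        // A null separator array makes Split treat any whitespace as a delimiter.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => Prefixes.Any(p =>
                token.StartsWith(p, StringComparison.OrdinalIgnoreCase)))
            .ToList();
    }

    public static void Main()
    {
        // Illustrative input with upper-case prefixes and trailing punctuation.
        string sample = "go WWW.Example.com now, or HTTPS://example.org/page. Done";
        foreach (string link in Extract(sample))
            Console.WriteLine(link);
        // Prints:
        // WWW.Example.com
        // HTTPS://example.org/page.
    }
}
```

Note that this approach returns whole whitespace-delimited tokens, so the trailing period stays attached to the second link. If that is a problem, trim the punctuation afterwards or use the regex instead.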