PHP Question

How to make this PHP URL parsing function nearly perfect?

This function is great, but its main flaw is that it doesn't handle domains ending with .co.uk or .com.au. How can it be modified to handle this?

function parseUrl($url) {
$r = "^(?:(?P<scheme>\w+)://)?";
$r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
$r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
$r .= "(?::(?P<port>\d+))?";
$r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
$r .= "(?:\?(?P<arg>[\w=&]+))?";
$r .= "(?:#(?P<anchor>\w+))?";
$r = "!$r!";

preg_match ( $r, $url, $out );

return $out;
}


To clarify: I am looking for something other than parse_url() because I also want to strip out (possibly multiple) subdomains.

print_r(parse_url('http://sub1.sub2.test.co.uk'));


Results in:

Array(
[scheme] => http
[host] => sub1.sub2.test.co.uk
)


What I want to extract is "test.co.uk" (sans subdomains), so first using parse_url is a pointless extra step where the output is the same as the input.
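
As a rough illustration of the goal described above, the subdomains can be stripped from a host name by comparing its last two labels against a list of known multi-part suffixes. This is only a minimal sketch: the function name registrableDomain() and the two-entry suffix list are hypothetical, and a real implementation would need a complete suffix source such as the Public Suffix List.

```php
<?php
// Hypothetical helper: strips subdomains and keeps the registrable domain.
// $multiPartSuffixes is a tiny illustrative sample, NOT a complete list.
function registrableDomain($host, array $multiPartSuffixes = array('co.uk', 'com.au'))
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    if ($count < 2) {
        // Single label (e.g. "localhost"): nothing to strip.
        return strtolower($host);
    }

    // Keep three labels when the last two form a known multi-part suffix,
    // otherwise keep two.
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    $keep = ($count >= 3 && in_array($lastTwo, $multiPartSuffixes, true)) ? 3 : 2;

    return implode('.', array_slice($labels, -$keep));
}

echo registrableDomain('sub1.sub2.test.co.uk'), "\n"; // test.co.uk
echo registrableDomain('www.example.com'), "\n";      // example.com
```

The host passed in could come from parse_url() or from the host group of a regex; the sketch only deals with the label-stripping step.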

Answer

This may or may not be of interest, but here's a (somewhat monstrous) regex I wrote that mostly conforms to RFC 3986. It is actually slightly stricter, as it disallows some of the more unusual URI syntaxes:

~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i

The named components are:

scheme
authority
  userinfo
    username
    password
  host
    domain
    ip
  port
path
query
fragment
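
To show how named components like these come back from preg_match(), here is a small sketch. The pattern is a deliberately simplified stand-in written for illustration, not the full expression above, and the URL is made up.

```php
<?php
// Simplified illustrative pattern (NOT the full expression from the answer).
$pattern = '~^(?P<scheme>[a-z][0-9a-z.+-]*)://'
         . '(?P<host>[0-9a-z.-]+)'
         . '(?::(?P<port>\d+))?'
         . '(?P<path>/[^?#]*)?$~i';

preg_match($pattern, 'http://sub1.sub2.test.co.uk:8080/index.php', $matches);

// preg_match() stores every named group under both its name and its
// numeric index, so keep only the string keys.
$components = array();
foreach ($matches as $key => $value) {
    if (is_string($key)) {
        $components[$key] = $value;
    }
}

print_r($components);
```

After filtering, $components holds scheme, host, port and path keyed by name, which is usually easier to work with than the mixed array preg_match() fills in.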

And here's the code that generates it (along with variants defined by some options):

public static function validateUri($uri, &$components = false, $flags = 0)
{
    if (func_num_args() > 3)
    {
        $flags = array_slice(func_get_args(), 2);
    }

    if (is_array($flags))
    {
        $flagsArray = $flags;
        $flags = 0; // Start from an empty bitmask; OR-ing into an array would fail.
        foreach ($flagsArray as $flag)
        {
            if (is_int($flag))
            {
                $flags |= $flag;
            }
        }
    }

    // Set options.
    $requireScheme = !($flags & self::URI_ALLOW_NO_SCHEME);
    $requireAuthority = !($flags & self::URI_ALLOW_NO_AUTHORITY);
    $isRelative = (bool)($flags & self::URI_IS_RELATIVE);
    $requireMultiPartDomain = (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN);

    // And we're away…

    // Some character types (taken from RFC 3986: http://tools.ietf.org/html/rfc3986).
    $hex = '[\da-f]'; // Hexadecimal digit.
    $pct = "(?:%$hex{2})"; // "Percent-encoded" value.
    $gen = '[\[\]:/?#@]'; // Generic delimiters.
    $sub = '[!$&\'()*+,;=]'; // Sub-delimiters.
    $reserved = "(?:$gen|$sub)"; // Reserved characters.
    $unreserved = '[\w.\~-]'; // Unreserved characters.
    $pChar = "(?:$unreserved|$pct|$sub|:|@)"; // Path characters.
    $qfChar = "(?:$pChar|/|\?)"; // Query/fragment characters.

    // Other entities.
    $octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';
    $label = '[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?';

    $scheme = '(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)';

    // Authority components.
    $userInfo = "(?:(?P<userinfo>(?P<username>(?:$unreserved|$pct|$sub)*)?:(?P<password>(?:$unreserved|$pct|$sub)*)?|(?:$unreserved|$pct|$sub|:)*?)@)?";
    $ip = "(?P<ip>$octet\.$octet\.$octet\.$octet)"; // Dots escaped so they only match a literal ".".
    if ($requireMultiPartDomain)
    {
        $domain = "(?P<domain>(?:$label\.)+(?:$label))";
    }
    else
    {
        $domain = "(?P<domain>(?:$label\.)*(?:$label))";
    }
    $host = "(?P<host>$domain|$ip)";
    $port = '(?::(?P<port>\d+))?';

    // Primary hierarchical URI components.
    $authority = "(?P<authority>$userInfo$host$port(?=/|$))";
    $path = "(?P<path>/?(?:$pChar+/)*(?:$pChar+/?)?)";

    // Final bits.
    $query = "(?:\?(?P<query>$qfChar*?))?";
    $fragment = "(?:#(?P<fragment>$qfChar*))?";

    // Construct the final pattern.
    $pattern = '~^';

    // Only include scheme and authority if the path is not relative.
    if (!$isRelative)
    {
        if ($requireScheme)
        {
            // If the scheme is required, then the authority must also be there.
            $pattern .= $scheme . $authority;
        }
        else if ($requireAuthority)
        {
            $pattern .= "$scheme?$authority";
        }
        else
        {
            $pattern .= "(?:$scheme?$authority)?";
        }
    }
    else
    {
        // Disallow that optional slash we put in $path.
        $pattern .= '(?!/)';
    }

    // Now add standard elements and terminate the pattern.
    $pattern .= $path . $query . $fragment . '$~i';

    // Finally, validate that sucker!
    $components = array();
    $result = (bool)preg_match($pattern, $uri, $matches);
    if ($result)
    {
        // Filter out all of the useless numerical matches.
        foreach ($matches as $key => $value)
        {
            if (!is_int($key))
            {
                $components[$key] = $value;
            }
        }

        return true;
    }
    else
    {
        return false;
    }
}
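
The function refers to four class constants (URI_ALLOW_NO_SCHEME and friends) whose definitions are not shown. A minimal sketch of how such flags could be declared and decoded, assuming they are distinct powers of two; the class name UriFlags, the decode() helper and the numeric values are all hypothetical:

```php
<?php
// Hypothetical flag values; the original class does not show them.
class UriFlags
{
    const URI_ALLOW_NO_SCHEME = 1;
    const URI_ALLOW_NO_AUTHORITY = 2;
    const URI_IS_RELATIVE = 4;
    const URI_REQUIRE_MULTI_PART_DOMAIN = 8;

    // Mirrors the "Set options" step: turn a bitmask into booleans.
    public static function decode($flags)
    {
        return array(
            'requireScheme' => !($flags & self::URI_ALLOW_NO_SCHEME),
            'requireAuthority' => !($flags & self::URI_ALLOW_NO_AUTHORITY),
            'isRelative' => (bool)($flags & self::URI_IS_RELATIVE),
            'requireMultiPartDomain' => (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN),
        );
    }
}

// Flags are combined with bitwise OR before being passed in.
$options = UriFlags::decode(
    UriFlags::URI_ALLOW_NO_SCHEME | UriFlags::URI_REQUIRE_MULTI_PART_DOMAIN
);

var_dump($options['requireScheme']);          // bool(false)
var_dump($options['requireMultiPartDomain']); // bool(true)
```

Because each constant occupies its own bit, any combination can be passed as a single integer and tested independently with the & operator, which is what the option-setting lines in validateUri() rely on.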