Regular expression to match generic URL

I've looked all over and have yet to find a single solution to address my need for a regular expression pattern that will match a generic URL. I need to support multiple protocols (with verification), localhost and/or IP addressing, ports and query strings. Some examples:

http://localhost/mysite
https://localhost:55000
ftp://192.1.1.1
telnet://somesite/page.htm?a=1&b=2

Ideally, I'd like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.) but this is not a requirement.

(Also, for the purposes of myself and future readers, if you could explain the pattern, it would be helpful.)

Appendix B of RFC 3986/STD 0066 (Uniform Resource Identifier (URI): Generic Syntax) provides the regular expression you need:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.
  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to
  http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
  $1 = http:
  $2 = http
  $3 = //www.ics.uci.edu
  $4 = www.ics.uci.edu
  $5 = /pub/ietf/uri/
  $6 = <undefined>
  $7 = <undefined>
  $8 = #Related
  $9 = Related
where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the five components as
  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9
Going in the opposite direction, we can recreate a URI reference from its components by using the algorithm of Section 5.3.
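For anyone who wants to try the Appendix B pattern directly, here is a minimal Python sketch. The pattern is copied verbatim from the RFC; the helper name `split_uri` and the dictionary keys are my own choices for illustration, not anything the RFC defines.

```python
import re

# The Appendix B pattern, copied verbatim from RFC 3986.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the five RFC 3986 components; absent components are None."""
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# One of the example URLs from the question.
print(split_uri("telnet://somesite/page.htm?a=1&b=2"))
```

Note that the authority comes back as a single string, so for `https://localhost:55000` it is `localhost:55000`; separating host from port is an extra step.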

As far as validating a URI against a particular scheme goes, you'll need to look at the RFC(s) describing the scheme(s) in which you are interested to get the detail required to check that a URI is valid for the scheme it purports to use. The URI scheme registry is located at http://www.iana.org/assignments/uri-schemes.html.

And even then, you're doomed to some sort of failure. Consider the file: scheme. You can't validate that it represents a valid path in the file system of the authority (unless you are the authority). The best that you can do is validate that it represents something that looks like a valid path. And even then, a Windows file: URL like file:///C:/foo/bar/baz/bat.txt is (would be) invalid for anything but a server running some flavor of Windows. Any server running *nix would likely choke on it (what's a drive letter anyway?).
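A "looks plausible" check of the kind described above might be sketched like this in Python, using the standard library's `urllib.parse.urlsplit` rather than a hand-written regex. The allowlist `ALLOWED_SCHEMES` and the function name `check_url` are assumptions made up for the example; a real check would follow the rules of each scheme's own RFC.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; pick whatever schemes your application accepts.
ALLOWED_SCHEMES = {"http", "https", "ftp", "telnet"}

def check_url(url):
    """Return (scheme, host, port, query), or None if a basic check fails."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return None
    if parts.scheme not in ALLOWED_SCHEMES:
        return None
    if not parts.hostname:
        return None
    return parts.scheme, parts.hostname, port, parts.query

for url in ["http://localhost/mysite", "https://localhost:55000",
            "ftp://192.1.1.1", "telnet://somesite/page.htm?a=1&b=2"]:
    print(url, "->", check_url(url))
```

This only confirms that the pieces are well-formed and the scheme is one you expect. It says nothing about whether the host exists or the path resolves, which is the limitation described above.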

Nicholas Carey is correct to steer you towards RFC-3986. The regex he points out will match a generic URI, but it will not validate it (and this regex is not good for picking URLs out of "the wild" - it is too loose and matches just about any string including an empty string).
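To see how loose the pattern is, here is a quick Python check (the variable names are mine). Because every group in the Appendix B regex is optional, it matches the empty string and arbitrary text just as readily as a real URL:

```python
import re

# Appendix B pattern from RFC 3986, unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

for candidate in ["", "not a url at all", "http://localhost/mysite"]:
    m = URI_RE.match(candidate)
    # A match object is returned in every case; only the captured parts differ.
    print(repr(candidate), "matched:", m is not None,
          "scheme:", m.group(2), "path:", repr(m.group(5)))
```

With no scheme or authority present, the whole input simply lands in the path group, which is why a successful match tells you nothing about validity.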

Regarding the validation requirement, you may want to take a look at an article I wrote on the subject, which takes from Appendix A all the ABNF syntax definitions of all the various components and provides regex equivalents:

Regular Expression URI Validation

Regarding the subject of picking out URLs from the "wild", take a look at Jeff Atwood's "The Problem With URLs" and John Gruber's "An Improved Liberal, Accurate Regex Pattern for Matching URLs" blog posts to get a glimpse of some of the subtle problems which can arise. Also, you may want to take a look at a project I started last year: URL Linkification - this picks out unlinked HTTP and FTP URLs from text which may already have some links.

That said, the following is a PHP function which uses a slightly modified version of the RFC-3986 "Absolute URI" regex to validate HTTP and FTP URLs (with this regex, the named host portion must not be empty). All the various components of the URI are isolated and captured into named groups, which allows for easy manipulation and validation of the parts within the program code:

function url_valid($url)
{
    if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
    if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
    if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
        /mx', $url, $m)) return FALSE;
    switch ($m['scheme'])
    {
    case 'https':
    case 'http':
        if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
        break;
    case 'ftps':
    case 'ftp':
        break;
    default:
        return FALSE;   // Unrecognised URI scheme. Default to FALSE.
    }
    // Validate host name conforms to DNS "dot-separated-parts".
    if (!empty($m['regname'])) // If host regname specified, check for DNS conformance.
    {
        if (!preg_match('/# HTTP DNS host name.
            ^                      # Anchor to beginning of string.
            (?!.{256})             # Overall host length is less than 256 chars.
            (?:                    # Group dot separated host part alternatives.
              [0-9A-Za-z]\.        # Either a single alphanum followed by dot
            |                      # or... part has more than one char (63 chars max).
              [0-9A-Za-z]          # Part first char is alphanum (no dash).
              [\-0-9A-Za-z]{0,61}  # Internal chars are alphanum plus dash.
              [0-9A-Za-z]          # Part last char is alphanum (no dash).
              \.                   # Each part followed by literal dot.
            )*                     # Zero or more parts before top level domain.
            (?:                    # Explicitly specify top level domains.
              com|edu|gov|int|mil|net|org|biz|
              info|name|pro|aero|coop|museum|
              asia|cat|jobs|mobi|tel|travel|
              [A-Za-z]{2})         # Country codes are exactly two alpha chars.
            $                      # Anchor to end of string.
            /ix', $m['host'])) return FALSE;
    }
    $m['url'] = $url;
    for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
    return $m; // return TRUE == array of useful named $matches plus the valid $url.
}

The first regex validates the string as an absolute (has a non-empty host portion) generic URI. A second regex is used to validate the (named) host portion (when it is not an IP literal or IPv4 address) with regard to the DNS lookup system (where each dot-separated subdomain is 63 chars or less consisting of digits, letters and dashes, with an overall length of at most 255 chars).
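The host-name rules just described can also be written without a regex. The following Python sketch is my own restatement of those rules, not a port of the PHP function: the name `looks_like_dns_host` is hypothetical, and unlike the PHP version it does not check the last part against a list of top-level domains, so it accepts a bare name such as `localhost`.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

def looks_like_dns_host(host):
    """Check the dot-separated-label rules: length limits and allowed characters."""
    if not host or len(host) > 255:
        return False
    for label in host.split("."):
        if not 1 <= len(label) <= 63:          # empty or over-long label
            return False
        if label[0] not in ALNUM or label[-1] not in ALNUM:
            return False                       # no leading or trailing dash
        if any(ch not in ALNUM and ch != "-" for ch in label):
            return False                       # only letters, digits and dashes
    return True

print(looks_like_dns_host("www.example.com"))
print(looks_like_dns_host("-bad.example.com"))
```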

Note that the structure of this function allows easy expansion to include other schemes.

Would this be in Perl by any chance?

Try:

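use strict;
use warnings;

my $url = "http://localhost/test";
# Negated character classes keep each capture from running past its delimiter:
# the scheme stops at ':' and the host stops at the first '/'.
if ($url =~ m{^([^:/]+)://([^/]+)/(.+)}) {
    my $protocol = $1;
    my $domain   = $2;   # includes any port, e.g. "localhost:55000"
    my $dir      = $3;   # everything after the first slash

    print "$protocol $domain $dir\n";
}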