What is the regular expression to get a token of a URL?

Developer https://www.devze.com 2023-01-11 10:56 Source: Internet

Related topics: regex

Say I have strings like these:

bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff

What is the regular expression to match The_Token_I_Want, another_token, YET_ANOTHER_TOKEN?

Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######

This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:

#! /usr/bin/perl

$_ = "http://domain.com/133742/The_Token_I_Want.zip";    
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
  print "$6\n";
}
else {
  print "no match\n";
}

Output:

$ ./prog.pl
The_Token_I_Want

UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.

#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
  boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
                     "/([^.]+)"
                   //  ####### I CAN HAZ HASHDERLINE PLZ
                     "[^?#]*)(\\?([^#]*))?(#(.*))?");

  const char * const urls[] = {
    "http://domain.com/133742/The_Token_I_Want.zip",
    "http://domain.com/12345/another_token.zip",
    "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
  };

  BOOST_FOREACH(const char *url, urls) {
    std::cout << url << ":\n";

    std::string t;
    boost::cmatch m;
    if (boost::regex_match(url, m, token))
      t = m[6];
    else
      t = "<no match>";

    std::cout << "  - " << t << '\n';
  }

  return 0;
}

Output:

http://domain.com/133742/The_Token_I_Want.zip:
  - The_Token_I_Want
http://domain.com/12345/another_token.zip:
  - another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
  - YET_ANOTHER_TOKEN
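
If Boost is not an option, the same pattern should also work with C++11's std::regex, and a raw string literal avoids having to double the backslash. This is only a sketch under that assumption; the function name token_from_url is made up for illustration and does not come from the answer above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: for URLs shaped like the examples, returns the last
// path segment up to its first dot, or an empty string if there is no match.
std::string token_from_url(const std::string& url)
{
  // Same pattern as in the Boost program. R"re(...)re" is a raw string
  // literal, so \? is written exactly as in the regex, without doubling.
  static const std::regex token(
      R"re(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?)re");

  std::smatch m;
  if (std::regex_match(url, m, token))
    return m[6].str();  // group 6 is the "hashderlined" subexpression
  return std::string();
}
```

Calling token_from_url("http://domain.com/12345/another_token.zip") should give another_token. Note that std::regex_match needs the whole string to be a URL, so the href value has to be pulled out of the surrounding HTML first.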

/a href="http:\/\/domain\.com\/[0-9]+\/([a-zA-Z_]+)\.zip"/

You might want to add more characters to [a-zA-Z_]+, such as digits or hyphens, if your tokens can contain them.
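
Since the input lines carry surrounding HTML, this pattern has to be used as a search rather than a full match. Here is a rough C++ sketch of that idea. Using std::regex is an assumption (the answer does not name an engine), the helper name token_from_html is hypothetical, and the literal dots are escaped so that `.` cannot match an arbitrary character.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first matching anchor in a line of HTML
// and returns the captured token, or an empty string if nothing matches.
std::string token_from_html(const std::string& html)
{
  // [0-9]+ is the numeric directory; the capture group is the token itself.
  // No '/' delimiters are used here, so the slashes need no escaping.
  static const std::regex link(
      R"re(a href="http://domain\.com/[0-9]+/([a-zA-Z_]+)\.zip")re");

  std::smatch m;
  if (std::regex_search(html, m, link))
    return m[1].str();
  return std::string();
}
```

Passing one of the sample lines from the question should return just the token, and a line without a matching link returns an empty string.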

You can use:

(http|ftp)://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+

([[:alnum:]._-]+) is the group that captures the token; in your example its value will be The_Token_I_Want. To access it, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second.
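
To check the group numbering, here is a small sketch with std::regex, whose default ECMAScript grammar also accepts POSIX classes such as [[:alnum:]] inside brackets. It assumes the scheme is matched exactly once and that the dot before the extension is escaped. The name second_group is only illustrative.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 2 (the file name without its
// extension), or an empty string when the text contains no matching URL.
std::string second_group(const std::string& text)
{
  // Group 1 is the scheme (http|ftp); group 2 is the part of the last
  // path segment that precedes its final dot.
  static const std::regex url(
      R"re((http|ftp)://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+)re");

  std::smatch m;
  if (std::regex_search(text, m, url))
    return m[2].str();
  return std::string();
}
```

Because group 2 is greedy and allows dots, a name such as archive.tar.gz would come back as archive.tar, which may or may not be what you want.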

Try this:

/(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+\/([^\/]+)\/([^\/.]+)/i

/\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+\/([^\/]+)\/([^\/.]+)/i

First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.

Then:

The glib answer would be: