I am trying to figure out the best way in C# to search for URLs by extension once I have read a web page's contents into a string. I only want to extract URLs within the page that end in .html, .xhtml, or .edu; I don't care what the beginning looks like. Which is better for this, EndsWith or Regex?
So if my input looked like this:
string str = {var a,b=window.location.href.match(//webhp\?[^#]tune=[^#]/);if(a=b&&b.length>0?"http://www.google.com/logos/2011/lespaul.html"+b[
and I want to pull out http://www.google.com/logos/2011/lespaul.html and store that in an array.
You should use an HTML parser such as sharp-query or HTML Agility Pack, and never use regular expressions for parsing HTML; otherwise, as the author of this post says, some things might happen.
I could come up with this regex: http:\/\/(.*?)(.html|.xhtml|.edu)
Edit: Thanks to @Kakashi, a tighter version that escapes the dot and drops the capture groups: http:\/\/.*?\.(?:x?html|edu)
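To collect every match into an array, as the question asks, a pattern along these lines can be passed to Regex.Matches. This is only a minimal sketch under stated assumptions, not a definitive implementation: the names UrlExtractor and ExtractUrls are hypothetical, and the pattern is a variation on the one above that also accepts https, assumes a URL never contains whitespace, quotes, or angle brackets, and uses a lookahead so that the extension must sit at the very end of the URL.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Assumption: a URL is delimited by whitespace, a quote, an angle bracket,
    // or the end of the input. The lookahead rejects URLs that merely contain
    // ".edu" or ".html" somewhere in the middle.
    private static readonly Regex UrlPattern = new Regex(
        @"https?://[^\s""'<>]+\.(?:x?html|edu)(?=[\s""'<>]|$)",
        RegexOptions.IgnoreCase);

    // Returns every matching URL found in the text, in order of appearance.
    public static string[] ExtractUrls(string text)
    {
        return UrlPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string input = "if(a=b&&b.length>0?\"http://www.google.com/logos/2011/lespaul.html\"+b[";
        foreach (string url in ExtractUrls(input))
        {
            Console.WriteLine(url); // http://www.google.com/logos/2011/lespaul.html
        }
    }
}
```

Because the URL in the sample input lives inside a JavaScript string rather than in an attribute, the quote characters are what end the match here; a different delimiter set may be needed for other pages.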
Try this:
using System;
using System.Text.RegularExpressions;
var input = "string str = {var a,b=window.location.href.match(//webhp\\?[^#]tune=[^#]/);if(a=b&&b.length>0?\"http://www.google.com/logos/2011/lespaul.html";
var match = Regex.Match(input, @"https?:\/{2}[^\n]+\.(?:x?html|edu)");
Console.Write(match.Success ? match.Groups[0].Value : "Not found"); // http://www.google.com/logos/2011/lespaul.html
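On the original EndsWith-versus-Regex question: EndsWith can only test a string that has already been isolated, so on a whole page it still needs something to cut out the candidate URLs first. A rough sketch of combining the two is shown here; the names SuffixFilter and FindUrlsWithSuffix are hypothetical, and the candidate pattern assumes that a URL stops at whitespace, a quote, or an angle bracket.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SuffixFilter
{
    // The three endings named in the question.
    private static readonly string[] Suffixes = { ".html", ".xhtml", ".edu" };

    // Assumption: a candidate URL runs until whitespace, a quote, or an angle bracket.
    private static readonly Regex Candidate = new Regex(@"https?://[^\s""'<>]+");

    // Finds all candidate URLs, then keeps those whose text ends with a wanted suffix.
    public static string[] FindUrlsWithSuffix(string text)
    {
        return Candidate.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(url => Suffixes.Any(
                s => url.EndsWith(s, StringComparison.OrdinalIgnoreCase)))
            .ToArray();
    }

    public static void Main()
    {
        string input = "if(a=b&&b.length>0?\"http://www.google.com/logos/2011/lespaul.html\"+b[";
        Console.WriteLine(string.Join(Environment.NewLine, FindUrlsWithSuffix(input)));
    }
}
```

The regex stays simple because it only has to locate URLs, while the suffix list can be edited without touching the pattern.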