In my C# program I wrote a Google Search Function, which works by fetching the source from each page and getting the URLs via regex.
My current regex is:
(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)
This works well at the moment, but I get URLs such as http://www.example.com/forums/arcade.php?efdf=332
In this case I just want the URL without the ?efdf=332 at the end.
So how should I change the regex?
http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+
does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?, because ? is no longer in the character class.
In C#:
Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
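As a rough sketch of how that pattern could be applied to fetched page source: the class name UrlExtractor and the method ExtractUrls below are made up for illustration and are not part of the original program.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class UrlExtractor
{
    // Same pattern as above; '?' is not in the character class,
    // so each match ends just before any query string.
    private static readonly Regex UrlRegex = new Regex(
        @"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Returns every http:// URL found in the given text, without query strings.
    public static List<string> ExtractUrls(string pageSource)
    {
        var urls = new List<string>();
        foreach (Match match in UrlRegex.Matches(pageSource))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

Calling UrlExtractor.ExtractUrls on text containing http://www.example.com/forums/arcade.php?efdf=332 should yield the URL up to arcade.php only.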
That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto, etc.?)
You can use the Uri class to access the various parts of the URL and either remove the query string from the end or concatenate the parts you want.
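A minimal sketch of the Uri approach, assuming the input is already an absolute URL: UrlTrimmer and StripQuery are illustrative names, while Uri.TryCreate and Uri.GetLeftPart are the standard System.Uri members.

```csharp
using System;

// Hypothetical helper, named here only for illustration.
public static class UrlTrimmer
{
    // Returns scheme, host, and path, dropping the query string and fragment.
    // Returns null if the input is not a valid absolute URL.
    public static string StripQuery(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return null;
        }
        return uri.GetLeftPart(UriPartial.Path);
    }
}
```

For example, UrlTrimmer.StripQuery("http://www.example.com/forums/arcade.php?efdf=332") gives back the URL without ?efdf=332. This also works for https and ftp URLs, which the http-only regex would miss.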