I need one or more regular expressions to match certain invalid URLs of a website: those that have uppercase letters before OR after a certain pattern.
These are the structure rules to match the invalid URLs:
- a defined website
- zero or more uppercase letters before the pattern (at least one if there are none after the pattern)
- a pattern
- zero or more uppercase letters after the pattern (at least one if there are none before the pattern)
To be explicit with examples:
http://website/uppeRcase/pattern/upperCase // match it, uppercase before and after pattern
http://otherweb/WhatevercAse/pattern/whatevercase // do not match, no website
http://website/lowercase/pattern/lowercase // do not match, no uppercase before or after pattern
http://website/lowercase/pattern/uppercasE // match it, uppercase after pattern
http://website/Uppercase/pattern/lowercase // match it, uppercase before pattern
http://website/WhatevercAse/asdasd/whatEveRcase // do not match it, no pattern
Thanks in advance for your help!
Mario
I'd advise against doing the two things you are describing with a regular expression in one step. Use a URL parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons. There can be some surprising stuff in the host portion of the URL that can throw you off; for instance, the hostname of
http://website@otherweb/uppeRcase/pattern/upperCase
is actually otherweb, and should be excluded, even though it begins with website. Similarly:
http://website/actual/path/component?uppeRcase/pattern/upperCase
should be excluded, even though the url has the pattern, surrounded by upper case path components, because the matching region is not part of the path.
http://website/uppe%52case/%70attern/upper%43ase
is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.
Once you've extracted and converted the escape sequences of just the path component, though, a regex is probably a great tool to use.
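As a rough illustration of that two-step approach, here is a minimal Python sketch using the standard library's urllib.parse. The names SITE, PATTERN and is_invalid are placeholders made up for this example, and it assumes the pattern is a whole path segment surrounded by slashes, as in the question's examples; adjust to whatever "website" and "pattern" really are.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical placeholders taken from the question's examples.
SITE = "website"
PATTERN = "pattern"

UPPERCASE = re.compile(r"[A-Z]")

def is_invalid(url):
    """Return True if the URL is on SITE and its decoded path has an
    uppercase letter before or after the /PATTERN/ segment."""
    parts = urlsplit(url)
    # .hostname drops any "user@" prefix and port, and is lowercased.
    if parts.hostname != SITE:
        return False
    # Only the path is inspected; the query string is a separate field.
    path = unquote(parts.path)
    before, sep, after = path.partition("/" + PATTERN + "/")
    if not sep:
        return False  # the pattern is not in the path at all
    return UPPERCASE.search(before + after) is not None

print(is_invalid("http://website/uppe%52case/%70attern/upper%43ase"))
print(is_invalid("http://website@otherweb/uppeRcase/pattern/upperCase"))
```

With this split, the three tricky URLs above are handled by the parser rather than by the regex, which only has to look for [A-Z] in an already decoded path.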
To match uppercase letters you simply need [A-Z]. Then build the rest of your rules around that. Without knowing exactly what you mean by "website" and "pattern", it is difficult to give better guidance.
This expression will match if there are uppercase characters both between "website" and "pattern" and after "pattern":
^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$
This expression will match on either uppercase case (before the pattern, after it, or both):
^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*|.*[A-Z]+.*/pattern/.*|.*/pattern/.*[A-Z]+.*)$
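A quick way to sanity-check the two expressions is to run them over the six example URLs from the question. The sketch below does that in Python; the names BOTH, EITHER and EXAMPLES are just labels chosen for this example, and "website" and "pattern" are kept as the literal placeholders used above.

```python
import re

# The two expressions from above, unchanged.
BOTH = re.compile(r"^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$")
EITHER = re.compile(
    r"^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*"
    r"|.*[A-Z]+.*/pattern/.*"
    r"|.*/pattern/.*[A-Z]+.*)$"
)

# (url, should it be matched according to the question?)
EXAMPLES = [
    ("http://website/uppeRcase/pattern/upperCase", True),
    ("http://otherweb/WhatevercAse/pattern/whatevercase", False),
    ("http://website/lowercase/pattern/lowercase", False),
    ("http://website/lowercase/pattern/uppercasE", True),
    ("http://website/Uppercase/pattern/lowercase", True),
    ("http://website/WhatevercAse/asdasd/whatEveRcase", False),
]

for url, expected in EXAMPLES:
    either = EITHER.match(url) is not None
    both = BOTH.match(url) is not None
    print(f"{url}  either={either}  both={both}  expected={expected}")
```

Note that these expressions run against the raw URL string, so they do not decode percent-escapes and they will also look inside the query string.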
UPDATE:
To @TokenMacGuy's point, RegEx parsing of URLs can be very tricky. If you want to break into parts and then validate, you can start with this expression which should match and group most* URLs.
(?<protocol>(http|ftp|https|ftps):\/\/)?(?<site>[\w\-_\.]+\.(?<tld>([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|arpa|asia|coop|info|jobs|mobi|museum|name|travel))+(?<port>:[0-9]+)?\/?)((?<resource>[\w\-\.,@^%:/~\+#]*[\w\-\@^%/~\+#])(?<queryString>(\?[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)+(&[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)*)?)?
*it worked in all my tests, but I can't claim I was exhaustive.