Why does this regex check return true for this string?

Developer (https://www.devze.com), 2023-02-09 09:27. Source: web
I need a regex that will determine if a string is a tweet URL. I've got this

Regexp.new(/http:|https:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i)

Why does it return true for the following?

"http://i.stack.imgur.com/QdOS0.jpg".match(Regexp.new(/http:|https:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i))? true : false
    => true


The http: alternative stands on its own, so the pattern matches any string that contains http:, including every URL that starts with it.
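
One way to confirm this is to look at the text the pattern actually matched. The sketch below uses a shortened pattern of the same shape as the one in the question (one twitter.com branch instead of six); the constant names are made up for illustration.

```ruby
# Shortened version of the question's pattern: same shape, one branch.
# The top-level | splits it into "http:" OR "https://twitter.com/.../status/...".
ORIGINAL_SHAPE = /http:|https:\/\/twitter\.com\/.*\/status\/.*/i

IMAGE_URL = "http://i.stack.imgur.com/QdOS0.jpg"

# MatchData#[0] is the portion of the string that was matched.
MATCHED_TEXT = IMAGE_URL.match(ORIGINAL_SHAPE)[0]
p MATCHED_TEXT   # prints "http:"
```

Only the first five characters are consumed, which shows that the twitter.com part of the pattern was never consulted.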

Try the following:

/https?:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i

The question mark will make the s optional, thus matching http or https.


Your regex could be abbreviated like this:

%r{^https?://(?:www\.|mobile\.)?twitter\.com/.*?/status(?:es)?/.*$}i

Explanation:

%r{                     start of regex literal
^                       start of line (\A anchors to the start of the string)
https?                  http or https
://                     ://
(?:                     start of non-capturing group
www\.|mobile\.          www. or mobile.
)?                      end of group, which is optional
twitter\.com/           twitter.com/
.*?                     any number of any character, non-greedy
/status                 /status
(?:es)?                 optional non-capturing group containing `es`
/.*                     / followed by any number of any character
$                       end of line (\z anchors to the end of the string)
}i                      end of regex literal, case-insensitive flag
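
A quick check of the abbreviated pattern, written here as a Ruby %r{} literal with (?:...) non-capturing groups. The constant and method names, and the sample user name, are hypothetical and only serve the illustration.

```ruby
# Abbreviated tweet-URL pattern: optional www./mobile. prefix,
# then twitter.com/<anything>/status or /statuses, then /<anything>.
TWEET_URL_PATTERN = %r{^https?://(?:www\.|mobile\.)?twitter\.com/.*?/status(?:es)?/.*$}i

# Returns true when the string matches the pattern, false otherwise.
def matches_tweet_pattern?(string)
  TWEET_URL_PATTERN.match?(string)
end

p matches_tweet_pattern?("https://twitter.com/someuser/status/12345")          # prints true
p matches_tweet_pattern?("http://mobile.twitter.com/someuser/statuses/12345")  # prints true
p matches_tweet_pattern?("http://i.stack.imgur.com/QdOS0.jpg")                 # prints false
```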


No need for regular expressions here (as usual).

require 'uri'
uri = URI.parse("http://www.twitter.com/status/12345")
p uri.host.split('.')[-2] == 'twitter' # prints true; checks only the host, not the path

More docs at: http://ruby-doc.org/stdlib/
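
The snippet above looks only at the host. If the path matters too, the same URI-based approach can be extended. This is a minimal sketch, assuming a tweet URL has the form <host>/<user>/status/<id> (or /statuses/) on one of the three hosts from the question; the method name, the host list constant, and that exact path shape are assumptions, not something the original answer specifies.

```ruby
require 'uri'

# Hosts taken from the regex in the question.
TWEET_HOSTS = %w[twitter.com www.twitter.com mobile.twitter.com].freeze

# Returns true for http(s) URLs on a known host whose path looks like
# /<user>/status/<id> or /<user>/statuses/<id> (assumed shape).
def tweet_uri?(string)
  uri = URI.parse(string)
  return false unless uri.is_a?(URI::HTTP)   # URI::HTTPS is a subclass of URI::HTTP
  return false unless TWEET_HOSTS.include?(uri.host.to_s.downcase)

  segments = uri.path.split('/').reject(&:empty?)
  segments.length >= 3 && %w[status statuses].include?(segments[1])
rescue URI::InvalidURIError
  false
end

p tweet_uri?("https://twitter.com/someuser/status/12345")  # prints true
p tweet_uri?("http://i.stack.imgur.com/QdOS0.jpg")         # prints false
p tweet_uri?("http://www.twitter.com/status/12345")        # prints false (no user segment)
```

The last URL is rejected because it has no user segment before status; the regex in the question does not match it either, since it also expects something between twitter.com/ and /status/.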


You should group your alternatives (the parts separated by |), like this:

(http:|https:)

Additionally, it wouldn't hurt to specify beginning and end of it:

^(http:|https:).*$
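
To illustrate what grouping changes, here is a small sketch comparing an ungrouped and a grouped alternation. Both patterns are simplified stand-ins (a single twitter.com branch), and the constant names are hypothetical.

```ruby
# Ungrouped: the | splits the whole pattern into
#   "^http:"   OR   "https://twitter.com/...$"
UNGROUPED = /^http:|https:\/\/twitter\.com\/.*$/

# Grouped: the | only chooses between the two schemes,
# and the rest of the pattern applies in both cases.
GROUPED = /^(http:|https:)\/\/twitter\.com\/.*$/

NON_TWEET_URL = "http://i.stack.imgur.com/QdOS0.jpg"
SAMPLE_TWEET  = "https://twitter.com/someuser/status/12345"

p UNGROUPED.match?(NON_TWEET_URL)  # prints true  (matched by "^http:" alone)
p GROUPED.match?(NON_TWEET_URL)    # prints false
p GROUPED.match?(SAMPLE_TWEET)     # prints true
```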


The start of your regex specifies an option of just 'http:', which naturally matches the URL you are testing. Depending on how strict you need your check to be, you could just remove the http/https parts from the start of the regex.


While many other answers show you a better regex, the reason is that /foo|bar/ matches either foo or bar, and what you wrote was /http:|.../, so any string containing http: matches.

See @giraff's answer for how you could have written the alternation to do what you expect, or @M42's or @Koraktor's answers for a better regexp.

And as posted in the comments, note that you can write a regex literal as %r{...} instead of /.../, which is nice when you want to use / characters in your regex without escaping them.
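
As a small illustration of the %r{...} form, the two literals below describe the same pattern (a single-branch stand-in for the full regex); only the delimiters and the escaping of / differ. The constant names and sample URL are made up for the example.

```ruby
# Same pattern written with both literal styles.
SLASH_LITERAL   = /https?:\/\/twitter\.com\/.*\/status\/.*/i   # every / must be escaped
PERCENT_LITERAL = %r{https?://twitter\.com/.*/status/.*}i      # / needs no escaping

EXAMPLE_TWEET = "https://twitter.com/someuser/status/12345"

p SLASH_LITERAL.match?(EXAMPLE_TWEET)    # prints true
p PERCENT_LITERAL.match?(EXAMPLE_TWEET)  # prints true
```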
