How would I go about checking if a URL exists using Ruby?
For example, for the URL
https://google.com
the result should be truthy, but for the URLs
https://no.such.domain
or
https://stackoverflow.com/no/such/path
the result should be falsey
Use the Net::HTTP library.
require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
At this point, res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:
do_something_with_it(url) if res.code == "200"
Note: To check an https-based URL, the use_ssl attribute must be set to true:
require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
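The two snippets above differ only in the use_ssl line, so the flag can be derived from the URL scheme instead of being hard-coded. A minimal sketch, assuming a hypothetical helper named build_request (it is not part of Net::HTTP); it only builds the objects and performs no network I/O:

```ruby
require "net/http"
require "uri"

# Hypothetical helper: returns a configured Net::HTTP object and the path to request.
def build_request(url_string)
  url = URI.parse(url_string)
  http = Net::HTTP.new(url.host, url.port)
  http.use_ssl = (url.scheme == "https")  # enable TLS only for https URLs
  path = url.path.empty? ? "/" : url.path # an empty path is rejected, so fall back to "/"
  [http, path]
end

http, path = build_request("https://www.google.com")
# res = http.request_head(path) # this line would perform the actual HEAD request
```

The fallback to "/" matters for URLs written without a trailing slash, where url.path is an empty string.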
Sorry for the late reply on this, but I think this deserves a better answer.
There are three ways to look at this question:
- Strict check if the URL exists
- Check if you are requesting the URL correctly
- Check if you can request it correctly and the server can answer it correctly
1. Strict check if the URL exists
While 200 means that the server answers for that URL (thus, the URL exists), answering with another status code doesn't mean that the URL does not exist. For example, 302 - redirected means that the URL exists and redirects to another one. While browsing, a 302 often behaves the same as a 200 for the end user. Another status code that can be returned for an existing URL is 500 - internal server error. After all, if the URL did not exist, why would the application server process your request instead of simply returning 404 - not found?
So there are actually only two cases in which a URL does not exist: when the server does not exist, or when the server exists but can't find the given URL path. Thus, the only way to check whether the URL exists is to check that the server answers and that the return code is not 404. The following code does just that.
require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  path = url.path.empty? ? '/' : url.path # plain Ruby; String#present? needs ActiveSupport
  res = req.request_head(path)
  res.code != "404" # false if returns 404 - not found
rescue SocketError, SystemCallError
  false # false if can't find or reach the server (a failed DNS lookup raises SocketError)
end
2. Check if you are requesting the URL correctly
However, most of the time we are not interested in whether a URL exists, but in whether we can access it. Fortunately, among the HTTP status code families there is the 4xx family, which stands for client errors (that is, an error on your side: you are not requesting the page correctly, don't have permission, and so on). This is a good family of errors to check to see whether you can access the page. From the wiki:
The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.
So the following code makes sure the URL exists and that you can access it:
require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  path = url.path.empty? ? '/' : url.path # plain Ruby; String#present? needs ActiveSupport
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection)
    url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
  else
    res.code[0] != "4" # false if http code starts with 4 - error on your side.
  end
rescue SocketError, SystemCallError
  false # false if can't find or reach the server (a failed DNS lookup raises SocketError)
end
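One caveat when following redirects this way: the Location header may hold a relative reference (for example /login), which has no host of its own to connect to. A possible sketch, assuming a hypothetical helper named redirect_target, that resolves the header value against the URL that was just requested:

```ruby
require "uri"

# Hypothetical helper: resolve a Location header value against the requested URL.
def redirect_target(requested_url, location)
  URI.join(requested_url, location).to_s
end

# A relative Location is resolved against the requested URL.
redirect_target("https://stackoverflow.com/questions/1", "/login")
# An absolute Location passes through unchanged.
redirect_target("http://example.com/a", "https://example.org/b")
```

Inside url_exist?, the recursive call could then take redirect_target(url_string, res['location']) instead of the raw header value.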
3. Check if you can request it correctly and the server can answer it correctly
Just as the 4xx family tells you whether you can access the URL, the 5xx family tells you whether the server had any problem answering your request. Errors in this family are most often due to problems on the server itself, and hopefully someone is working on solving them. If you need to be able to access the page and get a correct answer now, you should make sure the answer is not from the 4xx or 5xx family and, if you were redirected, that the redirected page answers correctly. So, much as in (2), you can simply use the following code:
require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  path = url.path.empty? ? '/' : url.path # plain Ruby; String#present? needs ActiveSupport
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection)
    url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
  else
    !%w[4 5].include?(res.code[0]) # Not from 4xx or 5xx families
  end
rescue SocketError, SystemCallError
  false # false if can't find or reach the server (a failed DNS lookup raises SocketError)
end
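The three checks differ only in which final status codes they accept, so that decision can be separated from the network code. A rough sketch, assuming a hypothetical helper status_ok? and level names (:exists, :accessible, :working) that are made up here for illustration:

```ruby
# Hypothetical helper: decide whether a final (non-redirect) status code passes a check.
#   :exists     - anything except 404            (check 1)
#   :accessible - anything outside 4xx           (check 2)
#   :working    - anything outside 4xx and 5xx   (check 3)
def status_ok?(code, level)
  code = code.to_s # Net::HTTPResponse#code is already a String; to_s also accepts Integers
  family = code[0]
  case level
  when :exists     then code != "404"
  when :accessible then family != "4"
  when :working    then !%w[4 5].include?(family)
  else raise ArgumentError, "unknown level: #{level.inspect}"
  end
end

status_ok?("500", :exists)     # the URL exists even though the server failed
status_ok?("403", :accessible) # a client error means we cannot access it
```

Keeping the policy in one place makes it easy to see that a 500 passes checks 1 and 2 but fails check 3, while a 403 passes only check 1.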
Net::HTTP works, but if you can work outside the stdlib, Faraday is better.
Faraday.head(the_url).status == 200
(200 is a success code, assuming that's what you meant by "exists".)
Simone's answer was very helpful to me.
Here is a version that returns a truthy value (the final URL) or false depending on URL validity, and which handles redirects:
require 'net/http'
require 'set'

def working_url?(url, max_redirects=6)
  response = nil
  seen = Set.new
  loop do
    url = URI.parse(url)
    break if seen.include? url.to_s
    break if seen.size > max_redirects
    seen.add(url.to_s)
    http = Net::HTTP.new(url.host, url.port)
    http.use_ssl = (url.scheme == 'https') # without this, https URLs are requested over plain HTTP
    response = http.request_head(url.path.empty? ? '/' : url.path)
    if response.kind_of?(Net::HTTPRedirection)
      url = response['location'] # follow the redirect on the next iteration
    else
      break
    end
  end
  response.kind_of?(Net::HTTPSuccess) && url.to_s
end