I have a list of approximately 4300 URLs, all very similar. It is likely that a few of them have been removed and I wish to identify which ones are no longer valid. I'm not interested in the content (at this point in time), only whether, if used in the real world, they would currently return valid content (HTTP 200) or no longer exist (HTTP 404). Essentially, I'm looking for a URL ping service. This is a one-off exercise.
If there aren't any existing tools specifically for this purpose, I'm very comfortable in Java and could code my own solution. However, I don't want to reinvent the wheel, and I'm not sure how best to do this without it looking like a denial-of-service attack. Would it be acceptable to hit each URL in turn, one immediately after the other (so no concurrent requests)? I'm very conscious of not putting undue strain on the target server.
Many thanks for any ideas or suggestions.
wget conveniently returns exit code 0 for a 200 response and a non-zero code for a 404, so the following would work:
for i in $(cat listOfUrls.txt); do
    # exit code 0 means wget got a successful response; anything else counts as bad
    wget --quiet "$i" && echo "$i" >> goodUrls.txt || echo "$i" >> badUrls.txt
done
or some close variant.
Consider:
- sleeping for, say, 1s between requests
- randomising listOfUrls.txt using, say, sort -R, to spread multiple requests to the same server over time (both ideas are folded into the sketch below)
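For example, a variant combining both suggestions might look something like this. It is only a sketch: --spider makes wget check the URL without downloading the body, and sort -R assumes GNU sort.

# shuffle the list so consecutive hits rarely land on the same server
sort -R listOfUrls.txt | while read -r url; do
    if wget --quiet --spider "$url"; then
        echo "$url" >> goodUrls.txt
    else
        echo "$url" >> badUrls.txt
    fi
    sleep 1   # pause between requests to keep the load negligible
done

At roughly one request per second, 4300 URLs finish in a little over an hour, which should be negligible load for any server.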
There is no 100% solution for this issue. For example, if the response status is determined on the PHP side, the server will usually send the content along with the status, no matter which request headers you send.
Still, you could play with the Range request header to ask for only the first bytes of the content, but this must be supported by the server backend.
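As a rough illustration of the Range idea, curl can request just the first byte and print only the status code. This is a sketch: a server that honours the range typically answers 206 (Partial Content) rather than 200, and one that ignores it simply sends the full body, but a 404 shows up as 404 either way.

# ask for only the first byte; -w prints the HTTP status, the body is discarded
while read -r url; do
    code=$(curl -s -o /dev/null -w '%{http_code}' -H 'Range: bytes=0-0' "$url")
    echo "$code $url"
    sleep 1
done < listOfUrls.txt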