Fastest way to ping thousands of websites using PHP

Source: https://www.devze.com (reposted from the web), 2023-01-30 07:22

I'm currently pinging URLs using CURL + PHP. But in my script, a request is sent, then it waits until the response comes, then another request is sent, and so on. If each response takes ~3s to come, pinging 10k links takes more than 8 hours!

Is there a way to send multiple requests at once, like some kind of multi-threading?

Thank you.


Use the curl_multi_* functions available in curl. See http://www.php.net/manual/en/ref.curl.php

You must group the URLs into smaller sets: adding all 10k links at once is not likely to work. So create a loop around the following code and put a subset of the URLs (like 100) in the $urls variable.

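$all = array();
$handle = curl_multi_init();
foreach ($urls as $url) {
    $all[$url] = curl_init($url);
    // Return the response as a string so curl_multi_getcontent() can read it
    curl_setopt($all[$url], CURLOPT_RETURNTRANSFER, true);
    // Set any other curl options for $all[$url] here
    curl_multi_add_handle($handle, $all[$url]);
}
$running = 0;
do {
    curl_multi_exec($handle, $running);
    // Wait for socket activity instead of spinning at 100% CPU
    if ($running > 0) {
        curl_multi_select($handle, 1.0);
    }
} while ($running > 0);
foreach ($all as $url => $curl) {
    $content = curl_multi_getcontent($curl);
    // do something with $content
    curl_multi_remove_handle($handle, $curl);
}
curl_multi_close($handle);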
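
The "loop around the code" mentioned above could be sketched roughly as follows. This is only an illustration: the function names ping_batch() and ping_all(), the batch size of 100, and the 10-second timeout are assumptions, not part of the original answer. The batch function is passed in as a callable so it can be swapped out.

```php
<?php
// Hypothetical helper: fetch one batch of URLs concurrently with curl_multi
// and return the HTTP status code per URL (0 if no HTTP response arrived).
function ping_batch(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $handles[$url] = curl_init($url);
        curl_setopt($handles[$url], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($handles[$url], CURLOPT_TIMEOUT, 10); // assumed timeout
        curl_multi_add_handle($multi, $handles[$url]);
    }
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // sleep until there is activity
        }
    } while ($running > 0 && $status === CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $curl) {
        $codes[$url] = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $curl);
    }
    curl_multi_close($multi);
    return $codes;
}

// Hypothetical driver: split the full list into batches and run them one
// batch after another, so only $batchSize transfers are open at a time.
function ping_all(array $urls, int $batchSize = 100, ?callable $pingBatch = null): array
{
    $pingBatch = $pingBatch ?? 'ping_batch';
    $codes = [];
    foreach (array_chunk($urls, $batchSize) as $batch) {
        $codes += $pingBatch($batch);
    }
    return $codes;
}
```

Calling ping_all($urls) with the 10k links would then process them 100 at a time; the right batch size depends on the server and network, so treat 100 as a starting point.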


First off, I would like to point out that this is not a basic task you can run on just any shared hosting provider. You would most likely get banned.

So I assume you are able to compile software (on a VPS?) and start long-running processes in the background (using the PHP CLI). I would use Redis (I liked Predis as a PHP client library very much) to push messages onto a list. (P.S.: I would prefer to write this in node.js or Python, although the explanation below works for PHP, because I think this task can be coded pretty fast in those languages. I am going to try to write it and post the code on GitHub later.)

Redis:

Redis is an advanced key-value store. It is similar to memcached but the dataset is not volatile, and values can be strings, exactly like in memcached, but also lists, sets, and ordered sets. All these data types can be manipulated with atomic operations to push/pop elements, add/remove elements, perform server-side union, intersection, difference between sets, and so forth. Redis supports different kinds of sorting abilities.

Then start a couple of worker processes which take messages from the list (blocking if none are available).

Blpop:

This is where Redis gets really interesting. BLPOP and BRPOP are the blocking equivalents of the LPOP and RPOP commands. If the queue for any of the keys they specify has an item in it, that item will be popped and returned. If it doesn't, the Redis client will block until a key becomes available (or the timeout expires - specify 0 for an unlimited timeout).
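
A worker built on BLPOP might look roughly like the sketch below. It is written against an assumed client object with a blpop($keys, $timeout) method that returns [key, value] for a popped item and an empty result (null or an empty array) when the timeout expires; Predis is expected to behave this way, but check the client you actually use. The function name run_worker(), the list key, and the 5-second timeout are hypothetical.

```php
<?php
// Hypothetical worker loop: pop URLs from a Redis list until the queue has
// been empty for $timeoutSeconds, passing each URL to $handleUrl.
// $queue is ASSUMED to expose blpop(array $keys, int $timeout) and to return
// [key, value] for an item, or null / an empty array on timeout.
function run_worker(object $queue, string $listKey, callable $handleUrl, int $timeoutSeconds = 5): int
{
    $processed = 0;
    while (true) {
        $item = $queue->blpop([$listKey], $timeoutSeconds);
        if (empty($item)) {
            break; // nothing arrived before the timeout: assume the work is done
        }
        $url = $item[1]; // $item[0] is the name of the list the value came from
        $handleUrl($url);
        $processed++;
    }
    return $processed;
}
```

With a timeout of 0 the worker would block indefinitely, as the quoted description says, which suits a long-running daemon; a finite timeout lets a batch job exit once the list is drained.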

Curl is not exactly pinging (ICMP Echo), and I guess some servers could block ICMP requests for security reasons. I would first try to ping the host (using the nmap snippet part) and fall back to curl if the ping fails, because pinging is faster than using curl.

Libcurl:

A free client-side URL transfer library, supporting FTP, FTPS, Gopher (protocol), HTTP, HTTPS, SCP, SFTP, TFTP, TELNET, DICT, FILE, LDAP, LDAPS, IMAP, POP3, SMTP and RTSP (the last four only in version 7.20.0, released 9 February 2010, and later)

Ping:

Ping is a computer network administration utility used to test the reachability of a host on an Internet Protocol (IP) network and to measure the round-trip time for messages sent from the originating host to a destination computer. The name comes from active sonar terminology. Ping operates by sending Internet Control Message Protocol (ICMP) echo request packets to the target host and waiting for an ICMP response.

When you do use curl, send a HEAD request and retrieve only the headers to check whether the host is up. Otherwise you would also download the content of the URL, which takes time and costs bandwidth.

HEAD:

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
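
With PHP's curl extension, a HEAD-only check could be sketched as below. The helper names, the 10-second timeouts, and the rule that any 2xx or 3xx status counts as "up" are assumptions made for this example. Note that some servers answer HEAD differently from GET (or reject it), so a fallback to GET may be needed in practice.

```php
<?php
// Hypothetical helper: curl options for a HEAD-only availability check.
function head_check_options(int $timeoutSeconds = 10): array
{
    return [
        CURLOPT_NOBODY         => true, // send HEAD, so no body is transferred
        CURLOPT_RETURNTRANSFER => true, // do not echo anything to stdout
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final URL
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ];
}

// Assumed policy: 2xx and 3xx mean "up"; 0 means no HTTP response at all.
function status_means_up(int $statusCode): bool
{
    return $statusCode >= 200 && $statusCode < 400;
}

// Single-URL usage; the same options can be applied to handles that are
// added to a curl_multi handle.
function head_status(string $url): int
{
    $curl = curl_init($url);
    curl_setopt_array($curl, head_check_options());
    curl_exec($curl);
    return (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}
```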

Then each worker process should use curl_multi to get some concurrency within each process. I think this link might provide a good implementation of this (except that it does not do HEAD requests).


You can either fork your php process using pcntl_fork or look into curl's built-in multi-threading. https://web.archive.org/web/20091014034235/http://www.ibuildings.co.uk/blog/archives/811-Multithreading-in-PHP-with-CURL.html


PHP doesn't have true multi-thread capabilities.

However, you could always make your CURL requests asynchronously.

This would allow you to fire off batches of pings instead of one at a time.

Reference: How do I make an asynchronous GET request in PHP?

Edit: Just keep in mind you're going to have to make your PHP script wait until all responses come back before terminating.

  • Christian


curl has the "multi request" facility which is essentially a way of doing threaded requests. Study the example on this page: http://www.php.net/manual/en/function.curl-multi-exec.php


You can use the PHP exec() function to execute unix commands like wget to accomplish this.

exec('wget -O - http://example.com/url/to_ping > /dev/null 2>&1 &');

It's by no means an ideal solution, but it does get the job done: by sending the output to /dev/null and running the command in the background, you can move on to the next "ping" without having to wait for the response.

Note: Some servers have exec() disabled for security purposes.


I would use system() and execute the ping script as a new process. Or multiple processes.

You can make a centralized queue with all the addresses to ping, then kick off some ping scripts to work through it.

Just note:

If a program is started with this function, in order for it to continue running in the background, the output of the program must be redirected to a file or another output stream. Failing to do so will cause PHP to hang until the execution of the program ends.


To handle this kind of task, try out I/O multiplexing strategies. In a nutshell, the idea is that you create a bunch of sockets, hand them to your OS (say, using epoll on Linux or kqueue on FreeBSD) and sleep until an event occurs on some of the sockets. Your OS's kernel can handle hundreds or even thousands of sockets in parallel in a single process.

You can not only handle TCP sockets but also deal with timers / file descriptors in a similar fashion in parallel.

Back to PHP, check out something like https://github.com/reactphp/event-loop which exposes a good API and hides lots of low-level details.
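
To make the idea concrete without any library, here is a minimal sketch using PHP's built-in stream functions. It relies on stream_select() (select-style polling) rather than epoll or kqueue, and it is not ReactPHP's API. It only checks whether a TCP connection can be established, not whether the site answers HTTP. The function name check_reachable() and the "tcp://host:port" target format are assumptions for the example.

```php
<?php
// Hypothetical helper: start many TCP connections at once, then wait on all
// of them together with stream_select(). Returns [target => bool reachable].
function check_reachable(array $targets, float $timeoutSeconds = 5.0): array
{
    $targets = array_values(array_unique($targets));
    $results = array_fill_keys($targets, false);
    $pending = [];
    foreach ($targets as $target) {
        // Begin connecting without waiting for the handshake to finish.
        $socket = @stream_socket_client(
            $target, $errno, $errstr, $timeoutSeconds,
            STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT
        );
        if ($socket !== false) {
            stream_set_blocking($socket, false);
            $pending[$target] = $socket;
        }
    }

    $deadline = microtime(true) + $timeoutSeconds;
    while ($pending && ($remaining = $deadline - microtime(true)) > 0) {
        $read = null;
        $write = array_values($pending); // writable means the connect finished
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        if (stream_select($read, $write, $except, $sec, $usec) === false) {
            break;
        }
        foreach ($write as $socket) {
            $target = array_search($socket, $pending, true);
            // A connected socket has a peer name; a failed connect does not.
            $results[$target] = stream_socket_get_name($socket, true) !== false;
            fclose($socket);
            unset($pending[$target]);
        }
    }
    foreach ($pending as $socket) {
        fclose($socket); // still unconnected at the deadline: treat as down
    }
    return $results;
}
```

An event-loop library takes over the bookkeeping shown here (pending sockets, deadlines, callbacks), which is why it is the more comfortable choice for anything beyond a simple check.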


Run multiple php processes.

Process 1: pings sites 1-1000

Process 2: pings sites 1001-2000

...
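
One way to split the list between processes is sketched below, assuming every process loads the same URL list and is told its own 0-based index and the total number of workers (for example as command-line arguments). The function name urls_for_worker() is hypothetical.

```php
<?php
// Hypothetical helper: return the slice of $urls that worker $index
// (0-based) out of $count workers should handle. $count must be at least 1.
function urls_for_worker(array $urls, int $index, int $count): array
{
    // Round up so the last worker takes the remainder instead of dropping it.
    $perWorker = (int) ceil(count($urls) / $count);
    return array_slice($urls, $index * $perWorker, $perWorker);
}
```

With 10,000 URLs and 10 workers, worker 0 gets sites 1-1000, worker 1 gets sites 1001-2000, and so on; each process can then run its slice through curl_multi independently.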
