
Reuse a Client TCP Socket for Multiple HTTP Connections

https://www.devze.com 2023-02-18 02:38 Source: Internet

Greetings all.

I am writing an ANSI C multi-threaded web crawler (HTTP/1.1 compatible) on Linux 2.6.29-3.ydl61.3 and have progressed fairly well. I have thousands of domains in a MySQL database to collect pages from. I can open any or all of the domains in the crawler in keep-alive mode, as desired. I use POSIX threading and there are no contentions or data races whatsoever.

While the target servers seem willing to let me issue multiple concurrent or sequential page requests on each server socket (each server returns 'Connection: Keep-Alive' as expected), I cannot actually do so. I can only fetch one page per socket connection: I can write a typical HTTP GET request to the socket via a file descriptor and read back the response, but immediately after that I can only write to the fd and can no longer read from it. So although I have multiple URLs per domain (sometimes hundreds), it seems I have to keep recreating socket connections to the same server for every write/read pair (extremely wasteful of memory, and slow) rather than create a single client TCP connection and keep reusing the fd/socket until I am done with the domain.

See below a partial output of 'netstat --inet -a' (note that, undesirably, I have multiple local socket connections to the same domain; these are not concurrent per domain):

tcp 0 0 gcell1:38614  x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell1:34678  x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:34768 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:56085 www.hihostels.com:http      CLOSE_WAIT
tcp 0 0 gcell11:34661 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:34785 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:46660 67.225.194.54:http          CLOSE_WAIT
tcp 0 0 gcell11:34697 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:37510 www.kenic.or.ke:http        CLOSE_WAIT
tcp 0 0 gcell11:37516 www.kenic.or.ke:http        CLOSE_WAIT
tcp 0 0 gcell11:34710 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:34711 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:46677 67.225.194.54:http          CLOSE_WAIT
tcp 0 0 gcell11:56513 www.kenic.or.ke:http        CLOSE_WAIT
tcp 0 0 gcell11:57560 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:46634 67.225.194.54:http          CLOSE_WAIT
tcp 0 0 gcell11:46607 67.225.194.54:http          CLOSE_WAIT
tcp 0 0 gcell11:46666 67.225.194.54:http          CLOSE_WAIT
tcp 0 0 gcell11:37526 www.kenic.or.ke:http        CLOSE_WAIT
tcp 0 0 gcell11:46673 67.225.194.54:http          CLOSE_WAIT
tcp 0 0 gcell11:34736 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:57557 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:56395 www.kenic.or.ke:http        CLOSE_WAIT
tcp 0 0 gcell11:34714 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:34669 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:34767 x2web02.myhosting.com:http  CLOSE_WAIT
tcp 0 0 gcell11:43381 ip-72-167-251-99.ip.se:http CLOSE_WAIT

The client socket is created as below (partial code only):

if((http_socket_fd = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) != SKMG_FAILURE) //typical
...
/* fcntl takes a command, not a socket flag: set O_NONBLOCK via F_SETFL
   (SOCK_NONBLOCK is a socket(2) type flag, not an fcntl command) */
fcntl(http_socket_fd, F_SETFL, fcntl(http_socket_fd, F_GETFL, 0) | O_NONBLOCK); //set to non-block
...
setsockopt(http_socket_fd, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen); //local TCP keep-alive used
...
while(connect(http_socket_fd, (struct sockaddr *)&http_name, sizeof(struct sockaddr_in)) == (-1))
...
return http_socket_fd;

After this I just use write/read on the fd, and it works perfectly, but for only one round trip.

1) How can I reuse http_socket_fd for each HTTP GET write/read per domain without needing to create a new local TCP socket for each URL? Merely passing http_socket_fd to every page-fetch call per domain is exactly what has failed to work. [CRITICAL]
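For what it is worth, the usual reason a keep-alive connection "dies" after one request is that the client reads until EOF instead of reading exactly one response. A minimal sketch of reusing one connected fd for several GETs is below; it assumes each response carries a Content-Length header (chunked transfer-coding is deliberately not handled), and the names `fetch_one`, `read_n`, and `parse_content_length` are illustrative, not part of the original crawler.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>
#include <assert.h>

/* Find the Content-Length header (case-insensitively) in a raw HTTP
 * header block; returns -1 if the header is absent. */
static long parse_content_length(const char *headers)
{
    for (const char *p = headers; *p; p++) {
        if (strncasecmp(p, "Content-Length:", 15) == 0)
            return strtol(p + 15, NULL, 10);
    }
    return -1;
}

/* Read exactly n bytes: on a keep-alive connection we must NOT read
 * until EOF, because the server keeps the socket open. */
static ssize_t read_n(int fd, char *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r <= 0) return -1;   /* peer closed or error */
        got += (size_t)r;
    }
    return (ssize_t)got;
}

/* Issue one GET on an already-connected fd and consume exactly one
 * response, leaving the connection open for the next request. */
static int fetch_one(int fd, const char *host, const char *path)
{
    char req[1024], hdr[8192];
    int n = snprintf(req, sizeof req,
                     "GET %s HTTP/1.1\r\nHost: %s\r\n"
                     "Connection: keep-alive\r\n\r\n", path, host);
    if (write(fd, req, (size_t)n) != n) return -1;

    /* Read byte-by-byte until the blank line ending the headers
     * (inefficient but simple; a real crawler would buffer). */
    size_t len = 0;
    while (len < sizeof hdr - 1) {
        if (read_n(fd, hdr + len, 1) < 0) return -1;
        len++;
        hdr[len] = '\0';
        if (len >= 4 && memcmp(hdr + len - 4, "\r\n\r\n", 4) == 0) break;
    }

    long body = parse_content_length(hdr);
    if (body < 0) return -1;  /* unknown framing: give up on reuse */
    char *buf = malloc((size_t)body);
    if (!buf || read_n(fd, buf, (size_t)body) < 0) { free(buf); return -1; }
    /* ... process buf ... */
    free(buf);
    return 0;  /* fd is still usable for the next GET */
}
```

The key point is the framing: after `fetch_one` returns, the next request can be written to the same fd because exactly one response, and no more, was consumed.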

2) How can I make asynchronous requests to these servers under this one-thread-per-socket-per-domain paradigm? I run 4 concurrent threads (my server is dual-threaded), i.e. 4 different concurrent domain fetches. [NON-CRITICAL]


The usual practice is to create one client socket per connection. It is also a bad idea to share sockets among threads.

Instead of writing your own HTTP client, have you considered using a library like libcurl, which provides many advanced features? The libcurl site has a sample program that downloads content using multiple threads. Also have a look at ZeroMQ, a high-performance messaging framework: a ZeroMQ socket can connect to multiple servers and download data efficiently (see The Guide).

