I've been fiddling with TCP/IP networking in Gawk and am having a hard time figuring out why it behaves well with some sites but not with others. I've even tried using HTTP Live Headers in Windows to try and debug what's going on, but to no avail.
The sample Gawk code below (Version 3.1.5) will work fine for the site www.sobell.com but will hang on www.drudgereport.com.
BEGIN {
    print "Dumping HTML of www.sobell.com"
    server = "/inet/tcp/0/www.sobell.com/80"
    print "GET http://www.sobell.com" |& server
    while ((server |& getline) > 0)
        print $0
    close(server)

    print "Dumping HTML of www.drudgereport.com"
    server = "/inet/tcp/0/www.drudgereport.com/80"
    print "GET http://www.drudgereport.com" |& server
    while ((server |& getline) > 0)
        print $0
    close(server)
}
I appreciate any help! Thanks All.
Your code (and the gawk manual) uses the outdated HTTP/0.9 request syntax, which the second server apparently no longer supports. The important differences in HTTP/1.0 and later are:
- The lines must end with "\r\n" instead of plain UNIX newlines.
- You must end your request with an empty line.
- Add the protocol version (HTTP/1.0 or HTTP/1.1) to the end of the request line.
- The request line usually contains only the path, not the hostname; the hostname goes on a separate "Host: " header line, which HTTP/1.1 requires.
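To make the points above concrete, here is a minimal sketch that only builds the request text and prints it, without opening a connection, so the line endings can be checked offline. The function name build_request and the host www.example.com are made up for illustration, and I'm assuming any POSIX awk is available as awk.

```shell
#!/bin/sh
# build_request HOST PATH: print a minimal HTTP/1.1 request for PATH on HOST.
# (Hypothetical helper for illustration; it does not touch the network.)
build_request() {
    awk -v host="$1" -v path="$2" 'BEGIN {
        ORS = "\r\n"                   # every line ends in CR LF
        print "GET " path " HTTP/1.1"  # request line: method, path, version
        print "Host: " host            # hostname goes in its own header
        print ""                       # empty line ends the request
    }'
}
```

Running something like `build_request www.example.com / | od -c` should show each line ending in \r \n, followed by the final empty line. The same three print statements, redirected with |& to the /inet/tcp/... name, are what the gawk version sends over the wire.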
The following code works for me:
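BEGIN {
    ORS = "\r\n"    # HTTP lines end in CR LF (this also applies to the output printed below)
    server = "/inet/tcp/0/www.drudgereport.com/80"
    print "GET / HTTP/1.1" |& server
    print "Host: www.drudgereport.com" |& server
    # Ask the server to close the connection after the response; HTTP/1.1
    # keeps it open by default, which can stall the read loop below.
    print "Connection: close" |& server
    print "" |& server    # empty line ends the request
    while ((server |& getline) > 0)
        print $0
    close(server)
}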
You can find all the gory details in RFC 1945 (HTTP/1.0) and RFC 2616 (HTTP/1.1).