TCP Networking in Gawk Works for Some Addresses but not Others

Posted on 开发者 (https://www.devze.com), 2023-01-14 20:03. Source: the web.

I've been fiddling with TCP/IP networking in Gawk and am having a hard time figuring out why it behaves well with some sites but not with others. I've even tried using HTTP Live Headers in Windows to debug what's going on, but to no avail.

The sample Gawk code below (version 3.1.5) works fine for the site www.sobell.com but hangs on www.drudgereport.com.

BEGIN {
print "Dumping HTML of www.sobell.com"

server = "/inet/tcp/0/www.sobell.com/80"
print "GET http://www.sobell.com" |& server
while ((server |& getline) > 0)
    print $0
close(server)

print "Dumping HTML of www.drudgereport.com"

server = "/inet/tcp/0/www.drudgereport.com/80"
print "GET http://www.drudgereport.com" |& server
while ((server |& getline) > 0)
    print $0
close(server)

}

I appreciate any help! Thanks All.


Your code (and the gawk manual) uses the outdated HTTP/0.9 syntax. Apparently the second server no longer supports this. Important differences:

  • The lines must end with "\r\n" instead of plain UNIX newlines.
  • You must end your request with an empty line.
  • Add the protocol version (HTTP/1.0 or HTTP/1.1) to the end of the request line.
  • The request line normally carries only the path, not the hostname; the hostname goes on a separate "Host: " header line.

The following code works for me:

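BEGIN {
    ORS = "\r\n"    # every print now ends its line with CRLF
    server = "/inet/tcp/0/www.drudgereport.com/80"
    print "GET / HTTP/1.1" |& server               # request line with version
    print "Host: www.drudgereport.com" |& server   # hostname goes here
    print "" |& server                             # empty line ends the request
    while ((server |& getline) > 0)
        print $0
    close(server)
}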

You can find all the gory details in RFC 1945 (HTTP/1.0) and RFC 2616 (HTTP/1.1, since superseded by RFC 9110 and RFC 9112).
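
One possible refinement, offered as a sketch and not as part of the tested code above: with HTTP/1.1 the server is allowed to keep the connection open after it has sent the response, so the getline loop may sit waiting until the server times out. Asking for "Connection: close" should avoid that. The version below also writes the "\r\n" endings explicitly with printf instead of changing ORS, so the page printed to standard output keeps plain newlines. The function name build_request and the command-line handling are my own assumptions for illustration. Note too that many sites now answer plain port-80 requests with a redirect to HTTPS, which this sketch does not follow.

```awk
# Sketch: fetch one page through gawk's /inet/tcp special file.
# Hypothetical usage: gawk -f fetch.awk www.drudgereport.com /

# Build the request text. Every line ends in CRLF and an empty line
# ends the request. "Connection: close" is an addition that is not in
# the answer's code; it asks the server to hang up after responding.
function build_request(host, path,    req) {
    req = "GET " path " HTTP/1.1\r\n"
    req = req "Host: " host "\r\n"
    req = req "Connection: close\r\n"
    req = req "\r\n"
    return req
}

BEGIN {
    if (ARGC < 2) {
        # No host given: print a hint and skip the network access.
        print "usage: gawk -f fetch.awk HOST [PATH]" > "/dev/stderr"
    } else {
        host = ARGV[1]
        path = (ARGC > 2) ? ARGV[2] : "/"
        server = "/inet/tcp/0/" host "/80"

        # printf with "%s" sends the request exactly as built,
        # independent of the current ORS setting.
        printf "%s", build_request(host, path) |& server

        # Read the response (status line, headers, body) line by line
        # and strip the trailing CR that HTTP header lines carry.
        while ((server |& getline line) > 0) {
            sub(/\r$/, "", line)
            print line
        }
        close(server)
    }
}
```

Because the program consists only of a function and a BEGIN rule, gawk never tries to open the host name as an input file. The output still starts with the status line and the response headers; the HTML follows after the first empty line.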
