开发者

Import from web - set user agent in Mathematica

开发者 https://www.devze.com 2023-03-07 11:24 出处:网络
when I connect to my site with Mathermatica (Import[\"mysite\",\"Data\"]) and look at my Apache log I see:

when I connect to my site with Mathermatica (Import["mysite","Data"]) and look at my Apache log I see:

99.XXX.XXX.XXX - - [22/May/2011:19:36:2开发者_运维知识库8 +0200] "GET / HTTP/1.1" 200 6268 "-" "Mathematica/8.0.1.0.0 PM/1.3.1"

Could I set it to be something like this (when I connects with real browser):

99.XXX.XXX.XXX - - [22/May/2011:19:46:17 +0200] "GET /favicon.ico HTTP/1.1" 404 183 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"


As far as I know you can't change the user agent string in Mathematica. I once used a proxy server (CNTLM) to get Mathematica to talk with a firewall which used NTLM authentication (which Mathematica doesn't support). CNTLM also allows you to set the user agent string.

You can find it at http://cntlm.sourceforge.net/. Basically, you set-up this proxy server to run on your own machine and set its port number and ip-address in the Mathematica network settings. The proxy adds user agent stuff and handles the NTLM authentication. Not sure how it works if you don't have a NTLM firewall. There are other free proxies around that might work for you.

EDIT The Squid http proxy seems to do what you want. It has the request_header_replace configuration directive which allows you to change the contents of request headers.


Here is a way to use the Apache HTTP client through JLink:

Needs["JLink`"]

ClearAll@urlString
urlString[userAgent_String, url_String] :=
  JavaBlock@Module[{http, get}
  , http = JavaNew["org.apache.commons.httpclient.HttpClient"]
  ; http@getParams[]@setParameter["http.useragent", MakeJavaObject@userAgent]
  ; get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url]
  ; http@executeMethod[get]
  ; get@getResponseBodyAsString[]
  ]

You can use this function as follows:

$userAgent =
  "Mozilla/5.0 (X11;Linux i686) AppleWebKit/534.24 (KHTML,like Gecko) Chrome/11.0.696.68 Safari/534.24";

urlString[$userAgent, "http://www.htttools.com:8080/"]

You can feed the result to ImportString if desired:

ImportString[urlString[$userAgent, "mysite"], "Data"]

A streaming approach would be possible using more elaborate code, but the string-based approach taken above is probably good enough unless the target web resource is very large.

I tried this code in Mathematica 7 and 8, and I expect that it works in v6 as well. Beware that there is no guarantee that Mathematica will always include the Apache HTTP client in future releases.

How It Works

Despite being expressed in Mathematica, the solution is essentially implemented in Java. Mathematica ships with a Java runtime environment built-in and the bridge between Mathematica and Java is a component called JLink.

As is typical of such cross-technology solutions, there is a fair amount of complexity even when there is not much code. It is beyond the scope of this answer to discuss how the code works in detail, but a few items will be emphasized as suggestions for further reading.

The code uses the Apache HTTP client. This Java library was chosen because it ships as an unadvertised part of the standard Mathematica distribution -- and it also happens to be the one that Import appears to use internally.

The whole body of urlString is wrapped in JavaBlock. This ensures that any Java objects that are created over the course of operation are properly released by co-ordinating the activities of the Java and Mathematica memory managers.

JavaNew is used to create the relevant Apache HTTP client objects, HttpClient and GetMethod. Java expressions like http.getParams() are expressed in JLink as http@getParams[]. The Java classes and methods are documented in the Apache HTTP client documentation.

The use of MakeJavaObject is somewhat unusual. It is required in this case as a Mathematica string is being passed as an argument where a Java Object is expected. If a Java String was expected, JLink would automatically create one. But JLink is unable to make this inference when Object is expected, so MakeJavaObject is used to give JLink a hint.

What about URLTools?

Incidentally, the first thing I tried to answer this question was to use Utilities`URLTools`FetchURL. It looked very promising since it takes an option called "RequestHeaderFields". Alas, this did not work because the present implementation of that function uses that option only for HTTP POST verbs -- not GET. Perhaps some future version of Mathematica will support the option for GET.


I'm extremely lazy and curl is more flexible in less code than J/Link, without the object management issues. This is an example of posting data (userPass) to a url and retrieving the result in JSON format.

Import["!curl -A Mozilla/4.0 --data " <> userPass <> " " <> url, "JSON"]

I isolate this kind of thing in an impure function (unless it is pure) so I know it's tainted, but any web access is that way.

Because I use a pipe, MMA cannot deduce the type of file. ref/Import mentions that « Import["!prog","format"] imports data from a pipe. » and « The format of a file is by default deduced from the file extension in its name, or by FileFormat from its contents. » As a result, it is necessary to specify "CSV", "JSON", etc. as the format parameter. You'll see some strange results otherwise.

curl is a command line tool for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos...), file transfer resume, proxy tunneling and a busload of other useful tricks.

From the curl and libcurl welcome page.


Mathematica does all of its internet connectivity through a user specified proxy server. If, as Sjoerd suggested, setting one up is too much work, you might want to consider writing the call in C/C++, and then calling that from Mathematica. I don't doubt there are plenty of C libraries that do what you want in a few lines of code.

For calling C code within Mathematica, see the C Language Interface documentation


Mathematica 9 has the new URLFetch function. It has the option UserAgent.


You can also use J/Link to make your web requests or call curl or wget on the command line.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号