
How to scrape images from a web site with JavaScript and servlets

开发者 https://www.devze.com 2022-12-18 16:56 Source: web

I have a web page that has the following content (I've changed the URL in the src tag for privacy purposes, otherwise viewing the page source is identical):

<HTML>
<BODY>

<script type="text/javascript" src="http://localhost/servlet?publicKey=abcdefg12345678&amp"></script>

</BODY>
</HTML>

The resulting page displays an image when viewed in a browser, and I'm trying to scrape that image. After I scrape the images, I want to index them (see www.tineye.com for an example of an image search engine) and store them. If anybody knows how to scrape images from web sites like this, please let me know.

Note: the src does not contain ANY information about the image. It only calls the given servlet with a public key as the parameter. What I've posted above is EXACTLY what I see when I click View->Page Source in my browser (Firefox). I've changed the actual URL and the public key for privacy reasons; otherwise everything is identical.

I've seen similar techniques used for some banners: http://coldjava.hypermart.net/servlets/banner.htm


The JavaScript returned by the servlet is probably manipulating the DOM and adding an image. If so, the path to the image (.jpg, .png or .gif) should appear somewhere inside the JavaScript file, and might look something like this:

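var image = new Image();          // the Image constructor takes (width, height), not a path
image.src = "/path/to/image.jpg"; // the path is assigned through the src property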

You can use a regular expression to extract the path and filename from the JavaScript code.
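As a rough sketch of that idea, the function below scans JavaScript source text for quoted strings that end in a common image extension. The name `extractImagePaths`, the pattern, and the sample script text are all made up for illustration. The real script returned by the servlet may build its URL differently (for example by concatenating strings), in which case a pattern like this would miss it.

```javascript
// Hypothetical helper: pull image paths out of JavaScript source text.
// The pattern matches quoted strings ending in .jpg, .jpeg, .png or .gif,
// optionally followed by a query string.
function extractImagePaths(scriptSource) {
  const pattern = /["']([^"'\s]+\.(?:jpe?g|png|gif)(?:\?[^"'\s]*)?)["']/gi;
  const paths = [];
  for (const match of scriptSource.matchAll(pattern)) {
    paths.push(match[1]); // capture group 1 is the path without the quotes
  }
  // Drop duplicates while preserving first-seen order.
  return [...new Set(paths)];
}

// Sample script text, invented for illustration only.
const sample = `
  var image = new Image();
  image.src = "/path/to/image.jpg";
  document.write('<img src="/banners/ad.png?id=7">');
`;

// Logs the two paths found: /path/to/image.jpg and /banners/ad.png?id=7
console.log(extractImagePaths(sample));
```

A regular expression is only a heuristic here. If it finds nothing, reading the script by hand is the more reliable way to see how the image URL is produced.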


Instead of saving a local copy of the HTML file, you should save a local copy of the JavaScript file to see how exactly it's adding the image to the HTML file's DOM. That should let you figure out how to construct requests to get the images you need.
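One way to approach this programmatically is sketched below. It assumes a runtime with a global `fetch` (a modern browser or Node.js 18+). The helper names `fetchScriptSource`, `resolveImageUrl`, and `fetchImageBytes` are hypothetical, the script URL is the placeholder from the question, and the image path is invented. Paths found in the script are often relative, so they are resolved against the URL the script was loaded from before being requested.

```javascript
// Download the JavaScript that the servlet returns so it can be inspected.
// Assumes a global fetch is available (modern browsers, Node.js 18+).
async function fetchScriptSource(scriptUrl) {
  const response = await fetch(scriptUrl);
  if (!response.ok) {
    throw new Error(`Script request failed with status ${response.status}`);
  }
  return response.text();
}

// Resolve a path found in the script against the script's own URL.
// Absolute URLs pass through unchanged; relative ones inherit the origin.
function resolveImageUrl(scriptUrl, imagePath) {
  return new URL(imagePath, scriptUrl).href;
}

// Download the image itself as raw bytes, ready to be stored or indexed.
async function fetchImageBytes(imageUrl) {
  const response = await fetch(imageUrl);
  if (!response.ok) {
    throw new Error(`Image request failed with status ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}

// Placeholder URL from the question; the image path is an invented example.
const scriptUrl = "http://localhost/servlet?publicKey=abcdefg12345678";
console.log(resolveImageUrl(scriptUrl, "/path/to/image.jpg"));
// → http://localhost/path/to/image.jpg
```

If the servlet checks cookies or the Referer header before serving the image, the requests would need to send matching headers. Whether it does so is not known from the question.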
