I'm trying to fetch this page (it's in Chinese, sorry for that):
amazon(dot)cn/s?rh=n:663227051
using the following code:
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
public class Application {
    public static void main(String[] args) throws IOException, InterruptedException {
        final URL url = new URL("http://www.amazon.cn/s?rh=n:663227051");
        final String agentString = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)";
        URLConnection urlConnection = url.openConnection();
        urlConnection.setRequestProperty("User-Agent", agentString);
        InputStreamReader streamReader = new InputStreamReader(urlConnection.getInputStream());
        BufferedReader reader = new BufferedReader(streamReader);
        final String path = "d:\\desktop\\Test.html";
        FileWriter writer = new FileWriter(path);
        writer.write("");
        String line;
        while ((line = reader.readLine()) != null)
            writer.append(line).append(System.getProperty("line.separator"));
        writer.close();
    }
}
But after running this code several times, I found that I randomly get one of two different results (see the screenshots here: http://www.flickr.com/photos/31629891@N07/4173636464/).
When I refresh the page in a browser, however, it returns the same result no matter how many times I try.
Why does this happen?
Amazon goes to a lot of effort to tailor the search results to what the (potential) customer is likely to want to buy. All sorts of things happen that (to the outside observer) are not exactly predictable / explicable. I could say more ... but I think I'm still under an NDA.
In short, I'm not surprised that your application is seeing different results all the time.
EDIT: By the way, if you are screen-scraping the Amazon site for some reason, you should pay attention to the following excerpt from the "Conditions of Use" page:
Amazon grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of Amazon. This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools. This site or any portion of this site may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of Amazon. You may not frame or utilize framing techniques to enclose any trademark, logo, or other proprietary information (including images, text, page layout, or form) of Amazon without express written consent. You may not use any meta tags or any other "hidden text" utilizing Amazon's name or trademarks without the express written consent of Amazon. Any unauthorized use terminates the permission or license granted by Amazon. You are granted a limited, revocable, and nonexclusive right to create a hyperlink to the home page of Amazon.com so long as the link does not portray Amazon, or its products or services in a false, misleading, derogatory, or otherwise offensive matter. You may not use any Amazon logo or other proprietary graphic or trademark as part of the link without express written permission.
In short, GET PERMISSION.
Seems to me like it is an Amazon issue. Maybe you should ask them about this.
You should examine the traffic being sent from your program and compare it to what the browser sends. Use Fiddler to capture the browser transaction and Wireshark to capture your program's transaction (or use Wireshark for both). You will probably find that there's a subtle difference that's causing the server to return different results, possibly having to do with cookies.
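Once both captures are in hand, the comparison itself can be done mechanically. The following is only a minimal sketch of that idea: the class name HeaderDiff and its diff method are hypothetical, and it assumes you have copied the request headers from each capture into a map by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not from any library): reports request headers that
// differ between a browser capture and a capture of your own program.
final class HeaderDiff {

    // HTTP header names are case-insensitive, so normalize before comparing.
    private static Map<String, String> normalize(Map<String, String> headers) {
        Map<String, String> normalized = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        normalized.putAll(headers);
        return normalized;
    }

    // Returns one line per header that is missing on one side or has a different value.
    static List<String> diff(Map<String, String> browser, Map<String, String> program) {
        Map<String, String> b = normalize(browser);
        Map<String, String> p = normalize(program);

        // Union of all header names seen in either capture, sorted ignoring case.
        TreeSet<String> names = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        names.addAll(b.keySet());
        names.addAll(p.keySet());

        List<String> report = new ArrayList<>();
        for (String name : names) {
            String browserValue = b.get(name);
            String programValue = p.get(name);
            if (browserValue == null) {
                report.add(name + ": only sent by program");
            } else if (programValue == null) {
                report.add(name + ": only sent by browser");
            } else if (!browserValue.equals(programValue)) {
                report.add(name + ": values differ");
            }
        }
        return report;
    }
}
```

A Cookie or Accept-Language header that shows up as "only sent by browser" would be the first thing to try adding to your program's request.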
You can probably get rid of some of this variability by adding an HTTP "Cache-Control: no-cache" header to your request (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). Otherwise your request may be satisfied by any of a number of intermediate HTTP caches along the route to Amazon's "origin web server", and each of these caches may hold a different version of the page, depending on how long Amazon allows copies of the page to be cached. A web site scales much better if it sacrifices a bit of consistency for content that doesn't absolutely have to be up to date.
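With the URLConnection from the question, adding that header might look like the sketch below. The class name NoCacheRequest and its open method are hypothetical. The extra Pragma header and the setUseCaches(false) call are my additions for older HTTP/1.0 caches and for any local cache, not something the question's code needs in order to run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper: prepares (but does not yet send) a request that asks
// caches not to answer from a stored copy.
final class NoCacheRequest {

    static URLConnection open(String address, String userAgent) {
        try {
            // openConnection() only creates the connection object; nothing is
            // sent until getInputStream() or connect() is called.
            URLConnection connection = new URL(address).openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            // Ask caches to revalidate with the origin server before replying.
            connection.setRequestProperty("Cache-Control", "no-cache");
            // HTTP/1.0 caches understand Pragma rather than Cache-Control.
            connection.setRequestProperty("Pragma", "no-cache");
            // Also bypass any local cache used by URLConnection itself.
            connection.setUseCaches(false);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned connection can then be read exactly as in the question, via getInputStream(). This only addresses caches between you and Amazon; it cannot remove variation that Amazon's own servers introduce.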
The same sacrifice of consistency for scalability holds true once your request enters an Amazon data center. It can be load-balanced to any free web server, and that web server in general can draw on different sources for the components on the page. Perhaps the difference is that the pages got assembled from parts stored on two different clusters of memcached (distributed in-memory cache) machines.
And on top of this, as @Stephen C alludes to, you may be seeing personalization effects.