Scraping pages with asynchronous responses with Hpricot

Developer https://www.devze.com 2023-01-04 10:19 Source: web
I'm trying to scrape a page, but the initial response has nothing in the body because the content is loaded asynchronously, e.g. the results of a search on the Apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global

Any ideas on how I can successfully grab the search results with Hpricot?

Thanks.


When the search page you refer to loads, its JavaScript makes an Ajax request to another location and then populates the page with the search results. That is the content you see in the browser. Hpricot itself can't help you here, because it has no way to execute the JavaScript that comes with the page in order to fetch the actual list of search results.

If the search results are what you're after, you need to look at what happens when you open that page and submit a search query. JavaScript in the page takes your query and calls (via XMLHttpRequest or a similar Ajax technique) another script on Apple's server. That script is the one that actually runs the search against a database and returns the results.

I suggest installing Firefox with the Firebug plugin, or using some other tool that shows the actual requests a page and its JavaScript components send and receive. You'll see that the search page you referred to fetches two parts. First, the "featured" results, which come from this URL:

http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk

Notice that the search string is in the "q" parameter.

Second, the long results list comes from here:

http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
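
As a rough sketch, the two request URLs above could be assembled for an arbitrary query with Ruby's standard library alone. The helper names (featured_url, results_url) are made up for illustration, and every parameter value other than q is simply copied from the observed requests; their meaning is an assumption, not something documented here.

```ruby
require "uri"

# Endpoints copied from the requests observed in the browser.
FEATURED_ENDPOINT = "http://www.apple.com/global/scripts/search_featured.php"
RESULTS_ENDPOINT  = "http://www.apple.com/search/service/nph-search10"

# Hypothetical helper: URL for the "featured" results of a query.
def featured_url(query)
  params = { "q" => query, "section" => "global", "geo" => "uk" }
  "#{FEATURED_ENDPOINT}?#{URI.encode_www_form(params)}"
end

# Hypothetical helper: URL for the long results list.
# snum is passed through as seen in the observed request (assumed to be
# a result count, but that is a guess).
def results_url(query, snum = 50)
  params = { "site" => "uk_www", "filter" => 1, "snum" => snum, "q" => query }
  "#{RESULTS_ENDPOINT}?#{URI.encode_www_form(params)}"
end

puts featured_url("mac mini")
puts results_url("mac mini")
```

URI.encode_www_form takes care of escaping the query, so spaces and special characters in the search string do not need to be handled by hand.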

Both of these URLs return XML documents; you might have better luck fetching them directly and parsing them with Hpricot.
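
To give an idea of what pulling results out of such an XML document might look like, here is a minimal sketch. It makes several assumptions: the element names (results, result, title, url) and the sample data are invented, since the real structure of the feed is not shown here, and it uses REXML from Ruby's standard distribution instead of Hpricot so that it runs without extra gems. With Hpricot, the parse would start from Hpricot.XML(xml) and the traversal would be similar.

```ruby
require "rexml/document"

# Invented sample standing in for the response body; the real feed's
# element names and contents will almost certainly differ.
SAMPLE_XML = <<~XML
  <results>
    <result>
      <title>First hit</title>
      <url>http://example.com/first</url>
    </result>
    <result>
      <title>Second hit</title>
      <url>http://example.com/second</url>
    </result>
  </results>
XML

# Hypothetical helper: turn each <result> element into a small hash.
def extract_results(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements("//result").map do |node|
    {
      title: node.elements["title"].text,
      url:   node.elements["url"].text
    }
  end
end

results = extract_results(SAMPLE_XML)
results.each { |r| puts "#{r[:title]} -> #{r[:url]}" }
```

In real use, the string passed to extract_results would be the body fetched from one of the URLs above, and the XPath and child element names would need to be adjusted after inspecting an actual response.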
