
Generic Article Extraction from web pages

开发者 (devze.com) https://www.devze.com 2023-01-24 07:06 Source: web

I am about to start working on article extraction.

The task is to extract the hotel reviews that are posted on different web pages (e.g. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html )

I need to do this in Java, and I have only been working with Java for the past couple of months.

Here are my questions:

  1. Is it possible to extract only the reviews from different web pages in a generic way?

  2. Are there any APIs that support this task in Java?

  3. Any thoughts or sources that would help me accomplish the task above are also welcome.

UPDATE

If any related examples are available online, please post them, since they would be of great use.


You probably need an HTML parsing (screen scraping) library for Java such as TagSoup or NekoHTML. jsoup is also popular.
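As for doing it "in a generic way": one common idea is to keep blocks that contain a lot of plain text and few links, since navigation and ads tend to be link-heavy while reviews are not. Below is a minimal sketch of that idea using only the JDK. It is not how TagSoup, NekoHTML or jsoup work; the class `ReviewBlockSketch`, the method `candidateBlocks`, and both thresholds are made up for illustration, and the regex-based tag handling is a stand-in for a real HTML parser, which you would want on real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: a crude text-density filter.
class ReviewBlockSketch {

    // Block-level tags used as split points (an assumption; real pages vary).
    private static final Pattern BLOCK_BREAK =
            Pattern.compile("(?i)</?(div|p|li|td|br|h[1-6])\\b[^>]*>");
    // Anchor elements, so the text inside links can be measured.
    private static final Pattern LINK =
            Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    // Removes all tags and collapses runs of whitespace.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Keeps blocks with at least minChars of text whose share of
    // link text does not exceed maxLinkRatio.
    static List<String> candidateBlocks(String html, int minChars, double maxLinkRatio) {
        List<String> result = new ArrayList<>();
        for (String chunk : BLOCK_BREAK.split(html)) {
            String text = stripTags(chunk);
            if (text.length() < minChars) {
                continue; // too short to be a review
            }
            int linkChars = 0;
            Matcher m = LINK.matcher(chunk);
            while (m.find()) {
                linkChars += stripTags(m.group(1)).length();
            }
            if ((double) linkChars / text.length() <= maxLinkRatio) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from any real site.
        String html = "<div><a href=\"/\">Home</a> <a href=\"/hotels\">Hotels</a></div>"
                + "<p>The room was clean and the staff were friendly and helpful.</p>"
                + "<p>OK</p>";
        for (String block : candidateBlocks(html, 20, 0.3)) {
            System.out.println(block);
        }
    }
}
```

On the sample markup this keeps only the review sentence and drops the link-only block and the two-character paragraph. The thresholds (20 characters, 30% link text) are guesses and would need tuning per site, so treat this as a starting point rather than a truly generic extractor.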

However, there is also a bigger legal consideration when extracting data from a third-party website like TripAdvisor. Does their policy allow it?
