Extracting useful data from arbitrary HTML pages?

Developer https://www.devze.com 2022-12-18 20:22 Source: web
Is there a library for Ruby or PHP that can parse HTML pages and extract the unique data by comparing each page with other similar pages? It should use some sort of text mining to identify which texts are more likely to be noise and repetitive, and which texts are more unique and useful.


I'm a PHP guy and have no idea about Ruby, but I think what you want is trivial to achieve:

  • Use something like Simple HTML DOM to parse the pages.
  • Compare the DOM elements of each page with the elements at the same position on the other pages.
  • Get the path of every element whose content differs between pages; those will be your signal elements.
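
The steps above can be sketched roughly as follows. This is only an illustration of the idea, not the answer's own code: it uses Python's standard `html.parser` instead of Simple HTML DOM, and the names `PathTextParser`, `extract_paths`, `signal_elements`, `PAGE_A` and `PAGE_B` are made up for the example. It assumes the pages share the same template and are reasonably well-formed, and it does not filter out `<script>` or `<style>` content.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextParser(HTMLParser):
    """Collect the text found at each element path, e.g. 'html[1]/body[1]/div[2]'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # path segments of the currently open elements
        self.counts = [{}]  # one sibling counter (tag -> count) per open level
        self.texts = {}     # path -> text found directly inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Only close the innermost element, and only if the tag name matches.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def extract_paths(html):
    """Parse one page and return {element path: text}."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def signal_elements(pages):
    """Return {path: [text on each page]} for paths whose text differs between pages."""
    per_page = [extract_paths(page) for page in pages]
    all_paths = set().union(*per_page)
    result = {}
    for path in sorted(all_paths):
        values = [texts.get(path) for texts in per_page]
        if len(set(values)) > 1:  # identical everywhere means template noise
            result[path] = values
    return result


# Two hypothetical pages built from the same template.
PAGE_A = ("<html><body><div>Site menu</div>"
          "<div><h1>Red shoes</h1><p>Price: 20</p></div>"
          "<div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div>Site menu</div>"
          "<div><h1>Blue hat</h1><p>Price: 35</p></div>"
          "<div>Copyright</div></body></html>")

for path, values in signal_elements([PAGE_A, PAGE_B]).items():
    print(path, values)
```

Text that is identical on every page (the menu and the footer here) is treated as noise, while paths whose text changes are reported as signal. Comparing by exact position is fragile when pages differ in structure, so a real implementation would probably need fuzzier matching.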