Is there a library for Ruby or PHP that can parse HTML pages and extract the unique data by comparing each page with other similar pages? It should use some sort of text mining to identify which texts are more likely to be noise and repetitive, and which are more unique and useful.
I'm a PHP guy and have no idea about Ruby, but I think what you want is trivial to achieve:
- Use something like Simple HTML DOM to parse the pages.
- For each page, compare all the DOM elements with the elements at the same position in the other pages.
- Get the paths of all elements whose content differs between pages; those will be your signal elements.
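
The steps above could be sketched roughly as follows. This is only an illustration of the idea, written in Python with the standard-library `html.parser` instead of Simple HTML DOM, so it runs without extra packages; the same approach should carry over to PHP or Ruby with whatever DOM parser you prefer. The names `PathTextCollector`, `path_texts` and `signal_paths` are made up for this example, and it assumes the pages are reasonably well-formed and share the same layout, so that the same path points at the same kind of element on every page.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Map each element path (e.g. 'html[1]/body[1]/p[2]') to its own text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # path segments of the currently open elements
        self.counters = [{}]  # per-depth sibling counters, keyed by tag name
        self.texts = {}       # path -> text found directly inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append(f"{tag}[{counts[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        # Assumption: the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def path_texts(html):
    """Parse one page and return its path -> text mapping."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def signal_paths(pages):
    """Return the paths whose text is not identical on every page."""
    maps = [path_texts(page) for page in pages]
    all_paths = set().union(*maps)
    return sorted(p for p in all_paths if len({m.get(p) for m in maps}) > 1)


page_a = "<html><body><div>Site menu</div><p>Article about cats</p></body></html>"
page_b = "<html><body><div>Site menu</div><p>Article about dogs</p></body></html>"
print(signal_paths([page_a, page_b]))  # ['html[1]/body[1]/p[1]']
```

Here the menu `div` is identical on both pages, so it is treated as noise, while the `p` differs and is reported as signal. Real pages will need something more forgiving than an exact string comparison (dates, counters and ads change on every page too), but the path-by-path comparison is the core of it.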