I'm pretty ignorant of what appears in the HTML/JavaScript of a website because I spend most of my time on the back-end (phrasing!). Basically, I want to know the best way to take a company's URL, e.g. PETA's, and from that URL parse out descriptive words about the company from their front-page HTML. This way you can jump-start an auto-tagging categorization website with just a list of company URLs.
If this is reasonable, any recommendations for tools/processes to find/mine the content would be much welcomed.
And if not, or if you have a better idea for getting the tags, let it be known as well!
Mike Swift is quite right -- if you're looking for categorization only, then all you need to do is parse out the DMOZ categorizations. The Amazon service uses DMOZ to get the categories anyway, and DMOZ is free (unlike AWIS). For example, parse out this link to get the categories for PETA.
If you're looking for parsing tools, I've quite enjoyed Nokogiri, but any web-parsing tool like BeautifulSoup works. I would parse it with something like:
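require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('<site>'))  # fetch and parse the category page
doc.css('ol.dir li a').map {|item| [item.content]}  # text of each category link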
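If you'd rather pull words straight from the company's own front page, as the question describes, something along these lines might be a starting point. It's only a rough sketch: it assumes the page has a <title> and keywords/description <meta> tags, it uses plain regexes so it runs without any gems (a real parser like Nokogiri would be more robust), and the names extract_tags and STOPWORDS are made up for illustration.

```ruby
# Rough sketch: collect candidate tags from a page's <title> and <meta> tags.
# Regex-based and stdlib-only; `extract_tags` and STOPWORDS are hypothetical.
STOPWORDS = %w[a an and are for in is of on or the to with].freeze

def extract_tags(html)
  texts = []
  # Page title, if present.
  title = html[/<title[^>]*>(.*?)<\/title>/im, 1]
  texts << title if title
  # content="..." of <meta name="keywords"> and <meta name="description">.
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    name = tag[/\bname\s*=\s*["']([^"']*)["']/i, 1]
    next unless name && %w[keywords description].include?(name.downcase)
    content = tag[/\bcontent\s*=\s*["']([^"']*)["']/i, 1]
    texts << content if content
  end
  # Lowercase, split into words, drop stopwords and very short tokens.
  words = texts.join(' ').downcase.scan(/[a-z][a-z'-]+/)
  words.reject { |w| STOPWORDS.include?(w) || w.length < 3 }.uniq
end

# Made-up sample markup, not PETA's actual front page.
sample = <<~HTML
  <html><head>
    <title>PETA: Animal Rights</title>
    <meta name="keywords" content="animals, rights, vegan">
    <meta name="description" content="The animal rights organization">
  </head><body></body></html>
HTML

p extract_tags(sample)
```

In practice you'd feed it the fetched front-page HTML instead of the sample string. Plenty of sites leave the keywords tag empty, so treat the result as a list of candidates rather than finished categories.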
Hope that helps!
Why not just use the Alexa Web Information Service (AWIS) API? It's easy to use, and you can get the keywords as well as a lot of other useful information about the link. (Plus it's part of AWS, which means good speed and reliability.)
General Info & Signup:
http://aws.amazon.com/awis/
Docs:
http://docs.amazonwebservices.com/AlexaWebInfoService/latest/
Code Samples:
http://aws.amazon.com/code?_encoding=UTF8&jiveRedirect=1