This is a creative one :-)
I'll be receiving a list of hundreds of new URLs regularly and want to find out if they are linkin开发者_如何学运维g to a blog or not - between 80% and 95% accuracy would be sufficient.
Obviously I need to analyze the HTML of the page - but how exactly would you approach this (e.g. meta tags, structural analysis, pattern matching, machine learning ...)?
I would look at the generator <meta>
tag for known blog editors. For example here's how it looks for Wordpress:
<meta name="generator" content="WordPress.com" />
Building on Darin's solution, I would look for the generator <meta>
tag for known blog editors and combine it with a lookup table of common sites, ie. WordPress.com
, Blogspot.com
, Livejournal.com
, and so forth. That should give you 80-95% in the near term, though it won't be robust enough for an ongoing process over an extended period of time.
An extended solution is much harder, given the amorphous definition of the term "blog". In which case, you'll want to consider breaking the list down into its hosting site and defining characteristics and create hard and fast rules on what constitutes a blog:
- Is it hosted by a blogging service provider?
- Is it listed in a blog aggregator, such as Technorati?
- Does it include blog-like services, such as user-generated articles, tags, and the ability to comment?
- Does it provide meta information that I can use to easily identify it as a blog?
- Does it otherwise identify itself as a blog, via the inclusion of the term "blog" or some other criteria?
I can easily see a neural network constructed to determine if a page is a blog or not, but this serverely oversteps the bounds of your requirements. I'd say start simple, then extend your solution relative to the proposed lifetime of your system.
The above suggestions are good, and probably will work if you're aiming for 80-90% accuracy.
I would go one step further and look for any .xml RSS feed in either a meta tag, or as a link. Then check the feed to see if there are any comment tags (since there are feeds for other purposes too). I would OMIT this for certain blog platforms that don't give you a feed such as Tumblr.
精彩评论