I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could write a parser to filter out that sort of extraneous material for that site, but we are hoping to work on arbitrary sites that we may never have encountered before.
I feel like it's a bit much to hope for, so I won't be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages with others from the same site, but it's imperfect and leaves comments and such.
I am working in Java, but would welcome anything open source in any language that I can use for ideas.
I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.
I stumbled across a Java library to do exactly this. Performance, in my simple tests, is similar to Readability.
http://code.google.com/p/boilerpipe/
You could try an unofficial API of arc90's Readability.
Basically, Readability extracts the content of a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds the content on a webpage are gone.
I'm also a bit late to this conversation, but ...
The Java Boilerpipe extractors are probably what you want (probably ArticleSentencesExtractor), although there is at least one port of arc90's Readability to Java on GitHub.
If you want to build a poor man's boilerpipe, you might try diffing two pages from the same site (assuming they use the same template, you will likely get an interesting result).
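As a rough illustration of the diffing idea, here is a minimal sketch that keeps only the lines of one page that do not also appear in a sibling page from the same site. The class name `TemplateDiff` and the method `uniqueLines` are made up for this example, and it assumes the shared template produces identical lines in both pages, which real sites won't always do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of boilerpipe or Readability.
class TemplateDiff {

    /** Returns the trimmed, non-empty lines of {@code page} that do not occur in {@code sibling}. */
    static List<String> uniqueLines(String page, String sibling) {
        // Collect every line of the sibling page; these are treated as template candidates.
        Set<String> shared = new HashSet<>();
        for (String line : sibling.split("\\R")) {
            shared.add(line.trim());
        }

        // Keep only the lines of the target page that the sibling does not contain.
        List<String> kept = new ArrayList<>();
        for (String line : page.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !shared.contains(trimmed)) {
                kept.add(trimmed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two toy pages that share a nav bar and a footer.
        String pageA = String.join("\n",
                "<div id=\"nav\">Home | About</div>",
                "<p>First article body.</p>",
                "<div id=\"footer\">Copyright</div>");
        String pageB = String.join("\n",
                "<div id=\"nav\">Home | About</div>",
                "<p>Second article body.</p>",
                "<div id=\"footer\">Copyright</div>");

        System.out.println(uniqueLines(pageA, pageB));
    }
}
```

Anything that differs between the two pages survives, so per-page comments, timestamps, and "related articles" blocks will still leak through. The output is also still raw HTML, so you would need to strip tags afterwards.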
The main difference between boilerpipe, Readability, and a diff-based hack is that boilerpipe will strip out all HTML but preserve some structure.
I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.
There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.