(Neophyte post, apologies and thanks up front!)
My goal is to build a small app that monitors and parses a set of blogs' posts for outbound links, so I can then:
- Display top linked-to articles among the blogs in one frame; and,
- For a given linked-to article, display the posts (in my blogosphere) that link to it.
So far my idea is to use:
- Python (with Django or some-such front end)
- Feedparser to read feeds and extract links from posts
- URLparse
The Big Question: am I missing anything obvious that would make this way easier?
Smaller question (that I can't figure out yet):
- Since outbound link URLs may differ even when pointing to the same article (NYT URLs and tinyURLs, for example), how can I check whether a URL is already in my list of linked items, beyond just comparing the absolute URL?
This SO post was helpful at a high level, but parsing 'blogroll'-style link lists seems a lot easier than actively comparing URLs within a post, particularly for news sites that may do all sorts of funny things in their URLs.
I would go for the same setup. You'll probably need lxml to parse and manipulate the post content HTML (to extract the <a> tags).
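As a rough illustration of the lxml step, here is a minimal sketch that pulls the links out of a post's HTML. The function name extract_outbound_links, the sample HTML and the example URLs are made up for illustration. Treating "outbound" as "on a different host than the post" is an assumption, not something the question specifies.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

import lxml.html


def extract_outbound_links(post_html, post_url):
    """Return absolute http(s) URLs linked from a post that point off its host.

    Hypothetical helper: "outbound" is assumed to mean "different host
    than the post itself".
    """
    doc = lxml.html.fromstring(post_html)
    post_host = urlsplit(post_url).netloc.lower()
    links = []
    # ".//a/@href" yields the href attribute of every <a> below the root.
    for href in doc.xpath(".//a/@href"):
        # Resolve relative links against the post URL, then drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(post_url, href.strip()))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parts.netloc.lower() == post_host:
            continue  # same host as the post, so not outbound
        links.append(absolute)
    return links


# Made-up sample post content and URL.
sample = (
    '<p>See <a href="http://example.com/story#top">this</a> '
    'and <a href="/about">about</a>.</p>'
)
print(extract_outbound_links(sample, "http://blog.example.org/post/1"))
```

With feedparser, the post_html argument would presumably be the entry's content or summary, and post_url the entry's link. This only extracts links; it does not solve the shortened-URL problem from the question.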