
Parsing blog posts for common links

Developer (https://www.devze.com), 2023-04-04 08:13, source: the web

(Neophyte post, apologies and thanks up front!)

My goal is to build a small app that monitors and parses a set of blogs' posts for outbound links, so I can then:

  1. Display the top linked-to articles among the blogs in one frame; and
  2. For a given linked-to article, display the posts (in my blogosphere) that link to it.
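Both displays can be served from one inverted index that maps each linked-to URL to the set of posts linking to it. The following is only a minimal sketch under that assumption; the function names and the sample URLs are hypothetical and not part of the original question.

```python
from collections import defaultdict


def build_link_index(posts):
    """Map each linked-to URL to the set of post URLs that link to it."""
    index = defaultdict(set)
    for post_url, outbound_links in posts:
        for link in outbound_links:
            index[link].add(post_url)
    return index


def top_linked(index, n=10):
    """Return (url, number_of_linking_posts) pairs, most linked first."""
    # Sort by descending count, then by URL so ties come out in a stable order.
    ranked = sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(url, len(sources)) for url, sources in ranked[:n]]


# Hypothetical sample data: (post URL, outbound links found in that post).
posts = [
    ("http://blog-a.example/post1",
     ["http://news.example/story", "http://other.example/x"]),
    ("http://blog-b.example/post7",
     ["http://news.example/story"]),
]

index = build_link_index(posts)
print(top_linked(index, n=1))                      # goal 1: top linked-to articles
print(sorted(index["http://news.example/story"]))  # goal 2: posts linking to one article
```

Using a set per article means a post that links to the same article twice is counted once.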

So far my idea is to use:

- Python (with Django or some-such front end)

- Feedparser to read feeds and extract links from posts

- urlparse (urllib.parse in Python 3)

The Big Question: am I missing anything obvious that would make this way easier?

Smaller question (that I can't figure out yet):

- Since outbound link URLs may differ even when they point to the same article (NYT URLs and TinyURLs, for example), how can I check whether a URL is already in my list of linked items, beyond just comparing the absolute URL?
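One possible approach, sketched here with the standard library only, is to reduce every URL to a canonical form before comparing, and to follow redirects for shortened URLs. The specific rules below (dropping the fragment, a leading "www.", a trailing slash, and "utm_" query parameters) are assumptions chosen for illustration; news sites may need site-specific rules on top of them.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.request import urlopen

# Assumption: query parameters with these prefixes carry tracking data only.
TRACKING_PREFIXES = ("utm_",)


def canonicalize(url):
    """Return a normalized form of url suitable for equality comparison."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # heuristic: treat www.example.com like example.com
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"  # keep non-default ports
    # Drop tracking parameters and sort the rest so their order does not matter.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, query, ""))  # fragment removed


def resolve_redirects(url, timeout=10):
    """Follow HTTP redirects (e.g. from a URL shortener) and return the final URL.

    This performs a network request, so it is not called in the example below.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


a = canonicalize("HTTP://WWW.Example.com/Story/?utm_source=rss&id=5#comments")
b = canonicalize("http://example.com/Story?id=5")
print(a)
print(a == b)
```

In practice a shortened URL would first go through resolve_redirects and the result through canonicalize; caching the resolved URLs avoids repeating the network request for links that have already been seen.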

This SO post was helpful at a high level, but parsing 'blogroll'-style link lists seems a lot easier than actively comparing URLs within a post, particularly to news sites that may do all sorts of funny things in their URLs.


I would go for the same setup. You'll probably need lxml to parse and manipulate the HTML of the post content (to extract the <a> tags).
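To illustrate the extraction step, here is a minimal sketch that uses the standard library's html.parser as a stand-in, so that it runs without extra packages; it is not lxml itself. With lxml, the same href values could be obtained with lxml.html.fromstring(post_html).xpath('//a/@href'). The names LinkCollector and outbound_links and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_links(post_html, post_url):
    """Return absolute http(s) links in post_html that point to another host."""
    collector = LinkCollector()
    collector.feed(post_html)
    own_host = urlsplit(post_url).hostname
    links = []
    for href in collector.hrefs:
        absolute = urljoin(post_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname != own_host:
            links.append(absolute)
    return links


sample_html = (
    '<p>See <a href="http://news.example/story">this story</a> '
    'and our <a href="/about">about page</a>.</p>'
)
print(outbound_links(sample_html, "http://blog-a.example/post1"))
```

The post HTML itself would come from the feed entries that Feedparser returns; the relative "/about" link is dropped here because it resolves to the blog's own host.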
