scraping blog contents

Developer https://www.devze.com 2023-01-04 05:50 Source: Internet

Related topic: python

After obtaining the URLs for various Blogspot, Tumblr and WordPress pages, I ran into some problems processing the HTML pages. I want to distinguish between the content, title and date of each blog post. I might be able to get the date through regex, but people use so many custom scripts now that the HTML classes and structure differ widely from blog to blog.

Does anyone have a solution that may help?

If at all feasible, use the blogs' RSS or Atom feeds instead -- they're well-structured XML, rather than not-so-well structured HTML, and Universal Feed Parser is enormously helpful at getting to feeds' contents in Python.
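To illustrate why a feed is easier to work with than raw HTML, here is a minimal sketch that pulls the title, date and content out of an Atom feed. It uses the standard library's xml.etree.ElementTree as a stand-in, not Universal Feed Parser, which copes with far more feed formats and with malformed feeds. The sample feed and the parse_atom_entries helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by every element in an Atom 1.0 feed.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feed, invented for illustration only.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <updated>2010-01-02T03:04:05Z</updated>
    <content type="html">&lt;p&gt;Hello, world&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <updated>2010-02-03T04:05:06Z</updated>
    <summary>Only a summary here</summary>
  </entry>
</feed>
"""


def parse_atom_entries(xml_text):
    """Return a list of dicts with 'title', 'date' and 'content' per entry."""
    root = ET.fromstring(xml_text)
    posts = []
    for entry in root.iter(ATOM + "entry"):
        # Prefer the full content; fall back to the summary if it is absent.
        body = entry.findtext(ATOM + "content")
        if body is None:
            body = entry.findtext(ATOM + "summary", default="")
        # <published> is optional in Atom, <updated> is required.
        date = entry.findtext(ATOM + "published") or entry.findtext(
            ATOM + "updated", default=""
        )
        posts.append(
            {
                "title": entry.findtext(ATOM + "title", default=""),
                "date": date,
                "content": body,
            }
        )
    return posts


if __name__ == "__main__":
    for post in parse_atom_entries(SAMPLE_ATOM):
        print(post["date"], post["title"])
```

This only handles well-formed Atom. RSS 2.0 uses different element names (item, pubDate, description), which is one reason a dedicated feed library is the more practical choice for a mixed set of blogs.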

If a blog lacks a feed (or the feed is really sparse) and you have to parse its HTML (sigh!), the best approach is BeautifulSoup (use the latest 3.0.*, not a 3.1 -- for why, see here). It's not the fastest, but it's the most resilient in the face of very badly formed HTML (and the same kind of blog that lacks a feed, I suspect, may be liable to have rotten HTML). lxml, the library @Hank recommends, does include a copy of BeautifulSoup I believe, but, if that's all you're going to get, why go to the bother of installing the whole when you only need a part?-)

Don't use regex. Use a parser. lxml is really fast.

Actually, if your sites publish Atom or RSS feeds, parse those instead; they have a well-defined structure that makes it easy to get the data you're after.

UPDATE:

Often you can find a <link> to the feed in the HTML of the blog post. Look for something resembling the following (the exact value of type will vary depending on Atom vs. RSS, etc.):

<link rel="alternate" type="application/atom+xml" title="My Weblog feed" href="/feed/" />

in the <head> of the document. If you find a feed, use the Universal Feed Parser, as @Alex Martelli recommends.
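A rough sketch of that feed-discovery step, using only the standard library's html.parser: it collects the href of every <link rel="alternate"> whose type looks like Atom or RSS. The FeedLinkFinder class, the find_feed_urls helper and the example URL are hypothetical names for illustration, and the set of accepted type values is an assumption that you may need to extend.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of MIME types that indicate a feed; extend as needed.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also end up here.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs like "/feed/" against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feed_urls(html_text, base_url):
    """Return the absolute feed URLs found in an HTML page, in document order."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html_text)
    return finder.feeds


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<link rel="alternate" type="application/atom+xml" '
        'title="My Weblog feed" href="/feed/" />'
        "</head><body></body></html>"
    )
    print(find_feed_urls(page, "http://blog.example.com/2010/01/post.html"))
```

Once you have a feed URL, you can hand it to a feed parser instead of scraping the post's HTML.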

Oh, and you may want to watch this PyCon video.

I think you should change your approach. Instead of parsing the HTML page, why not parse the RSS feed? WordPress has this built in, and the feed already contains the info you need, such as titles, authors, dates, etc.

You can still use regex for parsing the RSS feeds, or you can use existing Python modules such as Universal Feed Parser.