RSS screen scraper

Developer https://www.devze.com 2022-12-21 12:36 Source: Internet
Can anyone point me towards a ready-made RSS screen scraper, preferably in Python, in order to get full-text RSS feeds?


There's a good list of them here; it mentions Feed Parser, which you use like this:

import feedparser

python_wiki_rss_url = ("http://www.python.org/cgi-bin/moinmoin/"
                       "RecentChanges?action=rss_rc")

# Download and parse the feed in one call.
feed = feedparser.parse(python_wiki_rss_url)

You can then do things like:

# Print the title of every entry in the feed.
for item in feed["items"]:
    print(item["title"])
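
Since the question is about full-text feeds, it can help to check which entries already carry the whole article and which only have a teaser. This is a minimal sketch, assuming entries shaped the way feedparser exposes them (an optional "content" list of dicts with a "value" key, plus "summary" and "link"). The helper name entry_body and the 500-character threshold are illustrative assumptions, not part of feedparser.

```python
def entry_body(entry, min_full_length=500):
    """Return (text, is_full) for one feed entry.

    `entry` is any mapping with optional "content" and "summary" keys,
    which is how feedparser exposes entries. The length threshold is an
    arbitrary heuristic (an assumption), not a feedparser feature.
    """
    # Prefer the full "content" element when the feed provides one.
    for part in entry.get("content", []):
        value = part.get("value", "")
        if value:
            return value, True
    # Otherwise fall back to the summary, which is often a short teaser.
    summary = entry.get("summary", "")
    return summary, len(summary) >= min_full_length


# Plain dicts stand in for parsed entries here.
teaser = {"title": "A", "link": "http://example.com/a", "summary": "Short teaser"}
full = {"title": "B", "content": [{"value": "<p>Whole article</p>"}]}

# Entries that are not full text are the ones whose links need scraping.
needs_scraping = [e["title"] for e in (teaser, full) if not entry_body(e)[1]]
print(needs_scraping)
```

With a real feed you would pass each item from feed["items"] to the helper instead of the stand-in dicts.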


feedparser.org is great


Sorry, but one doesn't exist in Python, though they do in PHP. You are more than welcome to use and improve the one I made, named scraped. It does not handle all sites: it is a recipe-based system that currently only handles the NYT, WSJ and the Economist. I am working on an all-inclusive algorithm, but it's a major undertaking that involves a ton of analysis of the different types of HTML and XML. Even the three sites mentioned above need vastly different scraping algorithms, WSJ being the most complex by far. They clutter their HTML with so much useless crap, mainly just to stop you.
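
To illustrate what a recipe-based approach can look like, here is a rough sketch using only the standard library. It is not the code from scraped: the RECIPES table, the hostnames and the tag/class selectors are all made up for illustration, and real sites need far more involved rules.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical recipes: hostname -> (tag, class) of the article container.
# These selectors are invented for illustration; real sites differ and change.
RECIPES = {
    "news.example.com": ("div", "article-body"),
    "paper.example.org": ("section", "story"),
}


class ContainerText(HTMLParser):
    """Collect the text inside the first element matching tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0      # nesting depth of `tag` inside the container
        self.done = False   # set once the container has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != self.tag:
            return
        if self.depth:
            # Same tag nested inside the container: track it so the
            # matching end tag does not close the container early.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_article(url, html):
    """Pick the recipe for the URL's host and apply it to the page HTML."""
    recipe = RECIPES.get(urlparse(url).hostname)
    if recipe is None:
        raise LookupError("no recipe for " + url)
    parser = ContainerText(*recipe)
    parser.feed(html)
    return " ".join(parser.chunks)


page = ('<html><body><div class="nav">Menu</div>'
        '<div class="article-body main"><p>First.</p><div>Second.</div></div>'
        '<div class="footer">Ads</div></body></html>')
print(extract_article("http://news.example.com/story/1", page))
```

Keeping the per-site rules in a table means supporting a new site is a matter of adding an entry, which is why sites with unusual markup each end up with their own recipe.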

Here is the program I was talking about. It requires lxml, and the readme explains everything. It reads the config files, parses partial RSS feeds, takes the links and scrapes them, and in the end produces an RSS 2.0 XML file, which I mainly convert into an ebook for my Kindle. I use lxml, BeautifulSoup and feedparser.
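
The last step of that pipeline, writing the scraped articles back out as RSS 2.0, could be sketched roughly as follows. This uses the standard library's xml.etree.ElementTree rather than lxml, and the build_rss name and the shape of the item dicts ("title", "link", "body") are assumptions made for the example, not taken from the program.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Serialize scraped articles as an RSS 2.0 document (returns a str).

    `items` is an iterable of dicts with "title", "link" and "body" keys;
    that shape is an assumption made for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for article in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
        # The full scraped text goes where the teaser was; ElementTree
        # escapes the embedded markup when serializing.
        ET.SubElement(item, "description").text = article["body"]
    return ET.tostring(rss, encoding="unicode")


scraped = [
    {"title": "Example story", "link": "http://example.com/1",
     "body": "<p>Full text & more</p>"},
]
xml_text = build_rss("Full-text feed", "http://example.com/",
                     "Rebuilt from a partial feed", scraped)
print(xml_text)
```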

http://tinyurl.com/yh3s9pa

You can also look at the calibre project, which takes a similar recipe-based approach.
